Thesis Defense: Towards Object-based SLAM

Speaker

Yihao Zhang

MIT MechE

Host

John J. Leonard

MIT MechE

Abstract:
Simultaneous localization and mapping (SLAM) is a fundamental capability that allows a robot to perceive its surrounding environment. Over more than two decades, the field has progressed from the original sparse, landmark-based SLAM to dense SLAM, and there is now a demand for semantic understanding of the environment that goes beyond purely geometric understanding. This thesis makes several contributions toward realizing object-based SLAM, in which the map consists of a set of objects whose semantic categories are recognized and whose poses and shapes are estimated. Such a map provides vital object-level semantic and geometric perception for applications such as augmented reality (AR), mixed reality (MR), mobile manipulation, and autonomous driving.

To perform object-based SLAM, sensor measurements must pass through several processing stages. First, objects are semantically segmented in the sensor measurements, typically by a neural network. Robots often must bootstrap from an initial labeled dataset and then adapt to new environments where labeled data are unavailable. Semi-supervised learning is therefore important, because it lets the robot improve its performance using the unlabeled data it collects itself. Second, once objects are segmented, the measurements of each object taken from different viewpoints must be associated with one another for downstream processing. Finally, the robot must be able to extract each object's pose and shape from the measurements without access to detailed CAD models of the objects. This thesis studies these three aspects of object-based SLAM: semi-supervised learning of semantic segmentation in a robotics context, data association for object-based SLAM, and category-level object pose and shape estimation.

For category-level object pose and shape estimation, we developed ShapeICP (ICP: iterative closest point), an algorithm that does not use pose-annotated data and represents object shape as a mesh. For data association, we developed DAF-SLAM (DAF: data association free), which estimates associations in the back-end instead of relying on sensor-dependent front-end methods. For semi-supervised learning, we applied temporal semantic consistency, inspired by the photometric consistency used in traditional SLAM methods. Each contribution is evaluated on experimental datasets and demonstrates improvements over previous techniques.

Committee Members:
John J. Leonard (Advisor), Department of Mechanical Engineering
Faez Ahmed, Department of Mechanical Engineering
Nicholas Roy, Department of Aeronautics and Astronautics

Date

May 06 2024, 10:00 to 11:30 AM (America/New_York)

Location

32-G882 (https://mit.zoom.us/j/92202523862)

Organizer & Contact

Yihao Zhang

yihaozh@mit.edu
