Image Segmentation using Machine Learning
For this example, we will demonstrate image segmentation using machine learning with MoveIt Pro. ML-based image segmentation plays a crucial role in robotics by enabling precise perception of the environment. In object pose estimation, segmentation separates objects of interest from the background, which can then be leveraged to estimate each object's 3D position and orientation. This pose estimate enables autonomous robotic grasping and manipulation in unstructured scenes.
Setup
MoveIt Pro offers several Behaviors for image segmentation using ML models.
The GetMasks2DFromPointQuery Behavior segments images using point prompts. Point prompts are user-defined spatial cues that guide segmentation by indicating object locations. These prompts help refine masks, especially in ambiguous or complex scenes.
The GetMasks2DFromTextQuery Behavior segments images with text prompts. Text prompts allow segmentation based on natural language descriptions, enabling flexible and intuitive object identification without manual annotations. These prompts produce coarse masks from general object descriptions.
Image inputs are given as a sensor_msgs/msg/Image message, and mask outputs are returned as a moveit_studio_vision_msgs/msg/Mask2D message, which is defined in the MoveIt Pro SDK.
The unique strengths of each segmentation Behavior make them suitable for different applications.
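As a rough illustration of consuming these message types in your own Python code, the sketch below converts ROS image messages to NumPy arrays with cv_bridge and extracts the pixels covered by a binary mask. The Mask2D field layout assumed here (a pixel-mask image plus x/y pixel offsets) is an assumption for illustration; check the message definition shipped with your MoveIt Pro SDK.

```python
# Minimal sketch: extracting the RGB pixels covered by a 2D mask.
# The mask image and x/y offsets are assumed to come from a Mask2D
# message; verify the actual field names in your MoveIt Pro SDK.
import numpy as np
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

bridge = CvBridge()

def masked_pixels(rgb_msg: Image, mask_msg: Image, x: int = 0, y: int = 0) -> np.ndarray:
    """Return the RGB pixels covered by a (possibly cropped) binary mask."""
    rgb = bridge.imgmsg_to_cv2(rgb_msg, desired_encoding="rgb8")
    mask = bridge.imgmsg_to_cv2(mask_msg, desired_encoding="mono8") > 0
    # Paste the cropped mask into a full-resolution boolean image at its offset.
    full = np.zeros(rgb.shape[:2], dtype=bool)
    full[y:y + mask.shape[0], x:x + mask.shape[1]] = mask
    return rgb[full]  # shape (N, 3): one row per masked pixel
```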
Launch MoveIt Pro
We assume you have already installed MoveIt Pro to the default install location. Launch the application using:
moveit_pro run -c lab_sim
Performing 2D Image Segmentation
Once you have your robot config running, you can create a simple Objective in MoveIt Pro that moves to a predefined location and performs segmentation, using the following Behaviors:
1. MoveToWaypoint (or equivalent) to move to the predefined location.
2. GetImage to get the latest RGB image message from a camera stream (a standalone Python analogue is sketched after this list).
3. GetPointsFromUser to interactively get a point prompt.
4. GetMasks2DFromTextQuery to segment the image.
5. PublishMask2D to visualize the masks.
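For intuition, GetImage amounts to a one-shot subscription to a camera topic. A minimal standalone rclpy analogue is sketched below; the topic name is an assumption, so substitute the camera stream from your robot config.

```python
# Standalone sketch of grabbing one image, analogous to what GetImage
# provides inside an Objective. The topic name is an assumption.
import rclpy
from sensor_msgs.msg import Image

def grab_one_image(topic: str = "/camera/color/image_raw") -> Image:
    """Block until a single Image message arrives on the given topic."""
    rclpy.init()
    node = rclpy.create_node("image_grabber")
    received = []
    node.create_subscription(Image, topic, received.append, 1)
    while not received:
        rclpy.spin_once(node, timeout_sec=0.1)
    node.destroy_node()
    rclpy.shutdown()
    return received[0]
```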
To run an example, execute the Segment Image from Text Prompt Objective. Configure the views to display the annotated image topic, and you will see the segmentation results in the UI.
Note that the GetMasks2DFromTextQuery Behavior has additional options to filter detections, which may require tuning for your specific application. Feel free to change the detection options to see how the results are affected. The Behavior uses the provided positive and negative text prompts to predict the probability of a mask, and the probability threshold can be customized for different segmentation results.
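Conceptually, this filtering keeps only the masks whose predicted probability clears the threshold. The sketch below is a hypothetical stand-in for what happens inside the Behavior, not its actual API; in practice you set the threshold through the Behavior's detection options.

```python
# Hypothetical illustration of probability-threshold filtering; the real
# logic lives inside GetMasks2DFromTextQuery and is configured via its
# detection options rather than called like this.
def filter_masks(masks: list, scores: list[float], threshold: float = 0.5) -> list:
    """Keep masks whose predicted probability meets the threshold."""
    return [mask for mask, score in zip(masks, scores) if score >= threshold]
```

Raising the threshold trades recall for precision: you get fewer, more confident masks.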
For optimal results, masks from GetMasks2DFromTextQuery can be refined by extracting their center points with GetCenterFromMask2D and feeding them into GetMasks2DFromPointQuery. Masks from the point prompt Behavior will be higher quality than those from the text prompt Behavior: text prompts give a coarse estimate of the mask, while point prompts yield nearly exact masks.
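This refinement step boils down to computing a representative pixel from the coarse mask and reusing it as a point prompt. A minimal sketch of the center computation, assuming the mask is available as a boolean NumPy array (GetCenterFromMask2D performs the equivalent operation on a Mask2D message):

```python
import numpy as np

def mask_center(mask: np.ndarray) -> tuple[int, int]:
    """Return the (u, v) centroid of a boolean mask as a pixel coordinate."""
    vs, us = np.nonzero(mask)  # rows are v (y), columns are u (x)
    return int(us.mean()), int(vs.mean())
```

Note that the centroid of a highly non-convex mask can land off the object; if that matters for your scene, a point sampled from inside the mask is a safer prompt.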
Extracting 3D Masks and Fitting Geometric Shapes
The segmentation Behaviors output a list of masks, of ROS message type moveit_studio_vision_msgs/msg/Mask2D.
Many other Behaviors in MoveIt Pro can consume masks in this format for further processing.
For example, we can extend our Objective to convert the 2D segmentation masks to 3D point cloud segments by using the following Behaviors.
1. GetPointCloud and GetCameraInfo to get the necessary information for 2D-to-3D segmentation correspondence.
2. GetMasks3DFromMasks2D, which accepts the 2D masks, point cloud, and camera info to produce a set of 3D masks (the back-projection this relies on is sketched after this list).
3. ForEachMask3D to loop through each of the detected masks.
4. GetPointCloudFromMask3D to get a point cloud fragment corresponding to a 3D mask.
5. SendPointCloudToUI to visualize each point cloud segment in the UI.
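The 2D-to-3D correspondence rests on the standard pinhole camera model: a pixel (u, v) with depth d back-projects into 3D using the intrinsics from the CameraInfo message. A minimal sketch of that math follows; GetMasks3DFromMasks2D handles this internally, and the per-pixel depth value assumed here would come from the organized point cloud in practice.

```python
import numpy as np

def backproject(u: float, v: float, depth: float, K: np.ndarray) -> np.ndarray:
    """Back-project pixel (u, v) at the given depth (meters) into the camera
    optical frame. K is the 3x3 intrinsic matrix, e.g. built from a
    sensor_msgs/CameraInfo message with np.array(info.k).reshape(3, 3)."""
    fx, fy = K[0, 0], K[1, 1]  # focal lengths in pixels
    cx, cy = K[0, 2], K[1, 2]  # principal point
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])
```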
To run an example, execute the Segment Image from Clicked Point Objective. Configure the views to display the annotated image topic, and you will see the segmentation results in the UI.
You can additionally extract graspable objects from the 3D masks and fit geometric shapes, using the following Behaviors.
1. GetGraspableObjectsFromMasks3D to convert the 3D mask representations to graspable object representations, which include a cuboid bounding volume by default (a simplified version of this fit is sketched after this list).
2. ForEachGraspableObject to loop through each of the graspable objects.
3. ModifyObjectInPlanningScene to visualize each graspable object (and its corresponding geometry) in the UI.
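For intuition, a cuboid bounding volume can be sketched as an axis-aligned bounding box over the segment's points. Treat this as a simplified stand-in: the actual Behavior may fit an oriented box, and the axis alignment in the camera frame here is an assumption.

```python
import numpy as np

def fit_aabb(points: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Fit an axis-aligned bounding box to an (N, 3) array of points.
    Returns the box center and its (x, y, z) dimensions."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    return (lo + hi) / 2.0, hi - lo
```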
Grasping an Object from a Text Prompt
In addition to creating graspable objects, we can also generate grasp poses directly from the point cloud fragments. For each fragment, we extract the centroid pose and use motion planning to reach it.
1. ForEach and AddPointCloudToVector to collect a vector of the point cloud fragments.
2. ForEach and GetCentroidFromPointCloud to loop over all of the fragments and generate centroid poses (the centroid computation is sketched after this list).
3. Plan Move To Pose or equivalent to plan to the target pose.
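A minimal sketch of the centroid-pose step, assuming the fragment is available as an (N, 3) NumPy array expressed in a known frame; orientation selection is left to downstream planning, so the identity orientation below is a placeholder.

```python
import numpy as np
from geometry_msgs.msg import PoseStamped

def centroid_pose(points: np.ndarray, frame_id: str) -> PoseStamped:
    """Build a PoseStamped at the centroid of a point cloud fragment.
    Orientation is left as identity; a grasp planner or downstream
    Behavior would typically choose a better approach direction."""
    pose = PoseStamped()
    pose.header.frame_id = frame_id
    c = points.mean(axis=0)
    pose.pose.position.x, pose.pose.position.y, pose.pose.position.z = map(float, c)
    pose.pose.orientation.w = 1.0  # identity orientation as a placeholder
    return pose
```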
To run this example, execute the Grasp Object from Text Prompt Objective.
Next Steps
Once you have detected 3D objects from 2D image segmentation, you can use the poses and shapes of the detected objects for motion planning tasks. Some examples include pushing buttons, opening doors, or executing inspection paths around objects of interest.