Version: 7

Image Segmentation using Machine Learning

In this example, we demonstrate image segmentation using machine learning (ML) with MoveIt Pro. ML-based image segmentation plays a crucial role in robotics by enabling precise perception of the environment. In object pose estimation, segmentation separates objects of interest from the background, and the resulting masks can be used to estimate each object's 3D position and orientation. This 3D pose estimation enables autonomous robotic grasping and manipulation in unstructured scenes.


MoveIt Pro offers several Behaviors for image segmentation using ML models. The GetMasks2DFromPointQuery Behavior segments images using point prompts. Point prompts are user-defined spatial cues that guide segmentation by indicating object locations. These prompts help refine masks, especially in ambiguous or complex scenes.

The GetMasks2DFromTextQuery Behavior segments images using text prompts. Text prompts allow segmentation based on natural language descriptions, enabling flexible and intuitive object identification without manual annotations. They produce coarse masks for general object descriptions.

Both Behaviors take their image input as a sensor_msgs/msg/Image message and return their masks as moveit_studio_vision_msgs/msg/Mask2D messages, a type defined in the MoveIt Pro SDK. The unique strengths of each segmentation Behavior make them suitable for different applications.

Launch MoveIt Pro

We assume you have already installed MoveIt Pro to the default install location. Launch the application using:

moveit_pro run -c lab_sim

Performing 2D Image Segmentation

Once you have your robot config running, you can create a simple Objective in MoveIt Pro that moves to a predefined location and performs segmentation. The Objective uses the following Behaviors.

  • MoveToWaypoint (or equivalent) to move to the predefined location.
  • GetImage to get the latest RGB image message from a camera stream.
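  • GetPointsFromUser to interactively get a point prompt (used with GetMasks2DFromPointQuery).
  • GetMasks2DFromTextQuery to segment the image from a text prompt, or GetMasks2DFromPointQuery to segment it from a point prompt.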
  • PublishMask2D to visualize the masks.

To run an example, execute the Segment Image from Text Prompt Objective. Configure the views to display the annotated image topic, and you will see the segmentation results in the UI.

ML Segmentation 2D

Note that the GetMasks2DFromTextQuery Behavior has additional options to filter detections, which may require tuning for your specific application. Feel free to change the detection options to see how the results are affected. The Behavior uses the provided positive and negative text prompts to predict the probability of a mask, and the probability threshold can be adjusted to make the segmentation more or less permissive. For the best results, refine the masks from GetMasks2DFromTextQuery by extracting their center points with GetCenterFromMask2D and feeding those points into GetMasks2DFromPointQuery. The text prompts give a coarse estimate of the mask, while the point prompts yield nearly exact masks, so the masks from the point prompt Behavior will be of higher quality than those from the text prompt Behavior.
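To make the thresholding and center-point steps concrete, here is a minimal sketch in plain Python. It is not the implementation used by GetMasks2DFromTextQuery or GetCenterFromMask2D. The function names, the nested-list probability map, and the threshold values are all assumptions made for illustration.

```python
# Illustrative only: hypothetical helpers, not part of the MoveIt Pro SDK.

def threshold_mask(prob, threshold=0.5):
    """Turn a per-pixel probability map into a binary mask."""
    return [[p >= threshold for p in row] for row in prob]


def mask_center(mask):
    """Return the (x, y) centroid of the True pixels, or None if the mask is empty."""
    sum_x, sum_y, count = 0, 0, 0
    for y, row in enumerate(mask):
        for x, inside in enumerate(row):
            if inside:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count)


# A made-up 3x4 probability map with one bright blob.
prob = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.1],
    [0.0, 0.7, 0.9, 0.2],
]

mask = threshold_mask(prob, threshold=0.5)    # permissive threshold
strict = threshold_mask(prob, threshold=0.85)  # stricter threshold keeps fewer pixels
center = mask_center(mask)                     # candidate point prompt
```

A raised threshold shrinks the mask, which is the trade-off to tune for your application. Note also that the centroid of a non-convex mask (a ring or a hook shape, for example) can fall outside the object, so a center point is not always a valid point prompt.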

Extracting 3D Masks and Fitting Geometric Shapes

The segmentation Behaviors output a list of masks of ROS message type moveit_studio_vision_msgs/msg/Mask2D. Many other Behaviors in MoveIt Pro can consume masks in this format for further processing.

For example, we can extend our Objective to convert the 2D segmentation masks to 3D point cloud segments by using the following Behaviors.

  • GetPointCloud and GetCameraInfo to get the necessary information for 2D to 3D segmentation correspondence.
  • GetMasks3DFromMasks2D, which accepts the 2D masks, point cloud, and camera info to produce a set of 3D masks.
  • ForEachMask3D to loop through each of the detected masks.
  • GetPointCloudFromMask3D to get a point cloud fragment corresponding to a 3D mask.
  • SendPointCloudToUI to visualize each point cloud segment above in the UI.
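The 2D to 3D correspondence in the list above can be pictured with a pinhole camera model: each 3D point is projected into the image, and it is kept if it lands on a masked pixel. The sketch below is a conceptual illustration in plain Python, not the algorithm inside GetMasks3DFromMasks2D. The function names and sample values are hypothetical, and it assumes the points are already expressed in the camera optical frame (z forward) with intrinsics fx, fy, cx, cy as found in a camera info message.

```python
# Illustrative only: hypothetical helpers, not part of the MoveIt Pro SDK.

def project_point(point, fx, fy, cx, cy):
    """Project a 3D point (camera optical frame) to integer pixel coordinates."""
    x, y, z = point
    if z <= 0.0:
        return None  # point is behind the camera
    u = fx * x / z + cx
    v = fy * y / z + cy
    return (int(round(u)), int(round(v)))


def points_in_mask(points, mask, fx, fy, cx, cy):
    """Keep the points whose projection falls on a True pixel of the 2D mask."""
    height, width = len(mask), len(mask[0])
    selected = []
    for point in points:
        pixel = project_point(point, fx, fy, cx, cy)
        if pixel is None:
            continue
        u, v = pixel
        if 0 <= u < width and 0 <= v < height and mask[v][u]:
            selected.append(point)
    return selected


F, T = False, True
mask = [
    [F, F, F, F],
    [F, F, T, F],
    [F, F, T, F],
    [F, F, F, F],
]
cloud = [
    (0.0, -0.5, 1.0),   # projects to pixel (2, 1): inside the mask
    (0.0, 0.0, 1.0),    # projects to pixel (2, 2): inside the mask
    (-1.0, 0.0, 1.0),   # projects to pixel (0, 2): outside the mask
    (0.0, 0.0, -1.0),   # behind the camera
    (5.0, 0.0, 1.0),    # projects outside the image bounds
]
segment = points_in_mask(cloud, mask, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

This is why the camera info is needed alongside the point cloud: without the intrinsics there is no way to tell which pixel a given 3D point corresponds to.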

To run an example, execute the Segment Image from Clicked Point Objective. Configure the views to display the annotated image topic, and you will see the segmentation results in the UI.

ML Segmentation 3D

You can additionally extract graspable objects from the 3D masks and fit geometric shapes, using the following Behaviors.

  • GetGraspableObjectsFromMasks3D to convert the 3D mask representations to graspable object representations, which include a cuboid bounding volume by default.
  • ForEachGraspableObject to loop through each of the graspable objects.
  • ModifyObjectInPlanningScene to visualize each graspable object (and its corresponding geometry) in the UI.
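As a rough picture of what a cuboid bounding volume is, the sketch below fits an axis-aligned box to a set of 3D points by taking the per-axis minimum and maximum. This is only an illustration under that simplifying assumption. GetGraspableObjectsFromMasks3D may fit its cuboid differently (for example, with an oriented box), and the function name here is hypothetical.

```python
# Illustrative only: a hypothetical helper, not part of the MoveIt Pro SDK.

def fit_axis_aligned_cuboid(points):
    """Return (center, size) of the smallest axis-aligned box containing the points."""
    if not points:
        raise ValueError("cannot fit a cuboid to an empty point set")
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = tuple((lo + hi) / 2.0 for lo, hi in zip(mins, maxs))
    size = tuple(hi - lo for lo, hi in zip(mins, maxs))
    return center, size


# Made-up points sampled from a small object, in meters.
points = [
    (0.0, 0.0, 0.5),
    (0.2, 0.0, 0.5),
    (0.2, 0.1, 0.75),
    (0.0, 0.1, 0.75),
]
center, size = fit_axis_aligned_cuboid(points)
```

An axis-aligned box is the simplest choice, but it overestimates the volume of objects that are rotated relative to the reference frame, which matters when the box is used for collision checking or grasp planning.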

Grasping an Object from a Text Prompt

We can also generate grasp poses directly from the point cloud fragments in addition to creating graspable objects. For each fragment, we can extract the centroid pose and use motion planning to reach the desired pose.

  • ForEach and AddPointCloudToVector to collect a vector of the point cloud fragments.
  • ForEach and GetCentroidFromPointCloud to loop over all of the fragments and generate centroid poses.
  • Plan Move To Pose or equivalent to plan to the target pose.
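The centroid step in the list above amounts to averaging the points of each fragment. The following plain-Python sketch shows that calculation for a vector of fragments. It is not the implementation of GetCentroidFromPointCloud. The function name and the sample data are hypothetical, and because a centroid only gives a position, the identity orientation paired with it here is a placeholder assumption.

```python
# Illustrative only: a hypothetical helper, not part of the MoveIt Pro SDK.

def centroid(points):
    """Return the mean (x, y, z) of a non-empty list of 3D points."""
    n = len(points)
    if n == 0:
        raise ValueError("cannot compute the centroid of an empty fragment")
    return tuple(sum(p[i] for p in points) / n for i in range(3))


# Two made-up point cloud fragments, one per segmented object.
fragments = [
    [(0.0, 0.0, 1.0), (2.0, 0.0, 1.0), (1.0, 3.0, 1.0)],
    [(0.5, 0.5, 2.0), (1.5, 0.5, 4.0)],
]

# Assumption: identity quaternion (x, y, z, w) as a placeholder orientation.
IDENTITY_ORIENTATION = (0.0, 0.0, 0.0, 1.0)

# One target pose per fragment: (position, orientation).
targets = [(centroid(fragment), IDENTITY_ORIENTATION) for fragment in fragments]
```

In practice the orientation of the target pose has to come from somewhere else, such as a fixed approach direction for the gripper, since the centroid carries no information about how the object is oriented.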

To run this example, execute the Grasp Object from Text Prompt Objective.

Next Steps

Once you have detected 3D objects from 2D image segmentation, you can use the poses and shapes of the detected objects for motion planning tasks. Some examples include pushing buttons, opening doors, and following inspection paths around objects of interest.