Version: 9

Multimodal Segmentation with SAM3

Required Version
This feature requires MoveIt Pro version 9.1 or newer.

MoveIt Pro supports SAM2-based segmentation with point prompts (GetMasks2DFromPointQuery) and CLIPSeg-based text prompts, covered in the ML Image Segmentation guide. SAM3 (Segment Anything Model 3 with Detection) replaces CLIPSeg with higher-quality masks and adds multimodal prompts through the GetMasks2DFromExemplar Behavior. With SAM3 you can:

  • Segment objects from text prompts with better mask quality than CLIPSeg
  • Find objects matching a visual example ("find things that look like this") using image exemplars
  • Narrow results when text alone is too ambiguous (e.g., combine a text prompt with an exemplar to find only the square bottles)
  • Combine text, image exemplars, and bounding boxes in a single inference pass

What is SAM3?

SAM3 uses a 4-model architecture (compared to SAM2's 3 models):

  1. Vision Encoder — Processes the input image into multi-scale feature maps
  2. Text Encoder — Converts a text prompt into features the decoder can use
  3. Geometry Encoder — Converts bounding box prompts (with image features) into spatial features
  4. Decoder — Combines all features to predict segmentation masks, bounding boxes, and confidence scores

The key difference from SAM2 is that SAM3 has native text understanding and a separate geometry encoder, allowing it to combine text, boxes, and visual features in a single inference pass.
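The four-stage data flow above can be sketched as follows. This is an illustrative stub only: the function names, tensor shapes, and return structures are hypothetical stand-ins, not the actual SAM3 ONNX model interfaces.

```python
import numpy as np

# Illustrative SAM3 data flow; shapes and names are hypothetical.
def vision_encoder(image):
    # (H, W, 3) image -> multi-scale feature maps
    return [np.zeros((256, s, s)) for s in (72, 36, 18)]

def text_encoder(prompt):
    # text prompt -> features the decoder can attend to
    return np.zeros((32, 256))

def geometry_encoder(boxes, image_features):
    # bounding box prompts (with image features) -> spatial features
    return np.zeros((len(boxes), 256))

def decoder(image_features, text_features, geometry_features):
    # combines all features to predict masks, boxes, and scores
    n = len(geometry_features)
    return {"masks": np.zeros((n, 1008, 1008)),
            "boxes": np.zeros((n, 4)),
            "scores": np.zeros(n)}

image = np.zeros((1008, 1008, 3), dtype=np.uint8)
img_feats = vision_encoder(image)
txt_feats = text_encoder("square pill bottle")
geo_feats = geometry_encoder([(10, 10, 50, 50)], img_feats)
out = decoder(img_feats, txt_feats, geo_feats)
```

Because text, geometry, and vision features all meet in one decoder call, a single inference pass can mix all three prompt types.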

SAM3 Inference Pipeline

Prompt Types

SAM3 supports three prompt types that can be used individually or combined:

Text Prompts

A short noun phrase describing what to find. For example, "an object", "bottles", or "square pill bottle". Text prompts are the simplest way to get started and work well for common objects. Note that SAM3 text prompts work best as noun phrases describing object appearance (e.g., "a player in white"), not spatial relationships or instructions.

Text prompt segmentation on a parts bin

Segmentation results using the text prompt "triangular grey part with holes".

Bounding Boxes

Unlike SAM2, where a bounding box segments everything inside it, SAM3 bounding boxes act as visual prompts — the model finds objects similar to what's inside the box elsewhere in the image. This makes a bounding box on the image itself equivalent to an in-image exemplar: "find more things like what's in this box." You can create boxes using the CreateBoundingBox2D, CreateBoundingBoxes2D, or CreateBoundingBoxFromOffset Behaviors.

In-image bounding box finding similar objects

A bounding box drawn around a single weed (red box) prompts SAM3 to find similar weeds throughout the image.

Image Exemplars

A reference image showing what you want to find, paired with bounding boxes identifying the object(s) of interest within it. The model builds a combined image with your target scene and the exemplar side-by-side, using the bounding boxes to understand what to look for. The exemplar can be a single object with one bounding box, or a composite image with multiple bounding boxes — as long as all boxes represent the same kind of thing. This is powerful when the object is hard to describe in words but easy to show.

Image exemplar with multiple bounding boxes finding brackets in a bin

Combining Prompts

For the best results, combine text and exemplar prompts together. The text gives semantic context ("what kind of thing") while the exemplar gives visual specifics ("what it looks like").

Launch MoveIt Pro

We assume you have already installed MoveIt Pro to the default install location. Launch the application using:

moveit_pro run -c lab_sim

Text Prompt Segmentation

The simplest way to use SAM3 is with a text prompt. The "ML Find Objects on Table" Objective demonstrates this.

The Behavior Tree flow is:

  1. Look at Table — Moves the arm so the wrist camera faces the table
  2. Take Wrist Camera Image — Captures an RGB image from the wrist camera
  3. GetMasks2DFromExemplar — Runs SAM3 inference with a text prompt
  4. PublishMask2D — Visualizes the detected masks

ML Find Objects on Table Behavior Tree

The key port values for GetMasks2DFromExemplar are:

GetMasks2DFromExemplar:
- target_image={wrist_camera_image}
- text_prompt=an object
- confidence_threshold=0.5
- model_package=lab_sim
- encoder_model_path=models/sam3_vision_encoder.onnx
- decoder_model_path=models/sam3_decoder.onnx
- geometry_encoder_model_path=models/sam3_geometry_encoder.onnx
- text_encoder_model_path=models/sam3_text_encoder.onnx

Run the Objective and check the /masks_visualization topic to see the detected objects with their confidence scores.

Text prompt segmentation results

note

SAM3's text encoder accepts a maximum of 32 tokens (30 content tokens plus start/end tokens). You can fit a fairly detailed description — for example, "grey triangular part with three large holes with bosses and five small holes and three subtle depressions near each of the three corners and several small slots a" is exactly 30 content tokens. Keep prompts short and specific for best results.

Try changing the text_prompt to be more specific (e.g., "bottles" or "pill bottles") and observe how the results change. More specific prompts generally yield fewer but more relevant detections.

Testing Prompts Without the Robot

Before tuning prompts on a live robot, you can iterate faster by loading images from files. The "ML Segment Bottles from File" Objective demonstrates this.

ML Segment Bottles from File Behavior Tree

LoadImageFromFile:
- file_path=objectives/real_lab1.png
- package_name=lab_sim
- image={image}

GetMasks2DFromExemplar:
- target_image={image}
- text_prompt=bottles
- confidence_threshold=0.5

This loads a static image of a laboratory table and runs SAM3 on it. Since no robot motion is involved, you can quickly experiment with different text prompts and confidence thresholds to find what works best for your objects.

Segmentation from file

Image Exemplar Segmentation

When text prompts aren't specific enough, image exemplars let you show the model exactly what you're looking for. The "ML Find Bottles on Table from Image Exemplar" Objective demonstrates this.

Preparing an Exemplar

An exemplar is a cropped photo of the object you want to find. For best results:

  • Crop tightly around the object with minimal background
  • Use a real photo, not a CAD render
  • The exemplar doesn't need to be from the same camera or lighting conditions

The lab_sim config includes an example exemplar at src/lab_sim/objectives/square_bottle1.png — a photo of a square pill bottle.

Example exemplar image

Building the Objective

The Behavior Tree flow is:

  1. LoadImageFromFile — Loads the exemplar image
  2. CreateBoundingBoxFromOffset — Generates a bounding box covering the exemplar (with optional padding)
  3. Move to Waypoint — Positions the camera to see the objects
  4. Take Wrist Camera Image — Captures the target scene
  5. GetMasks2DFromExemplar — Runs SAM3 with both the exemplar and bounding boxes
  6. PublishMask2D — Visualizes detected masks
  7. PublishBoundingBoxes2D — Visualizes the exemplar with its bounding box for debugging

ML Find Bottles on Table from Image Exemplar Behavior Tree

The key port values are:

LoadImageFromFile:
- file_path=objectives/square_bottle1.png
- package_name=lab_sim
- image={square_bottle_exemplar}

CreateBoundingBoxFromOffset:
- exemplar_image={square_bottle_exemplar}
- bounding_boxes={exemplar_bboxes}
- padding_percent=0.07

GetMasks2DFromExemplar:
- target_image={wrist_camera_image}
- exemplar_image={square_bottle_exemplar}
- exemplar_bboxes={exemplar_bboxes}
- confidence_threshold=0.45

Notice that the exemplar_image and exemplar_bboxes ports are now provided. The Behavior automatically constructs a combined image (target + exemplar side-by-side) for inference.

Exemplar segmentation results

How Exemplar Inference Works

Under the hood, the GetMasks2DFromExemplar Behavior:

  1. Resizes the exemplar to match the target image height (preserving aspect ratio)
  2. Concatenates the target and exemplar images side-by-side into a single combined image
  3. Scales and offsets the exemplar bounding boxes to their position in the combined image
  4. Runs SAM3 inference on the combined image with the geometry prompts
  5. Crops the output masks back to only the target image region

This means the model sees both the exemplar and the target in a single pass, allowing it to match visual features between them.
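The resize/concatenate/offset steps can be sketched with NumPy. This is a simplified illustration of the mechanism described above, assuming boxes are (x0, y0, x1, y1) pixel tuples and using a nearest-neighbor resize; the Behavior's actual resampling and data layout may differ.

```python
import numpy as np

def combine_for_inference(target, exemplar, exemplar_boxes):
    """Place target and exemplar side-by-side, remapping exemplar boxes."""
    th = target.shape[0]
    eh, ew = exemplar.shape[:2]
    # 1. Resize exemplar to target height, preserving aspect ratio.
    scale = th / eh
    new_w = int(round(ew * scale))
    rows = (np.arange(th) / scale).astype(int).clip(0, eh - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, ew - 1)
    resized = exemplar[rows][:, cols]
    # 2. Concatenate target and exemplar side-by-side.
    combined = np.hstack([target, resized])
    # 3. Scale the boxes and shift them right by the target width.
    x_off = target.shape[1]
    remapped = [(x0 * scale + x_off, y0 * scale,
                 x1 * scale + x_off, y1 * scale)
                for (x0, y0, x1, y1) in exemplar_boxes]
    return combined, remapped

def crop_to_target(mask, target_width):
    # 5. Keep only the target-image region of an output mask.
    return mask[:, :target_width]

# A 100x200 target with a 50x60 exemplar yields a 100x320 combined image.
target = np.zeros((100, 200, 3), dtype=np.uint8)
exemplar = np.zeros((50, 60, 3), dtype=np.uint8)
combined, boxes = combine_for_inference(target, exemplar, [(0, 0, 60, 50)])
```

Step 4 (running SAM3 on the combined image) happens between the two functions; the remapped boxes become the geometry prompts.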

Combining Text and Exemplar Prompts

For the most robust detection, provide both a text prompt and an exemplar image:

GetMasks2DFromExemplar:
- target_image={wrist_camera_image}
- text_prompt=square bottle
- exemplar_image={square_bottle_exemplar}
- exemplar_bboxes={exemplar_bboxes}
- confidence_threshold=0.45

The text prompt provides semantic guidance while the exemplar provides visual specificity.

Debugging with Bounding Box Visualization

The PublishBoundingBoxes2D Behavior publishes the exemplar image with its bounding boxes overlaid to the /bboxes_visualization topic. Use this to verify that the bounding boxes are correctly positioned on the exemplar — if they're off, the model will get incorrect geometry prompts.

Bounding box visualization on exemplar

Tuning Parameters

Confidence Threshold

The confidence_threshold parameter filters detections by their confidence score (0.0 to 1.0).

  • 0.5: Good starting point for text-only prompts (the port default is 0.0, which keeps all masks)
  • 0.45: Slightly more permissive, good for exemplar-based detection
  • 0.3: Catches more objects but may include false positives
  • 0.7: Strict, only high-confidence detections
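Conceptually, the threshold is a simple score filter. The snippet below is a hypothetical post-filter for illustration only; GetMasks2DFromExemplar applies the threshold internally, and the detection dictionary shown is an assumed structure.

```python
# Hypothetical detection records; the Behavior's internal format may differ.
detections = [
    {"label": "bottle", "score": 0.82},
    {"label": "bottle", "score": 0.47},
    {"label": "clutter", "score": 0.21},
]

def filter_by_confidence(dets, threshold=0.5):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in dets if d["score"] >= threshold]

strict = filter_by_confidence(detections, 0.5)    # keeps only the 0.82 hit
permissive = filter_by_confidence(detections, 0.3)  # keeps 0.82 and 0.47
```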

Bounding Box Padding

The padding_percent parameter on CreateBoundingBoxFromOffset insets the bounding box from the exemplar image borders. The value is the fraction of each dimension trimmed from each edge (range [0.0, 0.5)).

  • 0.0: Box covers the full exemplar image exactly
  • 0.05 (default): Trims 5% from each edge, so the box spans 90% of the image in each dimension
  • 0.15: Trims 15% from each edge, so the box spans 70% of the image in each dimension

Higher values crop further into the object, which can sometimes improve results — it forces the model to focus on the object's core features rather than edges or background that may bleed into the exemplar image. If your exemplar includes background around the object, increasing the inset helps exclude it. Experiment with different values for your objects.
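Assuming the resulting box is expressed as (x_min, y_min, x_max, y_max) in pixels, the inset arithmetic works out as sketched below; the Behavior's exact output format and rounding are assumptions here.

```python
def inset_box(width, height, padding_percent=0.05):
    """Inset a box from the image borders by a fraction of each dimension."""
    if not 0.0 <= padding_percent < 0.5:
        raise ValueError("padding_percent must be in [0.0, 0.5)")
    x0 = width * padding_percent
    y0 = height * padding_percent
    return (x0, y0, width - x0, height - y0)

# On a 200x100 exemplar, the default 0.05 trims 10 px horizontally
# and 5 px vertically from each edge.
inset_box(200, 100, 0.05)  # -> (10.0, 5.0, 190.0, 95.0)
```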

Model Package

The model_package port specifies which ROS package contains the ONNX model files. The models are expected under a models/ directory within that package. The lab_sim package ships with SAM3 models included.

Next Steps

SAM3 and SAM2 serve different scenarios. SAM3 handles text prompts, image exemplars, and bounding boxes, while SAM2 provides interactive point-based segmentation (e.g., clicking on objects). Both produce high-quality masks and feed into the same downstream pipeline. SAM3 also replaces CLIPSeg for text-based segmentation with significantly better mask quality, though CLIPSeg may still be faster on some hardware.

  • ML Image Segmentation — Convert 2D masks to 3D point cloud segments and fit geometric shapes
  • ML Grasping — Use segmented point clouds for ML-based grasp planning
  • ML Automasking — Automatically segment all objects in a scene without prompts (uses SAM2)

Technical Reference

SAM3 vs SAM2

| Aspect | SAM2 | SAM2 Automasking | SAM3 | CLIPSeg |
| --- | --- | --- | --- | --- |
| Prompt types | Points + Boxes | None (grid-based) | Text + Boxes + Exemplars | Text |
| Text support | No | No | Yes | Yes |
| Models | 3 (encoder, prompt encoder, decoder) | 3 (same as SAM2) | 4 (vision, text, geometry, decoder) | 2 (CLIP encoder, CLIPSeg decoder) |
| Input resolution | 1024x1024 | 1024x1024 | 1008x1008 | 352x352 |
| Best for | Interactive point-click segmentation | Promptless scene discovery | Flexible multimodal detection | Legacy text segmentation |

ONNX Model Files

SAM3 uses four ONNX model files:

  • sam3_vision_encoder.onnx — Image encoding (largest model)
  • sam3_text_encoder.onnx — Text prompt encoding
  • sam3_geometry_encoder.onnx — Bounding box encoding
  • sam3_decoder.onnx — Mask prediction

GPU Recommendations

SAM3 benefits significantly from GPU acceleration. On CPU-only machines, inference will be slower but functional (ONNX Runtime uses all available cores). A warning will appear in the UI log panel when running without a GPU.