Version: 9

Integrate with Google Gemini

Required Version
This feature requires MoveIt Pro version 9.3 or later.

This guide explains how to use Google Gemini from inside a MoveIt Pro Behavior Tree to perform high-level visual reasoning over camera images.

Launch MoveIt Pro

We assume you have already installed MoveIt Pro to the default install location. Launch the application using:

moveit_pro run -c lab_sim

What is Google Gemini?

Google Gemini is a family of multimodal large language models from Google. The Gemini API accepts a text prompt together with one or more images and returns a text response. The same model can answer freeform questions about an image, identify objects, count items, describe a scene, or return structured data such as 2D pixel coordinates of features it locates.

In a robotics context this is useful when a task is too open-ended for a classical perception pipeline. Examples include:

  • High-level scene reasoning: "Is the door open or closed?", "Which drawer is empty?", "Has the bin been emptied?".
  • Open-vocabulary localization: "Find every loose bolt on the panel", "Locate the red emergency button", "Point to all the green tomatoes".
  • Inspection and triage: "List any defects you see in this part", "Which of these bottles is mislabeled?".

For perception tasks that need pixel- or sub-pixel-level precision, Gemini is best used as a coarse first stage that hands off to dedicated vision Behaviors (e.g. mask refinement, pose estimation).

Prerequisites

To use Gemini from MoveIt Pro you need a Google Gemini API key.

  1. Outbound internet access. The backend container must be able to reach generativelanguage.googleapis.com over HTTPS. Air-gapped or tightly firewalled deployments cannot use this Behavior; choose a local alternative instead.

  2. Follow the Get a Gemini API key instructions to create a key in Google AI Studio. The free tier supports vision-capable models (e.g. gemini-2.5-flash) at low rate limits, which is sufficient for prototyping.

  3. Make the key available to the MoveIt Pro backend as an environment variable named GOOGLE_GEMINI_API_KEY. The MoveIt Pro production stack passes this variable through from the host into the container automatically, so there are two equivalent ways to set it:

    • Production / persistent: add the key to your .env file alongside MOVEIT_LICENSE_KEY:

      GOOGLE_GEMINI_API_KEY=your-key-here
    • One-off / development: export it in the shell where you run moveit_pro run:

      export GOOGLE_GEMINI_API_KEY="your-key-here"
    Treat the API key as a secret

    Do not commit your API key to source control. Keep it in .env (which is gitignored) or in a secrets manager.

  4. Review Google's pricing and rate limits for the model you plan to use. The free tier is rate-limited to a small number of requests per minute, so design Behavior Trees that call Gemini sparingly rather than every tick.

The GetPoints2DFromGeminiQuery Behavior

MoveIt Pro ships with the GetPoints2DFromGeminiQuery Behavior, which sends a text prompt and a ROS image to Gemini and parses the response as a list of normalized 2D image points.

| Data Port Name | Direction | Type | Description |
| --- | --- | --- | --- |
| prompt | input | std::string | The user-facing prompt. Should ask Gemini to find, locate, or point to features in the image. |
| image | input | sensor_msgs::msg::Image | The image to send to Gemini. JPEG-encoded internally before transmission. |
| model_name | input | std::string (optional) | Gemini model identifier. Defaults to gemini-2.5-flash. |
| save_debug_image | input | std::string (optional) | Filesystem path. If non-empty, an annotated copy of the input image with detected points is written there. |
| response | output | std::string | The raw text response returned by Gemini. |
| detected_points | output | std::vector<geometry_msgs::msg::PointStamped> | Normalized 2D points parsed from the response. Each point's x and y are in [0, 1]; z is always 0. |

The Behavior automatically appends a structured-output suffix to your prompt instructing Gemini to return a JSON object of the form {"points": [{"x": 0.52, "y": 0.31, "label": "..."}, ...]}. You write your task in natural language and the Behavior takes care of the response format.

The header (frame ID and timestamp) of each PointStamped is copied from the input image, so downstream Behaviors can transform the points into other frames using the camera intrinsics and any TF tree the image was captured against.

Writing prompts that produce points

Because the appended suffix asks Gemini for image locations, the prompt must describe something Gemini can answer with 2D points. For example:

  • "Find every screw head on this panel."
  • "Locate the red and green LEDs."
  • "Point to each ripe tomato."

For example, the 2D points in the image below were generated by prompting Gemini to point to where a suction cup gripper should pick up each individual item:

Gemini detected points example

Prompts that do not ask for locations (for example, "Describe this image" or "Count the objects") will still execute and the raw text comes back on the response port, but detected_points will be empty and the Behavior logs a warning.

Using the Behavior in a Behavior Tree

A typical Behavior Tree using GetPoints2DFromGeminiQuery looks like:

  1. MoveToWaypoint — move the robot so the camera sees the workspace.
  2. GetImage — grab the latest RGB image from the camera topic.
  3. GetPoints2DFromGeminiQuery — send the image and your prompt to Gemini.
  4. Downstream Behaviors that consume the detected_points vector. Because the points are normalized to [0, 1], multiply by the image dimensions to convert to pixel coordinates, then optionally back-project into 3D using a depth image and GetCameraInfo. Some Behaviors, like GetPoseFromPixelCoords, take normalized pixel coordinates directly.

For debugging, set the save_debug_image port to a path such as /tmp/gemini_overlay.jpg. Each tick will overwrite that file with the input image annotated with red circles at each detected point, plus the labels Gemini assigned. This is helpful while iterating on prompts.

Latency and rate limits

Each call to Gemini takes between roughly one and several seconds depending on image size, prompt length, and the selected model. The Behavior is asynchronous and returns RUNNING until the request completes, so it does not block the rest of the Behavior Tree from being ticked.

Because of API rate limits, do not call this Behavior on every tick of a high-frequency loop. Use it inline at decision points, or behind a ForEach/Repeat decorator that throttles invocations.

Implementing your own Gemini-based Behavior

The GetPoints2DFromGeminiQuery Behavior is built on top of a small, reusable C++ utility namespace that you can use to build other Gemini-backed Behaviors with different response formats — for example, returning bounding boxes, free-text classifications, or polygon outlines.

The utilities live in the moveit_pro_behavior ROS package under the header moveit_pro_behavior/utils/gemini_utils.hpp and the moveit_pro::behaviors::gemini namespace. Add moveit_pro_behavior to your package's find_package calls and link against the gemini_utils CMake target to use them.

Sending a request and parsing the response

The end-to-end flow for a custom Gemini-backed Behavior is just three calls into gemini_utils:

#include "moveit_pro_behavior/utils/gemini_utils.hpp"

namespace gemini = moveit_pro::behaviors::gemini;

// 1. Build the multimodal JSON request body from a prompt and a JPEG image buffer.
const std::string body = gemini::buildGeminiRequestBody(prompt, gemini::encodeToBase64(jpeg_bytes));

// 2. POST the request. Reads GOOGLE_GEMINI_API_KEY from the environment, constructs the URL,
// and returns the raw HTTP response body or an error.
const auto http_response = gemini::postJsonToGemini("gemini-2.5-flash", body);
if (!http_response.has_value())
{
  return tl::make_unexpected(http_response.error());
}

// 3. Extract the model's text content. The shape of this text depends entirely on the prompt
// you sent.
const auto text = gemini::parseGeminiResponse(*http_response);

Tips for writing a custom response parser

Gemini's text output is mostly structured but rarely perfectly so. When writing your own parser, plan for the following:

  • Markdown code fences. Gemini frequently wraps JSON in ```json ... ``` even when told not to. Strip the fence before passing the contents to a JSON parser.
  • Surrounding prose. Gemini may add a short explanation before or after the JSON. Locate the outermost {...} (or [...]) by searching for the first { and last } rather than parsing the whole response.
  • Slight format drift. Coordinates that should be in [0, 1] may come back as 1.01 or -0.001. Floats that should be integers may come back as numeric strings. Decide whether your parser clamps, coerces, or rejects.
  • Refusals and safety filters. Gemini may refuse to answer; the response will still be valid JSON but will contain an error or an empty candidates array. Detect these and surface them as Behavior failures.
  • Best-effort vs strict. Decide upfront whether parse failures should fail the Behavior or just leave the relevant output port empty. The GetPoints2DFromGeminiQuery Behavior takes the latter approach: the raw response port is always populated even if the structured detected_points parser fails.
  • Prompt for the schema you want to parse. The more concrete the schema in the prompt (with example values and explicit field names), the more reliable Gemini's output. Always include "Return ONLY valid JSON" in the prompt instructions.