Explore vision capabilities with the Gemini API

The Gemini API can run inference on images and videos passed to it. When passed an image, a series of images, or a video, Gemini can:

Describe or answer questions about the content
Summarize the content
Extrapolate from the content

This tutorial demonstrates some possible ways to prompt the Gemini API with image and video input. All output is text-only.
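As a rough illustration of what an image prompt can look like on the wire, the sketch below builds a JSON request body that pairs a text prompt with one base64-encoded image. It assumes the REST `generateContent` request shape (`contents`, `parts`, `inline_data`) and should be checked against the current API reference. The helper name `build_generate_content_body` is hypothetical, and the image bytes are a placeholder rather than a real file. Nothing is sent over the network here.

```python
import base64
import json


def build_generate_content_body(
    prompt: str, image_bytes: bytes, mime_type: str = "image/jpeg"
) -> dict:
    """Build a JSON-serializable body pairing a text prompt with one inline image.

    Assumption: the field names follow the REST generateContent format
    (contents -> parts -> text / inline_data). Verify against the API reference.
    """
    # Binary image data must be base64-encoded to travel inside JSON.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {"inline_data": {"mime_type": mime_type, "data": encoded}},
                ]
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for the contents of a real image file.
    fake_image = b"\xff\xd8\xff\xe0 placeholder bytes"
    body = build_generate_content_body("Describe this image.", fake_image)
    print(json.dumps(body, indent=2))
```

Inline data of this kind is generally suited to small files. For larger images and for video, uploading through the File API and referencing the uploaded file is the approach this guide describes.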

What's next

This guide shows how to upload image and video files using the File API and then generate text outputs from image and video inputs. To learn more, see the following resources:

File prompting strategies: The Gemini API supports prompting with text, image, audio, and video data, also known as multimodal prompting.
System instructions: System instructions let you steer the behavior of the model based on your specific needs and use cases.
Safety guidance: Sometimes generative AI models produce unexpected outputs, such as outputs that are inaccurate, biased, or offensive. Post-processing and human evaluation are essential to limit the risk of harm from such outputs.