Guide: How to use Vision

A step-by-step guide to using the Vision capabilities of a Vision model in Langbase Pipes.


In this guide, we will learn how to send an image to a Vision model in a Langbase Pipe and get it to answer questions about it.

What is Vision?

LLMs with vision capabilities can take images as input, understand them, and generate text-based answers about them. Vision models can be used to answer questions about images, generate captions, or provide descriptions of visual content. Vision is also used for tasks like OCR, image classification, and object detection.

LLM Type                | Input        | Output
Unimodal without Vision | Text         | Text
Multimodal with Vision  | Text + Image | Text

Let's say we send the following image to a Vision model and ask it to describe the image.

The Vision model will process the image and generate a text-based response like this.

The image depicts an iridescent green sweat bee, likely of the genus Agapostemon or Augochlorini.

In the image, the bee is perched on a flower, likely foraging for nectar or pollen, which is a common behavior for these pollinators.

How to use Vision in Langbase Pipes?

Vision is supported in Langbase Pipes across different LLM providers, including OpenAI, Anthropic, and Google. Using Vision in a Pipe is simple: send images in the API request and get text answers about them.

Sending Images to Pipe for Vision

First, select a Vision model that supports image input in your Langbase Pipe. You can choose from a variety of Vision models from different LLM providers, for example OpenAI's gpt-4o or Anthropic's claude-3.5-sonnet.

The Pipe Run API matches the OpenAI spec for Vision requests. When running the pipe, provide the image in a message inside the messages array.

Here is what your messages will look like for vision requests:

// Pipe Run API
"messages": [
	{
		"role": "user",
		"content": [
			{
				"type": "text",
				"text": "What is in this image?"
			},
			{
				"type": "image_url",
				"image_url": {
					"url": "...xyz" // base64 encoded image
				}
			}
		]
	}
]

In the above example, we are sending an image URL (base64 encoded image) as input to the vision model pipe, which will process the image and give a text response.

Follow the Run Pipe API spec for detailed request types.
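The message shape above can be sketched in Python. This is a minimal sketch, not an official client: the helper names `to_data_url` and `vision_message` are hypothetical, and it assumes the base64 image is passed as a `data:<mime>;base64,...` data URL, as in the OpenAI spec the Pipe Run API is said to match.

```python
import base64
import mimetypes


def to_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URL (assumed format)."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def vision_message(question: str, image_url: str) -> dict:
    """Build one user message: a text part followed by an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
```

For example, `vision_message("What is in this image?", to_data_url(image_bytes, "bee.jpg"))` yields a dictionary that can be placed in the messages array, where `image_bytes` holds the contents of an image file read in binary mode.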

Image Input Guidelines for Vision

Here are some considerations when using vision in Langbase Pipes:

  1. Message Format

    • Images can be passed in user role messages.
    • Message content must be an array of content parts (for text and images) in vision requests. In text-only requests, the message content is a string.
  2. Image URL

    • The image_url field is used to pass the image URL, which can be:
      1. Base64 encoded images: Supported by all providers.
      2. Public URLs: Supported only by OpenAI.
  3. Provider-specific limits

    • Different LLM providers may impose varying restrictions on image size, format, and the number of images per request.
    • Refer to the specific provider’s documentation for precise limits.
    • Langbase imposes no additional restrictions.
  4. Image Quality Settings (OpenAI only)

    • OpenAI models support an optional detail field in the image_url object for controlling image quality.
    • The detail field can be set to low, high, or auto to control the quality of the image sent to the model.
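The message format rules and the optional detail field can be illustrated with a short Python sketch. The helper names `image_part` and `check_message` are hypothetical and not part of any Langbase SDK; the checks only mirror the guidelines listed above and do not replace provider-side validation.

```python
from typing import Optional


def image_part(url: str, detail: Optional[str] = None) -> dict:
    """Build an image content part; `detail` is only added when given."""
    image_url = {"url": url}
    if detail is not None:
        image_url["detail"] = detail  # OpenAI-only quality setting
    return {"type": "image_url", "image_url": image_url}


def check_message(message: dict) -> None:
    """Raise ValueError if a message breaks the format rules above."""
    content = message["content"]
    if isinstance(content, str):
        return  # text-only request: content is a plain string
    if not isinstance(content, list):
        raise ValueError("content must be a string or a list of content parts")
    has_image = any(part.get("type") == "image_url" for part in content)
    if has_image and message["role"] != "user":
        raise ValueError("images can only be passed in user role messages")
```

Calling `image_part("<image_base64>", detail="low")` adds the detail field to the image_url object, while omitting `detail` leaves the part in the provider-neutral form used elsewhere in this guide.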

Examples

Here are some example Pipe Run requests utilizing Vision models in Langbase Pipes.

Example 1: Sending a Base64 Image

Here is an example of sending a base64 image in a Pipe Run API request.

curl https://api.langbase.com/v1/pipes/run \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_PIPE_API_KEY>' \
-d '{
  "stream": false,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe this image."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "<image_base64>"
          }
        }
      ]
    }
  ]
}'
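The same request can be sketched with Python's standard library. This is a rough equivalent of the curl command, not an official client: `build_run_request` and `run_pipe` are hypothetical names, and `run_pipe` assumes the non-streaming response body is JSON, which this guide does not describe in detail.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
RUN_URL = "https://api.langbase.com/v1/pipes/run"


def build_run_request(api_key: str, messages: list) -> urllib.request.Request:
    """Prepare (but do not send) a non-streaming Pipe Run request."""
    body = json.dumps({"stream": False, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        RUN_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def run_pipe(api_key: str, messages: list) -> dict:
    """Send the request and parse the body, assuming it is JSON."""
    with urllib.request.urlopen(build_run_request(api_key, messages)) as response:
        return json.load(response)
```

Separating request construction from sending makes it easy to inspect the payload before any network call is made.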

Example 2: Sending Image as a Public Image URL (supported by OpenAI only)

Public image URLs are only supported by OpenAI, so make sure you are using an OpenAI model.

curl https://api.langbase.com/v1/pipes/run \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_PIPE_API_KEY>' \
-d '{
  "stream": false,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe this image."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/b/b5/Iridescent.green.sweat.bee1.jpg"
          }
        }
      ]
    }
  ]
}'
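Because public URLs are accepted only by OpenAI models, a client may want to check the URL type before sending. The following is a small sketch under stated assumptions: the function names and the provider string `"openai"` are hypothetical labels chosen for illustration, not values defined by Langbase.

```python
from urllib.parse import urlparse


def is_public_url(url: str) -> bool:
    """Return True for http(s) URLs, False for data URLs or raw strings."""
    return urlparse(url).scheme in ("http", "https")


def check_image_url(provider: str, url: str) -> None:
    """Raise ValueError when a public URL is used with a non-OpenAI provider."""
    if is_public_url(url) and provider.lower() != "openai":
        raise ValueError(
            f"Public image URLs are only supported by OpenAI, not {provider!r}; "
            "send a base64 encoded image instead"
        )
```

A check like this fails early on the client side instead of waiting for the provider to reject the request.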

Example 3: Sending multiple images

You can also send multiple images attached to the same message.

curl https://api.langbase.com/v1/pipes/run \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_PIPE_API_KEY>' \
-d '{
  "stream": false,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "How are these images different?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "<image_1_base64>"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "<image_2_base64>"
          }
        }
      ]
    }
  ]
}'

Replace <image_1_base64> and <image_2_base64> with the base64 encoded images you want to send.
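When the number of images varies, the content array can be built programmatically. The sketch below uses a hypothetical helper, `multi_image_message`, that places the text part first and appends one image part per URL, matching the shape of the request above.

```python
def multi_image_message(question: str, image_urls: list) -> dict:
    """Build one user message with a text part and several image parts."""
    if not image_urls:
        raise ValueError("at least one image URL is required")
    content = [{"type": "text", "text": question}]
    # Append the images in the order given, after the text part.
    content += [
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    ]
    return {"role": "user", "content": content}
```

Keep in mind that the number of images allowed per request depends on the LLM provider.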

Example 4: Sending multiple images in conversation turns (chat)

You can also send multiple images in different messages across conversation turns.

Let's say you start the conversation with the first image:

curl https://api.langbase.com/v1/pipes/run \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_PIPE_API_KEY>' \
-d '{
  "stream": false,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe this image."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "<image_1_base64>"
          }
        }
      ]
    }
  ]
}'

Then, in the next turn, you can send the second image:

curl https://api.langbase.com/v1/pipes/run \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_PIPE_API_KEY>' \
-d '{
  "stream": false,
  "thread_id": "<thread_id>",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Is this image different from the previous one?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "<image_2_base64>"
          }
        }
      ]
    }
  ]
}'

When you include the thread_id returned from the first request in the second request, your Langbase Pipe automatically continues the conversation from the previous turn.
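The two request bodies differ only in the thread_id field, which a small payload builder can capture. This is a sketch with a hypothetical function name, `run_payload`; how the thread_id is read from the first response is left out because it is not covered in this guide.

```python
from typing import Optional


def run_payload(messages: list, thread_id: Optional[str] = None) -> dict:
    """Build a Pipe Run request body, continuing a thread when an id is given."""
    payload = {"stream": False, "messages": messages}
    if thread_id is not None:
        # Only follow-up turns carry the id returned by the first request.
        payload["thread_id"] = thread_id
    return payload
```

The first turn calls `run_payload(messages)` and later turns call `run_payload(messages, thread_id)`, so only the new message needs to be sent each time.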

FAQs

  • Make sure to use the correct Vision model that supports image input.
  • Langbase currently supports Vision models from OpenAI, Anthropic and Google. More providers will be supported soon.
  • Vision support is live on the Langbase API. Vision in the Studio playground is coming soon.
  • Langbase currently does not store images sent to Vision models.