Multi-modal Messages

Summary

Problem Statement

Current AG-UI protocol only supports text-based user messages. As LLMs increasingly support multimodal inputs (images, audio, files), the protocol needs to evolve to handle these richer input types.

Motivation

Evolve AG-UI to support multimodal input messages without breaking existing apps. Inputs may include text, images, audio, and files.

Status

Status: Implemented — October 16, 2025
Author(s): Markus Ecker ([email protected])

Detailed Specification

Overview

Extend the UserMessage content property to be either a string or an array of InputContent:

interface TextInputContent {
  type: "text"
  text: string
}

interface BinaryInputContent {
  type: "binary"
  mimeType: string
  id?: string
  url?: string
  data?: string
  filename?: string
}

type InputContent = TextInputContent | BinaryInputContent

type UserMessage = {
  id: string
  role: "user"
  content: string | InputContent[]
  name?: string
}

InputContent Types

TextInputContent

Represents text content within a multimodal message.

interface TextInputContent {
  type: "text"
  text: string
}

Property	Type	Description
`type`	`"text"`	Identifies this as text content
`text`	`string`	The text content

BinaryInputContent

Represents binary content such as images, audio, or files.

interface BinaryInputContent {
  type: "binary"
  mimeType: string
  id?: string
  url?: string
  data?: string
  filename?: string
}

Property	Type	Description
`type`	`"binary"`	Identifies this as binary content
`mimeType`	`string`	MIME type of the content (e.g., “image/jpeg”, “audio/wav”)
`id`	`string?`	Optional identifier for content reference
`url`	`string?`	Optional URL to fetch the content
`data`	`string?`	Optional base64-encoded content
`filename`	`string?`	Optional filename for the content

Content Delivery Methods

Binary content can be provided through multiple methods:

Inline Data: Base64-encoded in the data field
URL Reference: External URL in the url field
ID Reference: Reference to pre-uploaded content via id field

At least one of data, url, or id must be provided for binary content.

Implementation Examples

Simple Text Message (Backward Compatible)

{
  "id": "msg-001",
  "role": "user",
  "content": "What's in this image?"
}

Image with Text

{
  "id": "msg-002",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What's in this image?"
    },
    {
      "type": "binary",
      "mimeType": "image/jpeg",
      "data": "base64-encoded-image-data..."
    }
  ]
}

Multiple Images with Question

{
  "id": "msg-003",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What are the differences between these images?"
    },
    {
      "type": "binary",
      "mimeType": "image/png",
      "url": "https://example.com/image1.png"
    },
    {
      "type": "binary",
      "mimeType": "image/png",
      "url": "https://example.com/image2.png"
    }
  ]
}

Audio Transcription Request

{
  "id": "msg-004",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Please transcribe this audio recording"
    },
    {
      "type": "binary",
      "mimeType": "audio/wav",
      "filename": "meeting-recording.wav",
      "id": "audio-upload-123"
    }
  ]
}

Document Analysis

{
  "id": "msg-005",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Summarize the key points from this PDF"
    },
    {
      "type": "binary",
      "mimeType": "application/pdf",
      "filename": "quarterly-report.pdf",
      "url": "https://example.com/reports/q4-2024.pdf"
    }
  ]
}

Implementation Considerations

Client SDK Changes

TypeScript SDK:

Extended UserMessage type in @ag-ui/core
Content validation utilities
Helper methods for constructing multimodal messages
Binary content encoding/decoding utilities

Python SDK:

Extended UserMessage class
Content type validation
Multimodal message builders
Binary content handling utilities

Framework Integration

Frameworks need to:

Parse multimodal user messages
Forward content to LLM providers that support multimodal inputs
Handle fallbacks for models that don’t support certain content types
Manage content upload/storage for binary data

Use Cases

Visual Question Answering

Users can upload images and ask questions about them.

Document Processing

Upload PDFs, Word documents, or spreadsheets for analysis.

Audio Transcription and Analysis

Process voice recordings, podcasts, or meeting audio.

Multi-document Comparison

Compare multiple images, documents, or mixed media.

Screenshot Analysis

Share screenshots for UI/UX feedback or debugging assistance.

Testing Strategy

Unit tests for content type validation
Integration tests with multimodal LLMs
Backward compatibility tests with string content
Performance tests for large binary payloads
Security tests for content validation and sanitization

Get Started

Concepts

Draft Proposals

Tutorials

Development

Summary

Problem Statement

Motivation

Status

Detailed Specification

Overview

InputContent Types

TextInputContent

BinaryInputContent

Content Delivery Methods

Implementation Examples

Simple Text Message (Backward Compatible)

Image with Text

Multiple Images with Question

Audio Transcription Request

Document Analysis

Implementation Considerations

Client SDK Changes

Framework Integration

Use Cases

Visual Question Answering

Document Processing

Audio Transcription and Analysis

Multi-document Comparison

Screenshot Analysis

Testing Strategy

References

Get Started

Concepts

Draft Proposals

Tutorials

Development

​Multi-modal Messages Proposal

​Summary

​Problem Statement

​Motivation

​Status

​Detailed Specification

​Overview

​InputContent Types

​TextInputContent

​BinaryInputContent

​Content Delivery Methods

​Implementation Examples

​Simple Text Message (Backward Compatible)

​Image with Text

​Multiple Images with Question

​Audio Transcription Request

​Document Analysis

​Implementation Considerations

​Client SDK Changes

​Framework Integration

​Use Cases

​Visual Question Answering

​Document Processing

​Audio Transcription and Analysis

​Multi-document Comparison

​Screenshot Analysis

​Testing Strategy

​References

Multi-modal Messages Proposal

Summary

Problem Statement

Motivation

Status

Detailed Specification

Overview

InputContent Types

TextInputContent

BinaryInputContent

Content Delivery Methods

Implementation Examples

Simple Text Message (Backward Compatible)

Image with Text

Multiple Images with Question

Audio Transcription Request

Document Analysis

Implementation Considerations

Client SDK Changes

Framework Integration

Use Cases

Visual Question Answering

Document Processing

Audio Transcription and Analysis

Multi-document Comparison

Screenshot Analysis

Testing Strategy

References