Multi-modal Messages Proposal
Summary
Problem Statement
The current AG-UI protocol only supports text-based user messages. As LLMs increasingly support multimodal inputs (images, audio, files), the protocol needs to evolve to handle these richer input types.
Motivation
Evolve AG-UI to support multimodal input messages without breaking existing apps. Inputs may include text, images, audio, video, and documents. Each modality is represented as a distinct, typed content part with a clear source discriminator (data for inline base64, url for references), making it straightforward to map to any LLM provider’s API.
Status
- Status: Implemented — October 16, 2025
- Author(s): Markus Ecker (mail@mme.xyz), Alem Tuzlak (t.zlak97@gmail.com)
Detailed Specification
Overview
Extend the UserMessage content property to be either a string or an array of InputContentPart objects. Each modality (image, audio, video, document) has its own dedicated part type with a typed source that is either inline data or a url reference. This makes it trivial to map content parts to any LLM provider’s API.
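The proposed shape can be sketched in TypeScript roughly as follows. This is a non-authoritative sketch assembled from the tables in this proposal; the actual `@ag-ui/core` definitions may differ in detail, and the `UserMessage` envelope fields shown are illustrative.

```typescript
// Sketch of the proposed types, assembled from this proposal's tables.
type Modality = "text" | "image" | "audio" | "video" | "document";

// Inline base64 data vs. URL reference, discriminated on `type`.
type InputContentDataSource = { type: "data"; value: string; mimeType: string };
type InputContentUrlSource = { type: "url"; value: string; mimeType?: string };
type InputContentSource = InputContentDataSource | InputContentUrlSource;

type TextInputPart = { type: "text"; text: string };
type ImageInputPart<TMetadata = unknown> = {
  type: "image";
  source: InputContentSource;
  metadata?: TMetadata;
};
// AudioInputPart, VideoInputPart, and DocumentInputPart follow the same
// shape as ImageInputPart, with their respective `type` discriminators.

type InputContentPart = TextInputPart | ImageInputPart;

// UserMessage.content widens from `string` to `string | InputContentPart[]`.
interface UserMessage {
  id: string;
  role: "user";
  content: string | InputContentPart[];
}

// Plain strings remain valid, so existing apps keep working.
const example: UserMessage = { id: "msg-1", role: "user", content: "hello" };
```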
Modality Type
The Modality type enumerates the supported content modalities:
| Value | Description |
|---|---|
| "text" | Plain text content |
| "image" | Image content (JPEG, PNG, GIF, WebP, etc.) |
| "audio" | Audio content (WAV, MP3, OGG, etc.) |
| "video" | Video content (MP4, WebM, etc.) |
| "document" | Document content (PDF, DOCX, XLSX, etc.) |
Source Types
Every non-text content part carries a source property that describes how the content is delivered. The source is a discriminated union with two variants:
InputContentDataSource
Inline base64-encoded content.

| Property | Type | Required | Description |
|---|---|---|---|
| type | "data" | ✓ | Discriminator for inline data |
| value | string | ✓ | Base64-encoded content |
| mimeType | string | ✓ | MIME type (required to ensure correct handling) |
InputContentUrlSource
URL-referenced content.

| Property | Type | Required | Description |
|---|---|---|---|
| type | "url" | ✓ | Discriminator for URL reference |
| value | string | ✓ | HTTP(S) URL or data URI |
| mimeType | string | | Optional MIME type hint |
Content Part Types
TextInputPart
Represents plain text content within a multimodal message.

| Property | Type | Description |
|---|---|---|
| type | "text" | Identifies this as text content |
| text | string | The text content |
ImageInputPart
Represents image content. Maps directly to provider image inputs (e.g., OpenAI vision, Anthropic image blocks).

| Property | Type | Description |
|---|---|---|
| type | "image" | Identifies this as image content |
| source | InputContentSource | Either inline data or URL reference |
| metadata | TMetadata? | Provider-specific metadata (e.g., OpenAI detail level) |
AudioInputPart
Represents audio content.

| Property | Type | Description |
|---|---|---|
| type | "audio" | Identifies this as audio content |
| source | InputContentSource | Either inline data or URL reference |
| metadata | TMetadata? | Provider-specific metadata (e.g., format, sample rate) |
VideoInputPart
Represents video content.

| Property | Type | Description |
|---|---|---|
| type | "video" | Identifies this as video content |
| source | InputContentSource | Either inline data or URL reference |
| metadata | TMetadata? | Provider-specific metadata (e.g., duration, resolution) |
DocumentInputPart
Represents document content such as PDFs, Word documents, or spreadsheets.

| Property | Type | Description |
|---|---|---|
| type | "document" | Identifies this as document content |
| source | InputContentSource | Either inline data or URL reference |
| metadata | TMetadata? | Provider-specific metadata (e.g., Anthropic media_type) |
Provider Metadata
The generic metadata field on each content part allows provider-specific
information to flow through the protocol without polluting the core schema.
Examples:
- OpenAI: ImageInputPart<{ detail: 'auto' | 'low' | 'high' }>
- Anthropic: DocumentInputPart<{ media_type: 'application/pdf' }>
- Custom: Any provider can define its own metadata shape
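The OpenAI detail-level example above could be expressed like this. The type names follow this proposal; the metadata shape is illustrative, and the URL is a placeholder.

```typescript
// Minimal local copies of the proposed types for illustration.
type InputContentSource =
  | { type: "data"; value: string; mimeType: string }
  | { type: "url"; value: string; mimeType?: string };

type ImageInputPart<TMetadata = unknown> = {
  type: "image";
  source: InputContentSource;
  metadata?: TMetadata;
};

// OpenAI-style detail-level metadata, as listed above.
type OpenAIImageMetadata = { detail: "auto" | "low" | "high" };

const part: ImageInputPart<OpenAIImageMetadata> = {
  type: "image",
  source: { type: "url", value: "https://example.com/chart.png" },
  metadata: { detail: "high" },
};
```

The generic parameter lets the SDK type-check provider metadata without the core schema knowing anything about individual providers.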
Implementation Examples
Simple Text Message (Backward Compatible)
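A plain-string message continues to work unchanged. A sketch (the envelope fields such as `id` are illustrative):

```typescript
// Backward compatible: `content` stays a plain string.
const message = {
  id: "msg-1",
  role: "user" as const,
  content: "What is the weather like today?",
};
```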
Image with Inline Data
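A sketch of an image delivered inline via a "data" source (the base64 payload is truncated and the envelope fields are illustrative):

```typescript
// Image bytes delivered inline as base64 with a required mimeType.
const imagePart = {
  type: "image" as const,
  source: {
    type: "data" as const,
    value: "iVBORw0KGgo...", // base64-encoded PNG bytes (truncated here)
    mimeType: "image/png",
  },
};

const message = {
  id: "msg-2",
  role: "user" as const,
  content: [
    { type: "text" as const, text: "What is shown in this image?" },
    imagePart,
  ],
};
```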
Image with URL Reference
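The same image part using a "url" source instead (URL and envelope fields are illustrative):

```typescript
// Image referenced by URL; the mimeType is an optional hint here.
const imagePart = {
  type: "image" as const,
  source: {
    type: "url" as const,
    value: "https://example.com/photo.jpg",
    mimeType: "image/jpeg",
  },
};

const message = {
  id: "msg-3",
  role: "user" as const,
  content: [
    { type: "text" as const, text: "Describe this photo." },
    imagePart,
  ],
};
```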
Multiple Images with Question
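Multiple parts of the same modality can appear in one message. A sketch with two URL-referenced images and a question (URLs illustrative):

```typescript
// Small helper (illustrative) to build URL-referenced image parts.
const makeImage = (url: string) => ({
  type: "image" as const,
  source: { type: "url" as const, value: url },
});

const message = {
  id: "msg-4",
  role: "user" as const,
  content: [
    { type: "text" as const, text: "Which of these two charts shows higher growth?" },
    makeImage("https://example.com/chart-a.png"),
    makeImage("https://example.com/chart-b.png"),
  ],
};
```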
Audio Transcription Request
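A sketch of inline audio with provider metadata (the metadata shape — format and sample rate — is an illustrative assumption, as is the truncated payload):

```typescript
// Inline WAV audio; metadata shape is provider-specific and illustrative.
const audioPart = {
  type: "audio" as const,
  source: {
    type: "data" as const,
    value: "UklGRiQAAABXQVZF...", // base64-encoded WAV bytes (truncated here)
    mimeType: "audio/wav",
  },
  metadata: { format: "wav", sampleRate: 16000 },
};

const message = {
  id: "msg-5",
  role: "user" as const,
  content: [
    { type: "text" as const, text: "Please transcribe this recording." },
    audioPart,
  ],
};
```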
Document Analysis
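A sketch of a URL-referenced PDF with Anthropic-style metadata (per the Provider Metadata section above; URL illustrative):

```typescript
// PDF referenced by URL, with Anthropic-style media_type metadata.
const documentPart = {
  type: "document" as const,
  source: { type: "url" as const, value: "https://example.com/report.pdf" },
  metadata: { media_type: "application/pdf" },
};

const message = {
  id: "msg-6",
  role: "user" as const,
  content: [
    { type: "text" as const, text: "Summarize the key findings of this report." },
    documentPart,
  ],
};
```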
Video Analysis
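A sketch of a URL-referenced video part (URL and envelope fields illustrative):

```typescript
// Video referenced by URL with an optional mimeType hint.
const videoPart = {
  type: "video" as const,
  source: {
    type: "url" as const,
    value: "https://example.com/clip.mp4",
    mimeType: "video/mp4",
  },
};

const message = {
  id: "msg-7",
  role: "user" as const,
  content: [
    { type: "text" as const, text: "Describe what happens in this clip." },
    videoPart,
  ],
};
```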
Mixed Modalities
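Because each part carries its own `type`, different modalities combine freely in one message. A sketch mixing text, an image, and a document (all URLs illustrative):

```typescript
// Text, an image, and a document combined in a single user message.
const message = {
  id: "msg-8",
  role: "user" as const,
  content: [
    {
      type: "text" as const,
      text: "Does this screenshot match the spec in the attached PDF?",
    },
    {
      type: "image" as const,
      source: { type: "url" as const, value: "https://example.com/screenshot.png" },
    },
    {
      type: "document" as const,
      source: { type: "url" as const, value: "https://example.com/spec.pdf" },
      metadata: { media_type: "application/pdf" },
    },
  ],
};
```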
Implementation Considerations
Client SDK Changes
TypeScript SDK:
- New Modality type and all InputContentPart types in @ag-ui/core
- InputContentSource, InputContentDataSource, InputContentUrlSource types
- Updated UserMessage with content: string | InputContentPart[]
- Helper methods for constructing typed content parts
- Provider-specific metadata generics on each content part type

Python SDK:
- Pydantic models for each content part type (TextInputPart, ImageInputPart, etc.)
- InputContentSource discriminated union
- Updated UserMessage model
- Provider-specific metadata support via generics
Framework Integration
Frameworks need to:
- Parse typed InputContentPart parts and dispatch on part.type
- Map content parts to provider-specific formats (the typed structure makes this straightforward)
- Use source.type to determine whether to send inline data or a URL to the provider
- Forward metadata to providers that support it
- Handle fallbacks for models that don’t support certain modalities
- Validate that mimeType is appropriate for the declared content part type
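The dispatch-and-map step could look roughly like this. The `toOpenAIContent` function and its target shape are hypothetical, loosely modeled on OpenAI-style chat content items; a real integration would cover all modalities and forward metadata.

```typescript
// Minimal local copies of the proposed types for illustration.
type InputContentSource =
  | { type: "data"; value: string; mimeType: string }
  | { type: "url"; value: string; mimeType?: string };
type TextInputPart = { type: "text"; text: string };
type ImageInputPart = { type: "image"; source: InputContentSource; metadata?: unknown };
type InputContentPart = TextInputPart | ImageInputPart;

// Hypothetical mapper: dispatch on part.type, then on source.type.
function toOpenAIContent(part: InputContentPart) {
  switch (part.type) {
    case "text":
      return { type: "text", text: part.text };
    case "image": {
      // "data" sources become data URIs; "url" sources pass through.
      const url =
        part.source.type === "data"
          ? `data:${part.source.mimeType};base64,${part.source.value}`
          : part.source.value;
      return { type: "image_url", image_url: { url } };
    }
  }
}

const mapped = toOpenAIContent({
  type: "image",
  source: { type: "url", value: "https://example.com/photo.jpg" },
});
```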
Use Cases
Visual Question Answering
Users can upload images (ImageInputPart) and ask questions about them.
Document Processing
Upload PDFs, Word documents, or spreadsheets (DocumentInputPart) for analysis.
Audio Transcription and Analysis
Process voice recordings, podcasts, or meeting audio (AudioInputPart).
Video Understanding
Analyze video content (VideoInputPart) for summaries, descriptions, or content
moderation.
Multi-modal Comparison
Compare multiple images, documents, or mixed media using different content part types in a single message.
Screenshot Analysis
Share screenshots (ImageInputPart) for UI/UX feedback or debugging assistance.
Testing Strategy
- Unit tests for each InputContentPart type and InputContentSource variant
- Validate that the source.type discriminator correctly narrows the union
- Integration tests with multimodal LLMs (OpenAI, Anthropic, Google)
- Backward compatibility tests with plain string content
- Verify metadata passthrough for provider-specific fields
- Performance tests for large base64 payloads in InputContentDataSource
- Security tests for URL validation and content sanitization
- Type-safety tests ensuring the generic TMetadata works across SDKs