Multi-modal Messages Proposal
Summary
Problem Statement
The current AG-UI protocol supports only text-based user messages. As LLMs increasingly support multimodal inputs (images, audio, files), the protocol needs to evolve to handle these richer input types.
Motivation
Evolve AG-UI to support multimodal input messages without breaking existing apps. Inputs may include text, images, audio, and files.
Status
- Status: Draft
- Author(s): Markus Ecker (mail@mme.xyz)
Detailed Specification
Overview
Extend the `UserMessage` `content` property to be either a string or an array of `InputContent` items.
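The shape described above can be sketched as TypeScript types. This is a minimal sketch based on the property tables in this proposal, not the final SDK definition: the `id` and `role` fields on `UserMessage` are assumed from the existing message shape, and `normalizeContent` is a hypothetical helper added only to show how both forms of `content` can be handled uniformly.

```typescript
// Sketch of the proposed types; field names follow the tables in this proposal.
interface TextInputContent {
  type: "text";
  text: string;
}

interface BinaryInputContent {
  type: "binary";
  mimeType: string;
  id?: string;
  url?: string;
  data?: string; // base64-encoded
  filename?: string;
}

type InputContent = TextInputContent | BinaryInputContent;

// Assumption: only `content` is specified here; `id` and `role` are illustrative.
interface UserMessage {
  id: string;
  role: "user";
  content: string | InputContent[];
}

// Hypothetical helper: treat a legacy string as a single text part.
function normalizeContent(content: string | InputContent[]): InputContent[] {
  return typeof content === "string" ? [{ type: "text", text: content }] : content;
}
```

Normalizing early lets downstream code work with one representation while plain-string messages from existing clients keep working.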
InputContent Types
TextInputContent
Represents text content within a multimodal message.

| Property | Type | Description |
|---|---|---|
| `type` | `"text"` | Identifies this as text content |
| `text` | `string` | The text content |
BinaryInputContent
Represents binary content such as images, audio, or files.

| Property | Type | Description |
|---|---|---|
| `type` | `"binary"` | Identifies this as binary content |
| `mimeType` | `string` | MIME type of the content (e.g., `"image/jpeg"`, `"audio/wav"`) |
| `id` | `string?` | Optional identifier for content reference |
| `url` | `string?` | Optional URL to fetch the content |
| `data` | `string?` | Optional base64-encoded content |
| `filename` | `string?` | Optional filename for the content |
Content Delivery Methods
Binary content can be provided through multiple methods:
- Inline Data: Base64-encoded in the `data` field
- URL Reference: External URL in the `url` field
- ID Reference: Reference to pre-uploaded content via the `id` field

At least one of `data`, `url`, or `id` must be provided for binary content.
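The requirement that binary content carries at least one delivery field could be checked as follows. This is a sketch only: `hasContentSource` is a hypothetical name, and treating an empty string as "not provided" is an assumption the proposal does not state.

```typescript
interface BinaryInputContent {
  type: "binary";
  mimeType: string;
  id?: string;
  url?: string;
  data?: string;
  filename?: string;
}

// Hypothetical validator: true when data, url, or id is present.
// Assumption: empty strings count as missing.
function hasContentSource(part: BinaryInputContent): boolean {
  return [part.data, part.url, part.id].some(
    (value) => typeof value === "string" && value.length > 0,
  );
}
```

A receiver might reject a message up front when this check fails for any binary part, rather than discovering the missing payload when forwarding to a model.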
Implementation Examples
Simple Text Message (Backward Compatible)
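A plain string remains valid `content`, so existing clients need no changes. In this sketch the `id` value is illustrative, and the `id` and `role` fields are assumed from the existing message shape.

```typescript
// Text-only message: the same shape clients send today.
const textMessage = {
  id: "msg-001", // illustrative value
  role: "user",
  content: "What's the weather like today?",
};
```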
Image with Text
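One way such a message might look, combining a text part with an inline image. The `data` value is a placeholder rather than real image bytes, and the `id` is illustrative.

```typescript
// Text question plus one inline (base64) image.
const imageMessage = {
  id: "msg-002", // illustrative value
  role: "user",
  content: [
    { type: "text", text: "What's in this image?" },
    {
      type: "binary",
      mimeType: "image/jpeg",
      // Placeholder: a real client would put the base64-encoded bytes here.
      data: "<base64-encoded JPEG>",
    },
  ],
};
```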
Multiple Images with Question
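A sketch of a message that references several images by URL alongside a question. The URLs use the reserved `example.com` domain and are purely illustrative.

```typescript
// One question followed by two images delivered by URL reference.
const comparisonMessage = {
  id: "msg-003", // illustrative value
  role: "user",
  content: [
    { type: "text", text: "What are the differences between these images?" },
    { type: "binary", mimeType: "image/png", url: "https://example.com/image1.png" },
    { type: "binary", mimeType: "image/png", url: "https://example.com/image2.png" },
  ],
};
```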
Audio Transcription Request
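For audio, an ID reference avoids sending a large payload inline. The sketch below assumes the recording was uploaded beforehand and that the upload returned the identifier shown; the proposal does not define that upload step, so the `id` and `filename` values are hypothetical.

```typescript
// Transcription request that points at pre-uploaded audio by id.
const audioMessage = {
  id: "msg-004", // illustrative value
  role: "user",
  content: [
    { type: "text", text: "Please transcribe this audio recording." },
    {
      type: "binary",
      mimeType: "audio/wav",
      id: "audio-upload-123", // hypothetical identifier from an earlier upload
      filename: "meeting-recording.wav",
    },
  ],
};
```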
Document Analysis
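A document can be attached the same way as an image, with `filename` giving the receiver a human-readable label. The URL and file name below are illustrative.

```typescript
// Summarization request for a PDF delivered by URL reference.
const documentMessage = {
  id: "msg-005", // illustrative value
  role: "user",
  content: [
    { type: "text", text: "Summarize the key points from this PDF." },
    {
      type: "binary",
      mimeType: "application/pdf",
      url: "https://example.com/documents/report.pdf",
      filename: "report.pdf",
    },
  ],
};
```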
Implementation Considerations
Client SDK Changes
TypeScript SDK:
- Extended `UserMessage` type in `@ag-ui/core`
- Content validation utilities
- Helper methods for constructing multimodal messages
- Binary content encoding/decoding utilities

Python SDK:
- Extended `UserMessage` class
- Content type validation
- Multimodal message builders
- Binary content handling utilities
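The builder and encoding utilities listed above might look roughly like this on the TypeScript side. All function names (`encodeBase64`, `textPart`, `binaryFromBytes`) are hypothetical and not part of `@ag-ui/core`. The encoder is written out by hand so the sketch has no runtime dependency; in Node.js, `Buffer.from(bytes).toString("base64")` produces the same output.

```typescript
interface TextInputContent {
  type: "text";
  text: string;
}

interface BinaryInputContent {
  type: "binary";
  mimeType: string;
  id?: string;
  url?: string;
  data?: string;
  filename?: string;
}

const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Standard base64 with "=" padding, processing three bytes at a time.
function encodeBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const hasSecond = i + 1 < bytes.length;
    const hasThird = i + 2 < bytes.length;
    const triple =
      (bytes[i] << 16) |
      ((hasSecond ? bytes[i + 1] : 0) << 8) |
      (hasThird ? bytes[i + 2] : 0);
    out += BASE64_ALPHABET.charAt((triple >> 18) & 63);
    out += BASE64_ALPHABET.charAt((triple >> 12) & 63);
    out += hasSecond ? BASE64_ALPHABET.charAt((triple >> 6) & 63) : "=";
    out += hasThird ? BASE64_ALPHABET.charAt(triple & 63) : "=";
  }
  return out;
}

// Hypothetical builder for a text part.
function textPart(text: string): TextInputContent {
  return { type: "text", text };
}

// Hypothetical builder for inline binary content from raw bytes.
function binaryFromBytes(
  bytes: Uint8Array,
  mimeType: string,
  filename?: string,
): BinaryInputContent {
  const part: BinaryInputContent = {
    type: "binary",
    mimeType,
    data: encodeBase64(bytes),
  };
  if (filename !== undefined) part.filename = filename;
  return part;
}
```

A client could then assemble `content` as `[textPart("What's in this image?"), binaryFromBytes(bytes, "image/png")]` without writing the object literals by hand.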
Framework Integration
Frameworks need to:
- Parse multimodal user messages
- Forward content to LLM providers that support multimodal inputs
- Handle fallbacks for models that don’t support certain content types
- Manage content upload/storage for binary data
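As one possible fallback for models that accept only text, a framework could flatten multimodal content into a string and substitute a short placeholder for each binary part. This is a sketch under assumptions: `toTextFallback` is a hypothetical name, and the placeholder format is illustrative rather than anything the proposal prescribes.

```typescript
interface TextInputContent {
  type: "text";
  text: string;
}

interface BinaryInputContent {
  type: "binary";
  mimeType: string;
  id?: string;
  url?: string;
  data?: string;
  filename?: string;
}

type InputContent = TextInputContent | BinaryInputContent;

// Hypothetical fallback: keep text parts, describe binary parts by
// filename when available, otherwise by MIME type.
function toTextFallback(content: string | InputContent[]): string {
  if (typeof content === "string") return content;
  return content
    .map((part) =>
      part.type === "text"
        ? part.text
        : `[attachment: ${part.filename ?? part.mimeType}]`,
    )
    .join("\n");
}
```

Whether to substitute a placeholder, drop the part silently, or reject the message is a policy choice left to each framework.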
Use Cases
Visual Question Answering
Users can upload images and ask questions about them.
Document Processing
Upload PDFs, Word documents, or spreadsheets for analysis.
Audio Transcription and Analysis
Process voice recordings, podcasts, or meeting audio.
Multi-document Comparison
Compare multiple images, documents, or mixed media.
Screenshot Analysis
Share screenshots for UI/UX feedback or debugging assistance.
Testing Strategy
- Unit tests for content type validation
- Integration tests with multimodal LLMs
- Backward compatibility tests with string content
- Performance tests for large binary payloads
- Security tests for content validation and sanitization