Multi-modal Messages Proposal

Summary

Problem Statement

The current AG-UI protocol supports only text-based user messages. As LLMs increasingly accept multimodal inputs (images, audio, files), the protocol needs to evolve to carry these richer input types.

Motivation

Evolve AG-UI to support multimodal input messages without breaking existing apps. Inputs may include text, images, audio, and files.

Status

Detailed Specification

Overview

Extend the UserMessage content property to be either a string or an array of InputContent:
interface TextInputContent {
  type: "text"
  text: string
}

interface BinaryInputContent {
  type: "binary"
  mimeType: string
  id?: string
  url?: string
  data?: string
  filename?: string
}

type InputContent = TextInputContent | BinaryInputContent

type UserMessage = {
  id: string
  role: "user"
  content: string | InputContent[]
  name?: string
}
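
Because content may now be either a string or an array, consumers will likely want a single shape to work with. The following is a minimal sketch, not part of the proposal: normalizeContent and textOf are hypothetical helper names, and joining text parts with a newline is an assumption made for illustration.

```typescript
interface TextInputContent {
  type: "text"
  text: string
}

interface BinaryInputContent {
  type: "binary"
  mimeType: string
  id?: string
  url?: string
  data?: string
  filename?: string
}

type InputContent = TextInputContent | BinaryInputContent

type UserMessage = {
  id: string
  role: "user"
  content: string | InputContent[]
  name?: string
}

// Hypothetical helper: wrap legacy string content in a single text part
// so that callers always receive an InputContent[].
function normalizeContent(message: UserMessage): InputContent[] {
  if (typeof message.content === "string") {
    return [{ type: "text", text: message.content }]
  }
  return message.content
}

// Hypothetical helper: collect only the text parts, skipping binary parts.
// Joining with "\n" is an assumption, not something the proposal specifies.
function textOf(message: UserMessage): string {
  return normalizeContent(message)
    .filter((part): part is TextInputContent => part.type === "text")
    .map((part) => part.text)
    .join("\n")
}
```

With this approach, existing string-only messages and new array-based messages flow through the same code path.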

InputContent Types

TextInputContent

Represents text content within a multimodal message.
interface TextInputContent {
  type: "text"
  text: string
}
Property  Type    Description
type      "text"  Identifies this as text content
text      string  The text content

BinaryInputContent

Represents binary content such as images, audio, or files.
interface BinaryInputContent {
  type: "binary"
  mimeType: string
  id?: string
  url?: string
  data?: string
  filename?: string
}
Property  Type      Description
type      "binary"  Identifies this as binary content
mimeType  string    MIME type of the content (e.g., “image/jpeg”, “audio/wav”)
id        string?   Optional identifier for content reference
url       string?   Optional URL to fetch the content
data      string?   Optional base64-encoded content
filename  string?   Optional filename for the content

Content Delivery Methods

Binary content can be delivered in three ways:
  1. Inline Data: Base64-encoded content in the data field
  2. URL Reference: An external URL in the url field
  3. ID Reference: A reference to pre-uploaded content via the id field
At least one of data, url, or id must be provided for each binary content item.
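
The "at least one of data, url, or id" rule cannot be expressed by the interface alone, since all three fields are optional, so it has to be checked at runtime. The sketch below shows one way to do that. validateBinaryContent is a hypothetical name, and treating empty strings as missing is an assumption the proposal does not state.

```typescript
interface BinaryInputContent {
  type: "binary"
  mimeType: string
  id?: string
  url?: string
  data?: string
  filename?: string
}

// Hypothetical validator: returns a list of problems, empty when the part is valid.
function validateBinaryContent(part: BinaryInputContent): string[] {
  const errors: string[] = []

  if (part.mimeType.trim() === "") {
    errors.push("mimeType must be a non-empty string")
  }

  // Assumption: an empty string does not count as a provided source.
  const hasSource = [part.data, part.url, part.id].some(
    (value) => typeof value === "string" && value.length > 0
  )
  if (!hasSource) {
    errors.push("at least one of data, url, or id must be provided")
  }

  return errors
}
```

Returning a list of errors rather than throwing lets a caller report every problem in a message at once; a throwing variant would work equally well.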

Implementation Examples

Simple Text Message (Backward Compatible)

{
  "id": "msg-001",
  "role": "user",
  "content": "What's in this image?"
}

Image with Text

{
  "id": "msg-002",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What's in this image?"
    },
    {
      "type": "binary",
      "mimeType": "image/jpeg",
      "data": "base64-encoded-image-data..."
    }
  ]
}

Multiple Images with Question

{
  "id": "msg-003",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What are the differences between these images?"
    },
    {
      "type": "binary",
      "mimeType": "image/png",
      "url": "https://example.com/image1.png"
    },
    {
      "type": "binary",
      "mimeType": "image/png",
      "url": "https://example.com/image2.png"
    }
  ]
}

Audio Transcription Request

{
  "id": "msg-004",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Please transcribe this audio recording"
    },
    {
      "type": "binary",
      "mimeType": "audio/wav",
      "filename": "meeting-recording.wav",
      "id": "audio-upload-123"
    }
  ]
}

Document Analysis

{
  "id": "msg-005",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Summarize the key points from this PDF"
    },
    {
      "type": "binary",
      "mimeType": "application/pdf",
      "filename": "quarterly-report.pdf",
      "url": "https://example.com/reports/q4-2024.pdf"
    }
  ]
}

Implementation Considerations

Client SDK Changes

TypeScript SDK:
  • Extended UserMessage type in @ag-ui/core
  • Content validation utilities
  • Helper methods for constructing multimodal messages
  • Binary content encoding/decoding utilities
Python SDK:
  • Extended UserMessage class
  • Content type validation
  • Multimodal message builders
  • Binary content handling utilities
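
To illustrate what the "helper methods for constructing multimodal messages" in the TypeScript SDK could look like, here is a minimal sketch. The names textPart, binaryPart, and userMessage are hypothetical and are not taken from @ag-ui/core; the types are repeated so the example is self-contained.

```typescript
interface TextInputContent {
  type: "text"
  text: string
}

interface BinaryInputContent {
  type: "binary"
  mimeType: string
  id?: string
  url?: string
  data?: string
  filename?: string
}

type InputContent = TextInputContent | BinaryInputContent

type UserMessage = {
  id: string
  role: "user"
  content: string | InputContent[]
  name?: string
}

// Hypothetical builder for a text part.
function textPart(text: string): TextInputContent {
  return { type: "text", text }
}

// Hypothetical builder for a binary part. `source` carries whichever of
// data, url, or id the caller has; filename is only set when supplied.
function binaryPart(
  mimeType: string,
  source: { data?: string; url?: string; id?: string },
  filename?: string
): BinaryInputContent {
  const part: BinaryInputContent = { type: "binary", mimeType, ...source }
  if (filename !== undefined) {
    part.filename = filename
  }
  return part
}

// Hypothetical builder for a multimodal user message.
function userMessage(id: string, ...content: InputContent[]): UserMessage {
  return { id, role: "user", content }
}

// Rebuilds the audio transcription example from this proposal.
const transcriptionRequest = userMessage(
  "msg-004",
  textPart("Please transcribe this audio recording"),
  binaryPart("audio/wav", { id: "audio-upload-123" }, "meeting-recording.wav")
)
```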

Framework Integration

Frameworks need to:
  • Parse multimodal user messages
  • Forward content to LLM providers that support multimodal inputs
  • Handle fallbacks for models that don’t support certain content types
  • Manage content upload/storage for binary data
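
One possible fallback strategy for models that lack support for certain content types is to replace each unsupported binary part with a short text placeholder, so the model at least knows an attachment was present. The sketch below assumes the framework knows which MIME type prefixes the target model accepts; applyFallback, the supportedMimePrefixes parameter, and the placeholder wording are all hypothetical.

```typescript
interface TextInputContent {
  type: "text"
  text: string
}

interface BinaryInputContent {
  type: "binary"
  mimeType: string
  id?: string
  url?: string
  data?: string
  filename?: string
}

type InputContent = TextInputContent | BinaryInputContent

// Hypothetical fallback: keep text parts and supported binary parts,
// and swap unsupported binary parts for a descriptive text part.
function applyFallback(
  content: InputContent[],
  supportedMimePrefixes: string[] // e.g. ["image/"] for an image-only model
): InputContent[] {
  return content.map((part): InputContent => {
    if (part.type === "text") {
      return part
    }
    const supported = supportedMimePrefixes.some((prefix) =>
      part.mimeType.startsWith(prefix)
    )
    if (supported) {
      return part
    }
    // Describe the attachment with whatever identifying field is available.
    const label = part.filename ?? part.id ?? part.url ?? "inline data"
    return {
      type: "text",
      text: `[Unsupported attachment omitted: ${part.mimeType}, ${label}]`,
    }
  })
}
```

Other strategies, such as rejecting the message or routing it to a different model, are equally valid; the placeholder approach is shown only because it is the simplest to illustrate.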

Use Cases

Visual Question Answering

Users can upload images and ask questions about them.

Document Processing

Upload PDFs, Word documents, or spreadsheets for analysis.

Audio Transcription and Analysis

Process voice recordings, podcasts, or meeting audio.

Multi-document Comparison

Compare multiple images, documents, or mixed media.

Screenshot Analysis

Share screenshots for UI/UX feedback or debugging assistance.

Testing Strategy

  • Unit tests for content type validation
  • Integration tests with multimodal LLMs
  • Backward compatibility tests with string content
  • Performance tests for large binary payloads
  • Security tests for content validation and sanitization
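
For the large-payload performance tests, it can be useful to know how many bytes an inline data value represents without decoding it. Base64 encodes every 3 bytes as 4 characters, so the decoded size follows from the string length and its trailing padding. The sketch below assumes standard base64 with no whitespace or line breaks; estimateDecodedBytes is a hypothetical helper name.

```typescript
// Hypothetical helper: number of bytes a base64 string decodes to.
// Assumes standard base64 (padded or unpadded) with no whitespace or line breaks.
function estimateDecodedBytes(base64: string): number {
  const length = base64.length
  if (length === 0) {
    return 0
  }
  // Count up to two trailing "=" padding characters.
  let padding = 0
  if (base64[length - 1] === "=") {
    padding += 1
    if (length > 1 && base64[length - 2] === "=") {
      padding += 1
    }
  }
  // Every 4 characters carry 3 bytes; each padding character removes one byte.
  return Math.floor((length * 3) / 4) - padding
}
```

A test can use this to generate payloads of known sizes, or a server can use it to reject oversized inline content before allocating a decode buffer.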
