Client → Server Messages

Overview

This reference documents the WebSocket messages you send to control the avatar server. These messages are simple JSON commands for audio playback, interruption, and session management.

Important: WebSocket is for Commands Only

The WebSocket connection is used exclusively for sending audio commands to the avatar server. It is NOT used for WebRTC signaling or video streaming. Video is delivered through your Daily.co room.

Architecture:

Daily.co Room → Where you receive avatar video (WebRTC handled for you)
WebSocket → Where you send audio commands (this reference)

Message Format

All messages must be valid JSON with a type field:

{
  "type": "message_type",
  "field1": "value1",
  "field2": "value2"
}
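As a minimal sketch of the envelope described above, the helper below serializes a command and guarantees the type field is present. The names buildMessage and sendMessage are hypothetical, and ws is assumed to be any object with a WebSocket-style send(string) method.

```javascript
// Hypothetical helper: serialize a command as JSON with a mandatory `type` field.
function buildMessage(type, fields = {}) {
    if (typeof type !== 'string' || type.length === 0) {
        throw new TypeError('message type must be a non-empty string');
    }
    // Spread the extra fields first so the explicit `type` argument always wins.
    return JSON.stringify({ ...fields, type });
}

// Send a command over any object exposing a WebSocket-style send(string) method.
function sendMessage(ws, type, fields) {
    ws.send(buildMessage(type, fields));
}
```

For example, sendMessage(ws, 'agent.interrupt') would transmit {"type":"agent.interrupt"}.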

Message Types

1. Session Initialize

Initialize a new session with credentials. Must be sent first after connecting.

{
  "type": "session.init",
  "config": {
    "session_id": "your-session-id",
    "access_token": "your-access-token",
    "room": {
      "platform": "daily",
      "url": "https://agenthuman.daily.co/your-room-name",
      "token": "your-daily-token",
      "display_name": "AI Avatar (AH)"
    },
    "video_width": 1280,
    "video_height": 720
  }
}

Required Fields:

session_id - Your AgentHuman session ID
access_token - Your session access token
room.platform - Must be daily
room.url - Daily.co room URL from session.daily_room.url
room.token - Daily.co meeting token from session.daily_room.token

Optional Fields:

room.display_name - Display name used when the avatar joins the Daily room
video_width - Video output width in pixels (default: 1280)
video_height - Video output height in pixels (default: 720)
Supported aspect ratios: 18:9, 16:9, 5:3, 16:10, 3:2, 4:3, 1:1 (landscape or portrait)

Response: Server will send connection.established on success.
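The sketch below shows one way to assemble a session.init message and catch dimension problems before sending. The function and parameter names are hypothetical, and the exact-match ratio check is an assumption: the server may apply its own rules or tolerances.

```javascript
// Aspect ratios listed above, as [width, height] pairs in landscape form.
const SUPPORTED_RATIOS = [[18, 9], [16, 9], [5, 3], [16, 10], [3, 2], [4, 3], [1, 1]];

// True if width:height matches a listed ratio in landscape or portrait.
// Assumption: ratios must match exactly.
function isSupportedAspectRatio(width, height) {
    const longSide = Math.max(width, height);
    const shortSide = Math.min(width, height);
    // Cross-multiply so the comparison stays in integer arithmetic.
    return SUPPORTED_RATIOS.some(([w, h]) => longSide * h === shortSide * w);
}

// Build the session.init payload from the documented required and optional fields.
function buildSessionInit({ sessionId, accessToken, roomUrl, roomToken,
                            displayName, width = 1280, height = 720 }) {
    const required = { sessionId, accessToken, roomUrl, roomToken };
    for (const [name, value] of Object.entries(required)) {
        if (!value) throw new Error(`missing required field: ${name}`);
    }
    if (!Number.isInteger(width) || !Number.isInteger(height) || width <= 0 || height <= 0) {
        throw new Error('video_width and video_height must be positive integers');
    }
    if (!isSupportedAspectRatio(width, height)) {
        throw new Error(`unsupported aspect ratio: ${width}x${height}`);
    }
    const room = { platform: 'daily', url: roomUrl, token: roomToken };
    if (displayName) room.display_name = displayName; // optional field
    return {
        type: 'session.init',
        config: {
            session_id: sessionId,
            access_token: accessToken,
            room,
            video_width: width,
            video_height: height,
        },
    };
}
```

The result can be passed to JSON.stringify and sent as the first message on the socket.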

2. Send Audio for Video Generation

Send audio data to generate the talking-head video. Audio must be raw 16-bit mono PCM, encoded as a base64 string.

{
  "type": "agent.speak",
  "audio": "base64_encoded_raw_pcm_audio",
  "sample_rate": 48000
}

Required Fields:

audio - Base64-encoded raw PCM audio bytes

Optional Fields:

sample_rate - The sample rate of the PCM audio you’re sending. Defaults to 16000. If you send 48kHz audio, set sample_rate: 48000.

Audio Format Requirements:

Sample Rate: 16kHz (16000 Hz) by default, or provide sample_rate (e.g. 48000)
Format: Raw PCM, 16-bit signed integer (not WAV file)
Channels: Mono (single channel)
Encoding: Base64 string
Max Size: 100 MB per message
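Because the payload is raw 16-bit mono PCM, its sample count and playback duration follow directly from its byte length. The sketch below derives both from a base64 string, which can help when checking that sample_rate matches the audio. The helper names are hypothetical, and standard padded base64 is assumed.

```javascript
const BYTES_PER_SAMPLE = 2; // 16-bit mono PCM

// Number of raw bytes represented by a padded base64 string.
function base64DecodedLength(b64) {
    if (b64.length % 4 !== 0) {
        throw new Error('expected padded base64 (length must be a multiple of 4)');
    }
    const padding = b64.endsWith('==') ? 2 : b64.endsWith('=') ? 1 : 0;
    return (b64.length / 4) * 3 - padding;
}

// Number of 16-bit samples in the payload. An odd byte count means a
// sample was cut in half, so it is rejected.
function pcmSampleCount(b64) {
    const bytes = base64DecodedLength(b64);
    if (bytes % BYTES_PER_SAMPLE !== 0) {
        throw new Error('PCM byte length is not a multiple of 2');
    }
    return bytes / BYTES_PER_SAMPLE;
}

// Playback duration in seconds at the given sample rate (16000 by default).
function pcmDurationSeconds(b64, sampleRate = 16000) {
    return pcmSampleCount(b64) / sampleRate;
}
```

If the computed duration is three times longer than expected, the audio is probably 48 kHz and was sent without sample_rate: 48000.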

Response: Server will send agent.speak.confirmed with the number of audio samples received.

Examples:

// Convert AudioBuffer to raw PCM
const channelData = audioBuffer.getChannelData(0); // Mono
const samples = new Int16Array(channelData.length);

// Convert float32 to int16
for (let i = 0; i < channelData.length; i++) {
    const s = Math.max(-1, Math.min(1, channelData[i]));
    samples[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
}

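// Encode to base64 in chunks: passing a large array to a single
// String.fromCharCode call can exceed the engine's argument limit
const bytes = new Uint8Array(samples.buffer);
let binary = '';
for (let i = 0; i < bytes.length; i += 0x8000) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + 0x8000));
}
const base64Audio = btoa(binary);

ws.send(JSON.stringify({
    type: 'agent.speak',
    audio: base64Audio,
    sample_rate: audioBuffer.sampleRate // must match the PCM data (e.g. 48000)
}));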

3. Interrupt Playback

Stop current video generation and playback immediately.

{
  "type": "agent.interrupt"
}

Required Fields: None

Response: Server will send agent.interrupt.confirmed.

Use Cases:

User wants to skip current speech
New urgent message needs to be displayed
Cancel ongoing generation

Examples:

function interrupt() {
    ws.send(JSON.stringify({
        type: 'agent.interrupt'
    }));
}
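To act only once the interruption has taken effect, a client can wait for agent.interrupt.confirmed before continuing. The sketch below assumes server messages arrive as JSON objects with a type field, mirroring the client format. The function names are hypothetical.

```javascript
// Invoke `callback` once when the server sends a JSON message of the given type.
// Assumption: server messages are JSON objects carrying a `type` field.
function onceMessageType(ws, type, callback) {
    function handler(event) {
        let msg;
        try {
            msg = JSON.parse(event.data);
        } catch {
            return; // ignore frames that are not JSON
        }
        if (msg && msg.type === type) {
            ws.removeEventListener('message', handler);
            callback(msg);
        }
    }
    ws.addEventListener('message', handler);
}

// Register the listener first, then send, so the confirmation cannot be missed.
function interruptAndNotify(ws, onConfirmed) {
    onceMessageType(ws, 'agent.interrupt.confirmed', onConfirmed);
    ws.send(JSON.stringify({ type: 'agent.interrupt' }));
}
```

A typical use would be interruptAndNotify(ws, () => sendNextUtterance()), where sendNextUtterance stands in for whatever the application does next.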

4. Cleanup (End Session)

There is no WebSocket session.stop message. When you’re done:

Close the WebSocket connection
Leave the Daily.co room
Call the REST endpoint POST /v1/sessions/{session_id}/end to end the session
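The three steps above can be sketched as a single cleanup routine. Several things here are assumptions rather than documented behavior: apiBaseUrl is a placeholder for the REST host, headers stands in for whatever authentication the REST API requires, and call is assumed to be a Daily call object exposing leave(). A global fetch is also assumed (browsers or a recent Node.js).

```javascript
// Build the REST request that ends a session.
// `apiBaseUrl` is a placeholder; this reference does not specify the host.
function endSessionRequest(apiBaseUrl, sessionId) {
    const base = apiBaseUrl.replace(/\/+$/, ''); // drop trailing slashes
    return {
        url: `${base}/v1/sessions/${encodeURIComponent(sessionId)}/end`,
        method: 'POST',
    };
}

// Run the cleanup steps in the documented order.
// `call` is assumed to be a Daily call object with a leave() method;
// `headers` should carry whatever auth the REST API expects (not covered here).
async function cleanupSession({ ws, call, apiBaseUrl, sessionId, headers = {} }) {
    ws.close();                 // 1. close the WebSocket connection
    await call.leave();         // 2. leave the Daily.co room
    const { url, method } = endSessionRequest(apiBaseUrl, sessionId);
    const response = await fetch(url, { method, headers }); // 3. end the session
    if (!response.ok) {
        throw new Error(`ending session failed with HTTP ${response.status}`);
    }
}
```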

Message Flow

Typical message sequence for an avatar session:

Join Daily.co room → Connect to video room to receive avatar video
Connect WebSocket → Connect to avatar server
Send session.init → Server responds with connection.established
Avatar joins Daily.co room → Avatar appears as participant
Send agent.speak (one or more times) → Server generates video and streams to Daily.co
Receive video from Daily.co → Avatar video appears in the room
Optional: Send agent.interrupt → Stop current playback
Close WebSocket and leave Daily.co room
Call POST /v1/sessions/{session_id}/end → Release resources and mark the session ended
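One way to keep a client aligned with this sequence is a small state tracker that refuses out-of-order commands before they reach the server. The class and state names below are illustrative. Holding agent.speak until connection.established arrives is a conservative client-side choice, not a documented server requirement.

```javascript
// Tracks where a connection is in the sequence above and rejects
// commands sent out of order. Names are illustrative.
class SessionFlow {
    constructor() {
        this.state = 'connected'; // WebSocket open, nothing sent yet
    }

    // Call before sending a client message; throws if it is out of order.
    beforeSend(type) {
        if (this.state === 'closed') {
            throw new Error('connection is closed');
        }
        if (this.state === 'connected') {
            if (type !== 'session.init') {
                throw new Error('First message must be session.init');
            }
            this.state = 'initializing';
            return;
        }
        if (this.state === 'initializing') {
            throw new Error(`cannot send ${type} before connection.established`);
        }
        // state === 'ready': agent.speak and agent.interrupt are allowed
    }

    // Call with the type of each message received from the server.
    onServerMessage(type) {
        if (this.state === 'initializing' && type === 'connection.established') {
            this.state = 'ready';
        }
    }

    // Call when the WebSocket closes.
    close() {
        this.state = 'closed';
    }
}
```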

Common Errors

Error Message	Cause	Solution
`First message must be session.init`	A different message was sent before `session.init`	Send `session.init` as the first message after connecting
`Invalid room platform`	Missing/invalid `room.platform`	Set `room.platform` to `daily`
`Invalid video dimensions`	Non-positive or non-numeric video size	Use positive integers for `video_width` and `video_height`
`Invalid aspect ratio`	Unsupported width/height ratio	Use one of: 18:9, 16:9, 5:3, 16:10, 3:2, 4:3, 1:1 (or portrait equivalents)
`Session not found`	Invalid or expired session	Re-establish session with `session.init`
`Invalid JSON format`	Malformed JSON	Validate JSON structure before sending
`No audio data provided`	Empty audio field	Include base64-encoded audio
`Failed to process audio`	Invalid audio format	Verify 16-bit mono PCM and set the correct `sample_rate`
`Unknown message type`	Unsupported message type	Check message type is one of the documented types
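Several of these errors can be caught on the client before a message is sent. The sketch below is a hypothetical pre-flight check that returns the matching error string from the table, or null if nothing obvious is wrong. It only mirrors the documented messages; the server's actual validation order and rules may differ.

```javascript
const KNOWN_TYPES = new Set(['session.init', 'agent.speak', 'agent.interrupt']);

// Returns the error string from the table above that the raw message would
// likely trigger, or null if no problem is detected client-side.
function preflightError(rawJson) {
    let msg;
    try {
        msg = JSON.parse(rawJson);
    } catch {
        return 'Invalid JSON format';
    }
    if (msg === null || typeof msg !== 'object' || !KNOWN_TYPES.has(msg.type)) {
        return 'Unknown message type';
    }
    if (msg.type === 'agent.speak' && !msg.audio) {
        return 'No audio data provided';
    }
    if (msg.type === 'session.init') {
        const config = msg.config || {};
        if (!config.room || config.room.platform !== 'daily') {
            return 'Invalid room platform';
        }
        for (const key of ['video_width', 'video_height']) {
            // Dimensions are optional, but must be positive integers when present.
            if (key in config && !(Number.isInteger(config[key]) && config[key] > 0)) {
                return 'Invalid video dimensions';
            }
        }
    }
    return null;
}
```

Note that a trailing comma, as in {"type": "agent.interrupt",}, is enough to produce Invalid JSON format, since strict JSON does not allow it.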
