Overview

This reference documents the WebSocket messages you send to control the avatar server. These messages are simple JSON commands for audio playback, interruption, and session management.

Important: WebSocket is for Commands Only

The WebSocket connection is used exclusively for sending audio commands to the avatar server. It is NOT used for WebRTC signaling or video streaming. Video is delivered through your Daily.co room.
Architecture:
  • Daily.co Room → Where you receive avatar video (WebRTC handled for you)
  • WebSocket → Where you send audio commands (this reference)

Message Format

All messages must be valid JSON with a type field:
{
  "type": "message_type",
  "field1": "value1",
  "field2": "value2"
}
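The message shape above can be wrapped in a small send helper. This is a minimal sketch, not part of the API: sendMessage is a hypothetical name, and ws is assumed to be any object with a send(string) method, such as an open browser WebSocket.

```javascript
// Hypothetical helper (not part of the avatar server API): serialize a
// command as JSON with a "type" field and send it over an open socket.
function sendMessage(ws, type, fields = {}) {
    if (typeof type !== 'string' || type.length === 0) {
        throw new TypeError('Every message needs a non-empty string "type"');
    }
    // Spread fields first so a stray fields.type cannot override the type.
    const payload = JSON.stringify({ ...fields, type });
    ws.send(payload);
    return payload;
}
```

For example, sendMessage(ws, 'agent.interrupt') sends a message whose only field is type.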

Message Types

1. Session Initialize

Initialize a new session with credentials. This must be the first message you send after connecting.
{
  "type": "session.init",
  "config": {
    "session_id": "your-session-id",
    "access_token": "your-access-token",
    "room": {
      "platform": "daily",
      "url": "https://agenthuman.daily.co/your-room-name",
      "token": "your-daily-token",
      "display_name": "AI Avatar (AH)"
    },
    "video_width": 1280,
    "video_height": 720
  }
}
Required Fields:
  • session_id - Your AgentHuman session ID
  • access_token - Your session access token
  • room.platform - Must be daily
  • room.url - Daily.co room URL from session.daily_room.url
  • room.token - Daily.co meeting token from session.daily_room.token
Optional Fields:
  • room.display_name - Display name used when the avatar joins the Daily room
  • video_width - Video output width in pixels (default: 1280)
  • video_height - Video output height in pixels (default: 720)
  • Supported aspect ratios for video_width and video_height: 18:9, 16:9, 5:3, 16:10, 3:2, 4:3, 1:1 (landscape or portrait)
Response: Server will send connection.established on success.
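The fields above can be assembled and checked on the client before sending. The sketch below rests on assumptions: buildSessionInit and isSupportedAspectRatio are hypothetical helper names, the session object is assumed to expose session_id and access_token next to the daily_room.url and daily_room.token mentioned above, and the aspect ratio check assumes the server expects an exact ratio match.

```javascript
// Ratios listed in this reference; portrait is handled by swapping sides.
const SUPPORTED_RATIOS = [[18, 9], [16, 9], [5, 3], [16, 10], [3, 2], [4, 3], [1, 1]];

// Assumption: the ratio must match exactly, so compare by cross-multiplication.
function isSupportedAspectRatio(width, height) {
    if (!Number.isInteger(width) || !Number.isInteger(height)) return false;
    if (width <= 0 || height <= 0) return false;
    return SUPPORTED_RATIOS.some(
        ([a, b]) => width * b === height * a || width * a === height * b
    );
}

// Hypothetical builder: `session` is assumed to carry session_id,
// access_token, and daily_room { url, token }.
function buildSessionInit(session, { width = 1280, height = 720, displayName } = {}) {
    if (!isSupportedAspectRatio(width, height)) {
        throw new RangeError(`Unsupported video size ${width}x${height}`);
    }
    const room = {
        platform: 'daily',
        url: session.daily_room.url,
        token: session.daily_room.token
    };
    if (displayName !== undefined) room.display_name = displayName;
    return {
        type: 'session.init',
        config: {
            session_id: session.session_id,
            access_token: session.access_token,
            room,
            video_width: width,
            video_height: height
        }
    };
}
```

Send the result with ws.send(JSON.stringify(buildSessionInit(session))) as the first message after the socket opens.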

2. Send Audio for Video Generation

Send audio data to generate the talking-head video. Audio must be raw 16-bit mono PCM bytes, encoded as base64.
{
  "type": "agent.speak",
  "audio": "base64_encoded_raw_pcm_audio",
  "sample_rate": 48000
}
Required Fields:
  • audio - Base64-encoded raw PCM audio bytes
Optional Fields:
  • sample_rate - The sample rate of the PCM audio you’re sending. Defaults to 16000. If you send 48kHz audio, set sample_rate: 48000.
Audio Format Requirements:
  • Sample Rate: 16kHz (16000 Hz) by default, or provide sample_rate (e.g. 48000)
  • Format: Raw PCM, 16-bit signed integer (not WAV file)
  • Channels: Mono (single channel)
  • Encoding: Base64 string
  • Max Size: 100 MB per message
Response: Server will send agent.speak.confirmed with the number of audio samples received.
Examples:
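// Convert an AudioBuffer to raw 16-bit PCM
const channelData = audioBuffer.getChannelData(0); // First channel only (mono)
const samples = new Int16Array(channelData.length);

// Convert float32 in [-1, 1] to int16
for (let i = 0; i < channelData.length; i++) {
    const s = Math.max(-1, Math.min(1, channelData[i]));
    samples[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
}

// Encode to base64 in chunks: passing the whole array to
// String.fromCharCode.apply can exceed the engine's argument limit
const bytes = new Uint8Array(samples.buffer);
let binary = '';
for (let i = 0; i < bytes.length; i += 0x8000) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + 0x8000));
}
const base64Audio = btoa(binary);

ws.send(JSON.stringify({
    type: 'agent.speak',
    audio: base64Audio,
    sample_rate: audioBuffer.sampleRate // Must match the actual rate of the samples
}));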
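Since agent.speak can be sent more than once, long audio can be split into several smaller messages instead of one large one. The sketch below is illustrative only: int16ToBase64 and buildSpeakMessages are hypothetical helpers, the one-second chunk length is an arbitrary choice, and it assumes the server plays consecutive agent.speak messages back to back, which this reference does not state.

```javascript
// Base64-encode an Int16Array view, honoring its offset into the buffer.
function int16ToBase64(samples) {
    const bytes = new Uint8Array(samples.buffer, samples.byteOffset, samples.byteLength);
    let binary = '';
    const STEP = 0x8000; // Keep each fromCharCode call below argument limits
    for (let i = 0; i < bytes.length; i += STEP) {
        binary += String.fromCharCode.apply(null, bytes.subarray(i, i + STEP));
    }
    return btoa(binary);
}

// Hypothetical helper: split PCM samples into agent.speak messages of
// roughly `chunkSeconds` each (the last one may be shorter).
function buildSpeakMessages(samples, sampleRate, chunkSeconds = 1) {
    const samplesPerChunk = Math.max(1, Math.floor(sampleRate * chunkSeconds));
    const messages = [];
    for (let start = 0; start < samples.length; start += samplesPerChunk) {
        const chunk = samples.subarray(start, start + samplesPerChunk);
        messages.push({
            type: 'agent.speak',
            audio: int16ToBase64(chunk),
            sample_rate: sampleRate
        });
    }
    return messages;
}
```

Each returned object can be passed to ws.send(JSON.stringify(message)) in order.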

3. Interrupt Playback

Stop current video generation and playback immediately.
{
  "type": "agent.interrupt"
}
Required Fields: None
Response: Server will send agent.interrupt.confirmed.
Use Cases:
  • User wants to skip current speech
  • New urgent message needs to be displayed
  • Cancel ongoing generation
Examples:
function interrupt() {
    ws.send(JSON.stringify({
        type: 'agent.interrupt'
    }));
}
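To act only once the interrupt has taken effect, you can wait for the confirmation before sending new audio. This is a sketch under assumptions: waitForMessage and interruptAndConfirm are hypothetical helpers, server messages are assumed to arrive as JSON text with a type field (mirroring the format of outgoing messages), and the 5 second timeout is arbitrary.

```javascript
// Hypothetical helper: resolve with the first server message whose
// "type" equals expectedType, or reject after timeoutMs.
function waitForMessage(ws, expectedType, timeoutMs = 5000) {
    return new Promise((resolve, reject) => {
        const timer = setTimeout(() => {
            ws.removeEventListener('message', onMessage);
            reject(new Error(`Timed out waiting for ${expectedType}`));
        }, timeoutMs);

        function onMessage(event) {
            let message;
            try {
                message = JSON.parse(event.data);
            } catch {
                return; // Ignore anything that is not JSON
            }
            if (message && message.type === expectedType) {
                clearTimeout(timer);
                ws.removeEventListener('message', onMessage);
                resolve(message);
            }
        }

        ws.addEventListener('message', onMessage);
    });
}

// Register the listener before sending so the confirmation cannot be missed.
function interruptAndConfirm(ws) {
    const confirmed = waitForMessage(ws, 'agent.interrupt.confirmed');
    ws.send(JSON.stringify({ type: 'agent.interrupt' }));
    return confirmed;
}
```

The same waitForMessage helper could be used for connection.established after session.init, under the same assumptions.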

4. Cleanup (End Session)

There is no WebSocket session.stop message. When you’re done:
  1. Close the WebSocket connection
  2. Leave the Daily.co room
  3. Call the REST endpoint POST /v1/sessions/{session_id}/end to end the session

Message Flow

Typical message sequence for an avatar session:
  1. Join Daily.co room → Connect to video room to receive avatar video
  2. Connect WebSocket → Connect to avatar server
  3. Send session.init → Server responds with connection.established
  4. Avatar joins Daily.co room → Avatar appears as participant
  5. Send agent.speak (one or more times) → Server generates video and streams to Daily.co
  6. Receive video from Daily.co → Avatar video appears in the room
  7. Optional: Send agent.interrupt → Stop current playback
  8. Close WebSocket and leave Daily.co room
  9. Call POST /v1/sessions/{session_id}/end → Release resources and mark the session ended
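The ordering in this sequence can be enforced on the client with a small state tracker. This is an illustrative sketch: createFlowGuard is a hypothetical helper, and holding back agent.speak until connection.established arrives is a conservative assumption rather than a documented server requirement.

```javascript
// Hypothetical client-side guard for the message order described above.
// States: connected -> init_sent -> ready
function createFlowGuard() {
    let state = 'connected';
    return {
        get state() { return state; },

        // Call before sending; throws if the message is out of order.
        checkSend(type) {
            if (state === 'connected') {
                if (type !== 'session.init') {
                    throw new Error('First message must be session.init');
                }
                state = 'init_sent';
            } else if (state === 'init_sent') {
                throw new Error(`Wait for connection.established before sending ${type}`);
            } else if (type !== 'agent.speak' && type !== 'agent.interrupt') {
                throw new Error(`Unexpected message type once ready: ${type}`);
            }
        },

        // Call with the type of each message received from the server.
        onServerMessage(type) {
            if (state === 'init_sent' && type === 'connection.established') {
                state = 'ready';
            }
        }
    };
}
```

Call guard.checkSend(type) just before each ws.send and guard.onServerMessage(type) from your message handler.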

Common Errors

Error Message | Cause | Solution
First message must be session.init | WebSocket initialized incorrectly | Send session.init as the first message after connecting
Invalid room platform | Missing/invalid room.platform | Set room.platform to daily
Invalid video dimensions | Non-positive or non-numeric video size | Use positive integers for video_width and video_height
Invalid aspect ratio | Unsupported width/height ratio | Use one of: 18:9, 16:9, 5:3, 16:10, 3:2, 4:3, 1:1 (or portrait equivalents)
Session not found | Invalid or expired session | Re-establish session with session.init
Invalid JSON format | Malformed JSON | Validate JSON structure before sending
No audio data provided | Empty audio field | Include base64-encoded audio
Failed to process audio | Invalid audio format | Verify 16-bit mono PCM and set the correct sample_rate
Unknown message type | Unsupported message type | Check message type is one of the documented types
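Some of the audio errors above can be caught on the client before a message is sent. The sketch below is a best-effort preflight, not the server's validation: checkSpeakMessage is a hypothetical helper, it assumes audio is standard padded base64 (as produced by btoa), and it interprets the 100 MB limit as 100,000,000 bytes of serialized JSON, which is a guess at how the limit is measured.

```javascript
// Assumption: the 100 MB limit applies to the serialized message.
const MAX_MESSAGE_BYTES = 100 * 1000 * 1000;

// Hypothetical preflight check; returns a list of problems (empty if none).
function checkSpeakMessage(message) {
    const problems = [];
    const { audio, sample_rate: sampleRate } = message;

    if (typeof audio !== 'string' || audio.length === 0) {
        problems.push('No audio data provided');
        return problems;
    }
    if (audio.length % 4 !== 0 || !/^[A-Za-z0-9+/]*={0,2}$/.test(audio)) {
        problems.push('audio is not standard padded base64');
    } else {
        // Decoded size: 3 bytes per 4 characters, minus padding
        const padding = audio.endsWith('==') ? 2 : audio.endsWith('=') ? 1 : 0;
        const byteLength = (audio.length / 4) * 3 - padding;
        if (byteLength % 2 !== 0) {
            problems.push('16-bit PCM needs an even number of bytes');
        }
    }
    if (sampleRate !== undefined && !(Number.isInteger(sampleRate) && sampleRate > 0)) {
        problems.push('sample_rate must be a positive integer');
    }
    // The message is ASCII here, so string length equals byte length
    if (JSON.stringify(message).length > MAX_MESSAGE_BYTES) {
        problems.push('Message exceeds the 100 MB limit');
    }
    return problems;
}
```

An empty result does not guarantee that the server accepts the audio; it only rules out the mistakes checked here.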