Overview

This reference documents the WebSocket messages you send to control the avatar server. These messages are simple JSON commands for audio playback, interruption and session management.

WebSocket Connection

WebSocket URL: wss://wss.agenthuman.com

Connect to this URL after creating a session via the REST API. The WebSocket server routes your connection to the appropriate backend server.

Important: WebSocket is for Commands Only

The WebSocket connection is used exclusively for sending audio commands to the avatar server. It is NOT used for WebRTC signaling or video streaming. Video is delivered through your Daily or LiveKit room.
Architecture:
  • Daily or LiveKit Room → Where you receive avatar video (WebRTC handled for you)
  • WebSocket → Where you send audio commands (this reference)

Message Format

All messages must be valid JSON with a type field:
{
  "type": "message_type",
  "field1": "value1",
  "field2": "value2"
}
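
As a minimal sketch of the format above, a small helper can guarantee that every outgoing command is valid JSON with a type field. The name buildMessage is hypothetical and not part of the Agent Human API.

```javascript
// Hypothetical helper, not part of the documented API.
// Builds a JSON command string and guarantees a string "type" field.
function buildMessage(type, fields = {}) {
    if (typeof type !== 'string' || type.length === 0) {
        throw new TypeError('message type must be a non-empty string');
    }
    // Put "type" first for readability, then make sure a stray
    // "type" key inside fields cannot override it.
    const message = { type, ...fields };
    message.type = type;
    return JSON.stringify(message);
}
```

For example, buildMessage('agent.interrupt') produces the string {"type":"agent.interrupt"}, which can be passed straight to ws.send().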

Message Types

1. Session Initialize

Initialize a new session with credentials. Must be sent first after connecting.
{
  "type": "session.init",
  "session_id": "your-session-id",
  "session_token": "your-access-token"
}
Required Fields:
  • session_id - Your Agent Human session ID (from session creation)
  • session_token - Your session access token (from session creation)
How It Works:
  1. You send the session.init message with your session ID and token
  2. The server validates your credentials and fetches the session configuration from the API
  3. The session configuration (room details, avatar video, video dimensions) is retrieved server-side
  4. The avatar joins the Daily/LiveKit room automatically
  5. Server responds with connection.established
Response: Server will send connection.established on success.
Note: All session configuration (room platform, room URL, avatar video, video dimensions, etc.) is set when you create the session via the REST API. You don’t need to send these details in the WebSocket message; the server fetches them automatically using your session credentials.
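
The handshake described above can be sketched as follows. This assumes a browser-style WebSocket (addEventListener, send) that is already open, and that server replies are JSON objects whose type field carries the response name, such as connection.established. This reference does not spell out the reply format, so treat that as an assumption. initSession and timeoutMs are illustrative names.

```javascript
// Sketch only: sends session.init and resolves once the server confirms.
// Assumption: replies are JSON objects with a "type" field.
function initSession(ws, sessionId, sessionToken, timeoutMs = 10000) {
    return new Promise((resolve, reject) => {
        const timer = setTimeout(() => {
            ws.removeEventListener('message', onMessage);
            reject(new Error('timed out waiting for connection.established'));
        }, timeoutMs);

        function onMessage(event) {
            let reply;
            try {
                reply = JSON.parse(event.data);
            } catch {
                return; // Ignore frames that are not JSON.
            }
            if (reply && reply.type === 'connection.established') {
                clearTimeout(timer);
                ws.removeEventListener('message', onMessage);
                resolve(reply);
            }
        }

        // Listen before sending so the confirmation cannot be missed.
        ws.addEventListener('message', onMessage);
        ws.send(JSON.stringify({
            type: 'session.init',
            session_id: sessionId,
            session_token: sessionToken
        }));
    });
}
```

The sketch only handles the success path and a timeout. How the server reports a failed session.init is not covered here, so add handling for that once you know the error message shape.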

2. Send Audio for Video Generation

Send audio data to generate talking head video. Audio must be 16-bit, mono PCM raw bytes encoded in base64.
{
  "type": "agent.speak",
  "audio": "base64_encoded_raw_pcm_audio",
  "sample_rate": 48000
}
Required Fields:
  • audio - Base64-encoded raw PCM audio bytes
Optional Fields:
  • sample_rate - The sample rate of the PCM audio you’re sending. Defaults to 16000. If you send 48kHz audio, set sample_rate: 48000.
Audio Format Requirements:
  • Sample Rate: 16kHz (16000 Hz) by default or provide sample_rate (e.g. 48000)
  • Format: Raw PCM, 16-bit signed integer (not WAV file)
  • Channels: Mono (single channel)
  • Encoding: Base64 string
  • Max Size: 10 MB per message
Response: Server will send agent.speak.confirmed with the number of audio samples received.
Examples:
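// Convert an AudioBuffer to raw 16-bit PCM (first channel only)
const channelData = audioBuffer.getChannelData(0); // Mono
const samples = new Int16Array(channelData.length);

// Convert float32 in [-1, 1] to int16
for (let i = 0; i < channelData.length; i++) {
    const s = Math.max(-1, Math.min(1, channelData[i]));
    samples[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
}

// Encode to base64 in chunks: passing a large array to
// String.fromCharCode.apply in a single call can exceed the argument limit
const bytes = new Uint8Array(samples.buffer);
const CHUNK_SIZE = 0x8000;
let binary = '';
for (let i = 0; i < bytes.length; i += CHUNK_SIZE) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + CHUNK_SIZE));
}
const base64Audio = btoa(binary);

ws.send(JSON.stringify({
    type: 'agent.speak',
    audio: base64Audio,
    // Must match the rate of the PCM data being sent (e.g. 48000)
    sample_rate: audioBuffer.sampleRate
}));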
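
Long recordings may need to be split across several agent.speak messages to respect the size limit. The sketch below shows one way to do that. It assumes that "10 MB" means 10 * 1024 * 1024 bytes and applies to the whole JSON message, which this reference does not state explicitly. The names maxSamplesPerMessage, splitSamples and the 1 KB envelope allowance are illustrative.

```javascript
// Assumption: the limit is 10 * 1024 * 1024 bytes for the whole JSON message.
const MAX_MESSAGE_BYTES = 10 * 1024 * 1024;
// Illustrative allowance for the JSON envelope ("type", "sample_rate", quotes).
const ENVELOPE_BYTES = 1024;

// Base64 turns every 3 raw bytes into 4 characters, so the raw budget is
// 3/4 of the remaining space. Each 16-bit sample occupies 2 raw bytes.
function maxSamplesPerMessage(maxBytes = MAX_MESSAGE_BYTES) {
    const rawBytes = Math.floor((maxBytes - ENVELOPE_BYTES) / 4) * 3;
    return Math.floor(rawBytes / 2);
}

// Splits an Int16Array into chunks of at most maxSamples samples.
// slice() copies, so every chunk owns its buffer and can be encoded
// with new Uint8Array(chunk.buffer) without picking up neighbouring data.
function splitSamples(samples, maxSamples = maxSamplesPerMessage()) {
    const chunks = [];
    for (let start = 0; start < samples.length; start += maxSamples) {
        chunks.push(samples.slice(start, start + maxSamples));
    }
    return chunks;
}
```

Under these assumptions one message holds roughly four minutes of 16 kHz audio. Encode and send each chunk in order, using the same sample_rate for all of them.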

3. Interrupt Playback

Stop current video generation and playback immediately.
{
  "type": "agent.interrupt"
}
Required Fields: None
Response: Server will send agent.interrupt.confirmed.
Use Cases:
  • User wants to skip current speech
  • New urgent message needs to be displayed
  • Cancel ongoing generation
Examples:
function interrupt() {
    ws.send(JSON.stringify({
        type: 'agent.interrupt'
    }));
}
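
For the "new urgent message" use case, one cautious approach is to wait for agent.interrupt.confirmed before sending the replacement audio. This reference does not say that waiting is required, so the ordering below is a conservative choice rather than a documented rule. It also assumes a browser-style WebSocket and JSON replies with a type field. waitForMessage and interruptThenSpeak are hypothetical helper names.

```javascript
// Resolves with the first server message whose "type" equals expectedType.
// Assumption: replies are JSON objects with a "type" field.
function waitForMessage(ws, expectedType, timeoutMs = 5000) {
    return new Promise((resolve, reject) => {
        const timer = setTimeout(() => {
            ws.removeEventListener('message', onMessage);
            reject(new Error(`timed out waiting for ${expectedType}`));
        }, timeoutMs);

        function onMessage(event) {
            let reply;
            try {
                reply = JSON.parse(event.data);
            } catch {
                return; // Ignore frames that are not JSON.
            }
            if (reply && reply.type === expectedType) {
                clearTimeout(timer);
                ws.removeEventListener('message', onMessage);
                resolve(reply);
            }
        }

        ws.addEventListener('message', onMessage);
    });
}

// Stops current playback, then sends replacement audio once confirmed.
async function interruptThenSpeak(ws, base64Audio, sampleRate = 16000) {
    // Start listening before sending so the confirmation cannot be missed.
    const confirmed = waitForMessage(ws, 'agent.interrupt.confirmed');
    ws.send(JSON.stringify({ type: 'agent.interrupt' }));
    await confirmed;
    ws.send(JSON.stringify({
        type: 'agent.speak',
        audio: base64Audio,
        sample_rate: sampleRate
    }));
}
```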

4. Cleanup (End Session)

There is no WebSocket session.stop message. When you’re done:
  1. Close the WebSocket connection
  2. Leave your video room
  3. Call the REST endpoint POST /v1/sessions/{session_id}/end to end the session
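
The three cleanup steps can be sketched as one function. This reference does not give the REST base URL or the authentication scheme, so apiBaseUrl and headers are placeholders you must supply. leaveRoom stands in for whatever your Daily or LiveKit client uses to leave the room. cleanupSession and fetchImpl are illustrative names.

```javascript
// Sketch only: closes the socket, leaves the room, then ends the session.
// apiBaseUrl, headers and leaveRoom are placeholders supplied by the caller.
async function cleanupSession({
    ws,
    leaveRoom,
    apiBaseUrl,
    sessionId,
    headers = {},
    fetchImpl = fetch
}) {
    // 1. Close the WebSocket connection
    ws.close();

    // 2. Leave the video room (Daily or LiveKit specific, provided by caller)
    await leaveRoom();

    // 3. End the session through the REST API
    const url = `${apiBaseUrl}/v1/sessions/${encodeURIComponent(sessionId)}/end`;
    const response = await fetchImpl(url, { method: 'POST', headers });
    if (!response.ok) {
        throw new Error(`ending session failed with HTTP ${response.status}`);
    }
    return response;
}
```

Passing fetchImpl makes the function easy to exercise with a stub. In normal use the default global fetch is enough.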

Message Flow

Typical message sequence for an avatar session:
  1. Create session → Call REST API POST /v1/sessions to create a session
  2. Join your video room → Connect to Daily or LiveKit room to receive avatar video
  3. Connect WebSocket → Connect to wss://wss.agenthuman.com
  4. Send session.init → Server responds with connection.established
  5. Avatar joins video room → Avatar appears as participant
  6. Send agent.speak (one or more times) → Server generates video and streams to your video room
  7. Receive video from your room → Avatar video appears in the room
  8. Optional: Send agent.interrupt → Stop current playback
  9. Close WebSocket and leave video room
  10. Call POST /v1/sessions/{session_id}/end → Release resources and mark the session ended
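
To catch ordering mistakes on the client before the server rejects them, the WebSocket part of this sequence can be tracked with a small state guard. This is a sketch under the assumption that commands should only be sent after connection.established, which follows the order listed above. createFlowGuard and its state names are hypothetical.

```javascript
// Hypothetical client-side guard that mirrors the sequence above.
function createFlowGuard() {
    let state = 'connected'; // WebSocket is open, nothing sent yet

    return {
        get state() {
            return state;
        },

        // Call before ws.send(); throws if the message would be out of order.
        beforeSend(type) {
            if (state === 'connected') {
                if (type !== 'session.init') {
                    throw new Error('session.init must be the first message');
                }
                state = 'initializing';
                return;
            }
            if (state === 'initializing') {
                throw new Error('wait for connection.established before sending ' + type);
            }
            if (type === 'session.init') {
                throw new Error('session.init was already sent');
            }
        },

        // Call with the "type" of each message received from the server.
        onServerMessage(type) {
            if (state === 'initializing' && type === 'connection.established') {
                state = 'ready';
            }
        }
    };
}
```

Call beforeSend(type) immediately before each ws.send() and feed the type of every incoming message to onServerMessage. Once the state is ready, agent.speak and agent.interrupt pass through.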

Common Errors

| Error Message | Cause | Solution |
| --- | --- | --- |
| First message must be session.init | WebSocket initialized incorrectly | Send session.init as the first message after connecting |
| Missing session id | The session_id field is missing | Include session_id in the session.init message |
| Invalid session id - server not assigned or session id mismatch | Session was routed to the wrong server or server not assigned | Use the server_ws_uri provided when creating/starting the session |
| Failed to fetch video path | Session configuration could not be retrieved from API | Ensure the session exists and the session_token is valid |
| Session not found | Invalid or expired session | Create a new session and re-establish connection with session.init |
| Invalid JSON format | Malformed JSON | Validate JSON structure before sending |
| No audio data provided | Empty audio field | Include base64-encoded audio in agent.speak message |
| Failed to process audio | Invalid audio format | Verify 16-bit mono PCM and set the correct sample_rate |
| Unknown message type | Unsupported message type | Check message type is one of the documented types (session.init, agent.speak, agent.interrupt) |
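
A client can turn the error messages listed above into a follow-up action with a simple lookup. The sketch below matches on the exact message text, which is brittle if the server wording changes, and it does not assume anything about how the error is delivered: you pass in the message string yourself. The action names and actionForError are hypothetical.

```javascript
// Maps documented error messages to a hypothetical follow-up action.
const ERROR_ACTIONS = new Map([
    ['First message must be session.init', 'send_init_first'],
    ['Missing session id', 'fix_init_payload'],
    ['Invalid session id - server not assigned or session id mismatch', 'check_ws_uri'],
    ['Failed to fetch video path', 'check_credentials'],
    ['Session not found', 'create_new_session'],
    ['Invalid JSON format', 'fix_payload'],
    ['No audio data provided', 'fix_payload'],
    ['Failed to process audio', 'check_audio_format'],
    ['Unknown message type', 'fix_payload']
]);

// Returns the action for a known error message, or 'unknown' otherwise.
function actionForError(message) {
    return ERROR_ACTIONS.get(String(message).trim()) ?? 'unknown';
}
```

Only create_new_session calls for a fresh REST session. The other actions point to a client-side fix, so retrying the same message unchanged is unlikely to help.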