Overview
This reference documents the WebSocket messages you send to control the avatar server. These messages are simple JSON commands for audio playback, interruption, and session management.
Important: WebSocket is for Commands Only
The WebSocket connection is used exclusively for sending audio commands to the avatar server. It is NOT used for WebRTC signaling or video streaming. Video is delivered through your Daily.co room.
Architecture:
- Daily.co Room → Where you receive avatar video (WebRTC handled for you)
- WebSocket → Where you send audio commands (this reference)
Message Format
All messages must be valid JSON objects with a type field.
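The original example for this section did not survive, so the following is only a minimal sketch of the rule. The helper names `encode_message` and `has_type_field` are hypothetical, and using `agent.interrupt` as the literal value of the type field is an assumption based on the message names used later in this reference.

```python
import json

def encode_message(msg_type, **fields):
    """Serialize a command as a JSON object that carries a "type" field."""
    if not msg_type:
        raise ValueError("every message needs a non-empty type")
    # "type" is written last so extra fields can never overwrite it.
    return json.dumps({**fields, "type": msg_type})

def has_type_field(raw):
    """Return True if raw parses as a JSON object with a string "type"."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and isinstance(parsed.get("type"), str)

# Assumed type value, taken from the message names in this reference.
interrupt = encode_message("agent.interrupt")
```

Checking a payload with `has_type_field` before sending is a cheap way to avoid the Invalid JSON format error listed under Common Errors.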
Message Types
1. Session Initialize
Initialize a new session with credentials. Must be sent first after connecting.
- session_id - Your AgentHuman session ID
- access_token - Your session access token
- room.platform - Must be daily
- room.url - Daily.co room URL from session.daily_room.url
- room.token - Daily.co meeting token from session.daily_room.token
- room.display_name - Display name used when the avatar joins the Daily room
- video_width - Video output width in pixels (default: 1280)
- video_height - Video output height in pixels (default: 720)
- Supported aspect ratios: 18:9, 16:9, 5:3, 16:10, 3:2, 4:3, 1:1 (landscape or portrait)
The server responds with connection.established on success.
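As a rough illustration of the fields above, the sketch below assembles a session.init payload. It assumes the dotted names (room.platform, room.url, and so on) map to a nested room object, which this reference does not confirm. The function name `build_session_init`, the default display name, and every credential value are placeholders.

```python
import json

def build_session_init(session_id, access_token, room_url, room_token,
                       display_name="Avatar", video_width=1280, video_height=720):
    """Build a session.init payload from the documented fields.

    Assumption: room.* fields are nested under a "room" object.
    """
    return {
        "type": "session.init",
        "session_id": session_id,
        "access_token": access_token,
        "room": {
            "platform": "daily",  # the only documented platform value
            "url": room_url,
            "token": room_token,
            "display_name": display_name,  # hypothetical default
        },
        "video_width": video_width,    # documented default: 1280
        "video_height": video_height,  # documented default: 720
    }

# Placeholder values only; real ones come from your session and its daily_room.
init_msg = build_session_init(
    session_id="sess_example",
    access_token="token_example",
    room_url="https://example.daily.co/room",
    room_token="daily_token_example",
)
payload = json.dumps(init_msg)
```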
2. Send Audio for Video Generation
Send audio data to generate talking head video. Audio must be 16-bit, mono PCM raw bytes encoded in base64.
- audio - Base64-encoded raw PCM audio bytes
- sample_rate - The sample rate of the PCM audio you're sending. Defaults to 16000. If you send 48 kHz audio, set sample_rate: 48000.
Audio requirements:
- Sample Rate: 16 kHz (16000 Hz) by default, or provide sample_rate (e.g. 48000)
- Format: Raw PCM, 16-bit signed integer (not WAV file)
- Channels: Mono (single channel)
- Encoding: Base64 string
- Max Size: 100 MB per message
The server responds with agent.speak.confirmed with the number of audio samples received.
Examples:
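The original examples are missing from this page. As a stand-in, here is a minimal sketch that generates a short test tone as raw 16-bit mono PCM and wraps it in an agent.speak message. Little-endian byte order is an assumption (this reference only says 16-bit signed integer), and the helper names `sine_pcm16` and `build_speak` are hypothetical.

```python
import base64
import math
import struct

def sine_pcm16(duration_s=0.1, sample_rate=16000, freq_hz=440.0):
    """Generate a mono 16-bit signed PCM test tone as raw bytes (no WAV header)."""
    n = round(duration_s * sample_rate)
    samples = [
        int(32767 * 0.3 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    ]
    # "<" = little-endian (assumed), "h" = 16-bit signed integer.
    return struct.pack("<%dh" % n, *samples)

def build_speak(pcm_bytes, sample_rate=16000):
    """Wrap raw PCM bytes in an agent.speak message with base64 encoding."""
    if len(pcm_bytes) % 2 != 0:
        raise ValueError("16-bit PCM must contain an even number of bytes")
    return {
        "type": "agent.speak",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
        "sample_rate": sample_rate,
    }

speak_16k = build_speak(sine_pcm16())
speak_48k = build_speak(sine_pcm16(sample_rate=48000), sample_rate=48000)
```

If your audio starts as a WAV file, strip the header and send only the sample data, since the format requirement above excludes WAV containers.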
3. Interrupt Playback
Stop current video generation and playback immediately. The server responds with agent.interrupt.confirmed.
Use Cases:
- User wants to skip current speech
- New urgent message needs to be displayed
- Cancel ongoing generation
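For the urgent-message use case, one plausible pattern is to send an interrupt and then the replacement audio. The sketch below only prepares the two JSON strings in order. The function name `interrupt_then_speak` is hypothetical, and this reference does not say whether you must wait for agent.interrupt.confirmed before sending new audio, so treat the ordering as an assumption to verify.

```python
import json

def interrupt_then_speak(audio_b64, sample_rate=16000):
    """Return JSON strings to send in order: interrupt first, then new audio."""
    return [
        json.dumps({"type": "agent.interrupt"}),
        json.dumps({
            "type": "agent.speak",
            "audio": audio_b64,  # base64-encoded 16-bit mono PCM
            "sample_rate": sample_rate,
        }),
    ]

# "AAAA" is a placeholder base64 string, not meaningful audio.
outgoing = interrupt_then_speak("AAAA")
```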
4. Cleanup (End Session)
There is no WebSocket session.stop message. When you're done:
- Close the WebSocket connection
- Leave the Daily.co room
- Call the REST endpoint POST /v1/sessions/{session_id}/end to end the session
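The final cleanup step can be sketched with the standard library. The base URL below is a placeholder because this reference does not give the API host, and any authentication headers are left for the caller to supply. The function name `build_end_session_request` is hypothetical.

```python
import urllib.request
from urllib.parse import quote

API_BASE = "https://api.example.com"  # placeholder host, not from this reference

def build_end_session_request(session_id, headers=None):
    """Prepare the POST /v1/sessions/{session_id}/end request without sending it."""
    url = f"{API_BASE}/v1/sessions/{quote(session_id, safe='')}/end"
    return urllib.request.Request(url, method="POST", headers=headers or {})

req = build_end_session_request("sess_example")
# To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the cleanup logic easy to check without network access.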
Message Flow
Typical message sequence for an avatar session:
- Join Daily.co room → Connect to video room to receive avatar video
- Connect WebSocket → Connect to avatar server
- Send session.init → Server responds with connection.established
- Avatar joins Daily.co room → Avatar appears as participant
- Send agent.speak (one or more times) → Server generates video and streams to Daily.co
- Receive video from Daily.co → Avatar video appears in the room
- Optional: Send agent.interrupt → Stop current playback
- Close WebSocket and leave Daily.co room
- Call POST /v1/sessions/{session_id}/end → Release resources and mark the session ended
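The WebSocket portion of this sequence can be sketched independently of any particular client library. In the sketch below, `send` and `receive` are hypothetical callables you would back with your WebSocket client. It also assumes server replies are JSON objects whose type field carries names such as connection.established, which this reference implies but does not show.

```python
import json

def run_avatar_session(send, receive, init_msg, audio_chunks, sample_rate=16000):
    """Drive session.init followed by one agent.speak per audio chunk.

    send(text) transmits one JSON string; receive() returns the next
    server message as a JSON string. Both are caller-supplied hooks.
    """
    # session.init must be the first message after connecting.
    send(json.dumps(init_msg))
    reply = json.loads(receive())
    if reply.get("type") != "connection.established":
        raise RuntimeError(f"session.init was not accepted: {reply}")
    # Each chunk is a base64 string of 16-bit mono PCM.
    for chunk_b64 in audio_chunks:
        send(json.dumps({
            "type": "agent.speak",
            "audio": chunk_b64,
            "sample_rate": sample_rate,
        }))
```

Joining and leaving the Daily.co room and the final REST call happen outside this function, before and after the WebSocket exchange.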
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| First message must be session.init | WebSocket initialized incorrectly | Send session.init as the first message after connecting |
| Invalid room platform | Missing/invalid room.platform | Set room.platform to daily |
| Invalid video dimensions | Non-positive or non-numeric video size | Use positive integers for video_width and video_height |
| Invalid aspect ratio | Unsupported width/height ratio | Use one of: 18:9, 16:9, 5:3, 16:10, 3:2, 4:3, 1:1 (or portrait equivalents) |
| Session not found | Invalid or expired session | Re-establish session with session.init |
| Invalid JSON format | Malformed JSON | Validate JSON structure before sending |
| No audio data provided | Empty audio field | Include base64-encoded audio |
| Failed to process audio | Invalid audio format | Verify 16-bit mono PCM and set the correct sample_rate |
| Unknown message type | Unsupported message type | Check message type is one of the documented types |
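Several of these errors can be caught on the client before a message is sent. The sketch below mirrors a few rows of the table. It is only an approximation: the server's actual validation is not documented here, the exact-ratio comparison is an assumption, and the function name `precheck` and the nested room object are hypothetical.

```python
from fractions import Fraction

SUPPORTED_RATIOS = {
    Fraction(18, 9), Fraction(16, 9), Fraction(5, 3), Fraction(16, 10),
    Fraction(3, 2), Fraction(4, 3), Fraction(1, 1),
}
KNOWN_TYPES = {"session.init", "agent.speak", "agent.interrupt"}

def precheck(message):
    """Return the table's error message this payload would likely trigger, or None."""
    msg_type = message.get("type")
    if msg_type not in KNOWN_TYPES:
        return "Unknown message type"
    if msg_type == "session.init":
        room = message.get("room")
        platform = room.get("platform") if isinstance(room, dict) else None
        if platform != "daily":
            return "Invalid room platform"
        w = message.get("video_width", 1280)
        h = message.get("video_height", 720)
        for v in (w, h):
            # bool is a subclass of int, so exclude it explicitly.
            if not isinstance(v, int) or isinstance(v, bool) or v <= 0:
                return "Invalid video dimensions"
        ratio = Fraction(w, h)
        # Accept landscape ratios and their portrait equivalents.
        if ratio not in SUPPORTED_RATIOS and 1 / ratio not in SUPPORTED_RATIOS:
            return "Invalid aspect ratio"
    if msg_type == "agent.speak" and not message.get("audio"):
        return "No audio data provided"
    return None
```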
Next Steps
- Learn about Server → Client Messages
- See Complete Examples
- Review Best Practices