Overview
Create real-time AI avatar conversations by combining two simple components: a Daily.co room for video delivery and a WebSocket for audio commands.
How It Works
The avatar server:
- Receives your audio commands via WebSocket
- Generates lifelike talking head video using GPU acceleration
- Streams the video to your Daily.co room as a participant
Your client:
- Joins the Daily.co room to receive video
- Sends audio commands via WebSocket
Why This Architecture?
✅ No WebRTC Complexity - Daily.co handles all WebRTC setup (STUN, TURN, ICE, SDP)
✅ Reliable Streaming - Daily.co’s infrastructure ensures smooth video delivery
✅ Simple Integration - Just join a room and send JSON messages
✅ Scalable - Works behind corporate firewalls and restrictive NATs
✅ Low Latency - ~500ms end-to-end from audio to video
Key Features
- 🎭 Real-time talking head video generation from audio
- 🚀 GPU-accelerated video processing for low latency
- 🎥 Professional video quality with synchronized audio
- 🔄 Multi-session support with independent avatars
- 🌐 Works globally with Daily.co’s CDN
- 📱 Cross-platform (web, mobile, desktop)
Quick Start
1. Create and Start a Session
First, create a session using the Create Session endpoint, then start it. The responses provide:
- session_id - Session identifier
- access_token - Authentication token
- daily_room - Daily.co room details (url, name, token)
- ws_uri - WebSocket URI for the avatar server (available after starting)
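A minimal sketch of holding the session fields listed above in typed form. The flat JSON layout and the `parse_session` helper are assumptions made for illustration; only the field names come from this guide, so check the Create Session and Start Session responses for the actual shape.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class DailyRoom:
    url: str
    name: str
    token: str


@dataclass(frozen=True)
class AvatarSession:
    session_id: str
    access_token: str
    daily_room: DailyRoom
    ws_uri: Optional[str]  # only available once the session has been started


def parse_session(payload: Mapping[str, Any]) -> AvatarSession:
    """Extract the documented fields (hypothetical helper).

    Assumes the fields sit at the top level of the response body.
    Raises KeyError if a required field is missing.
    """
    room = payload["daily_room"]
    return AvatarSession(
        session_id=payload["session_id"],
        access_token=payload["access_token"],
        daily_room=DailyRoom(url=room["url"], name=room["name"], token=room["token"]),
        ws_uri=payload.get("ws_uri"),
    )
```

Keeping `ws_uri` optional makes it explicit that a created but not yet started session cannot be used to open the WebSocket.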
2. Connect to Daily.co Room (Video)
The avatar server streams video into the Daily.co room. Your client connects to receive it.
No WebRTC Setup Required: Daily.co handles all video streaming complexity. You don’t need to manage WebRTC connections - just join the room and send audio commands.
3. Connect to Avatar Server WebSocket (Audio Commands)
Connect to the avatar server WebSocket to send audio commands.
WebSocket Purpose: The WebSocket is only for sending audio commands (agent.speak, agent.interrupt). Video is streamed directly to the Daily.co room by the avatar server.
4. Send Audio for Video Generation
Send audio via WebSocket to the avatar server as 16-bit, mono PCM encoded in base64. If you send 48kHz audio, include sample_rate: 48000.
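A minimal sketch of preparing an agent.speak message from raw samples. The `type` and `audio` key names, the little-endian sample order, and both helper functions are assumptions for illustration; only the message name, base64 encoding, and `sample_rate` come from this guide.

```python
import base64
import json
import math
import struct


def sine_pcm(freq_hz: float, seconds: float, sample_rate: int = 48000) -> bytes:
    """Generate a test tone as 16-bit signed mono PCM (little-endian assumed)."""
    n = round(seconds * sample_rate)
    samples = (
        int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    )
    return struct.pack(f"<{n}h", *samples)


def build_speak_message(pcm: bytes, sample_rate: int = 48000) -> str:
    """Wrap raw PCM bytes in a JSON text frame (key names are assumed)."""
    return json.dumps({
        "type": "agent.speak",
        "audio": base64.b64encode(pcm).decode("ascii"),
        "sample_rate": sample_rate,
    })


message = build_speak_message(sine_pcm(440.0, 0.01))
```

The resulting string would be sent as a single WebSocket text message after session.init has been accepted.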
Connection Flow
Key Points:
- Client joins Daily.co room to receive video
- Client connects to WebSocket to send audio commands
- Avatar server automatically joins Daily.co room as a participant
- Client sends audio via WebSocket
- Avatar server generates and streams video to Daily.co room
- Client receives video from Daily.co room (handled by Daily.co infrastructure)
Audio Requirements
| Parameter | Value | Description |
|---|---|---|
| Format | Raw PCM | Raw audio bytes (not WAV file) |
| Sample Rate | 48000 Hz | Input sample rate |
| Channels | Mono | Single channel audio |
| Bit Depth | 16-bit signed | Audio bit depth |
| Encoding | Base64 | Encode raw bytes as base64 string |
| Max Size | 100 MB | Per WebSocket message |
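The requirements in this table can be checked before sending. The sketch below is one way to do it, assuming "100 MB" means decimal megabytes and applies to the base64 payload (the JSON envelope adds a few more bytes); `check_audio` is a hypothetical helper, not part of any SDK.

```python
MAX_MESSAGE_BYTES = 100 * 1000 * 1000  # assumption: decimal megabytes
BYTES_PER_SAMPLE = 2                   # 16-bit signed, mono


def check_audio(pcm: bytes, sample_rate: int = 48000) -> float:
    """Return the clip duration in seconds, or raise ValueError."""
    if not pcm:
        raise ValueError("no audio data")
    if pcm[:4] == b"RIFF":
        raise ValueError("looks like a WAV file; send raw PCM without the header")
    if len(pcm) % BYTES_PER_SAMPLE:
        raise ValueError("byte count is not a whole number of 16-bit samples")
    encoded_len = 4 * ((len(pcm) + 2) // 3)  # size after base64 with padding
    if encoded_len > MAX_MESSAGE_BYTES:
        raise ValueError("encoded audio exceeds the per-message size limit")
    return len(pcm) / (BYTES_PER_SAMPLE * sample_rate)
```

At 48000 Hz, one second of 16-bit mono audio is 96,000 bytes, so the function returns 1.0 for a buffer of that size.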
Video Output
| Parameter | Value | Description |
|---|---|---|
| Format | H.264 | Video codec |
| Frame Rate | 25 FPS | Default frame rate |
| Resolution | Avatar dependent | Landscape (16:9), Portrait (9:16), or Square (1:1) |
| Delivery | Daily.co Room | Video and audio tracks |
| Latency | ~500ms | End-to-end latency |
Message Types Overview
Client → Server (WebSocket)
- session.init - Initialize session with credentials
- agent.speak - Send audio for video generation
- agent.interrupt - Stop current playback
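As a rough sketch, the two control messages could be built as follows. The message names and `room.platform` come from this guide; the `type` envelope key and the placement of the credentials are assumptions, and the real session.init may require additional room fields.

```python
import json


def build_session_init(session_id: str, access_token: str) -> str:
    """First message after connecting (field layout is assumed)."""
    return json.dumps({
        "type": "session.init",
        "session_id": session_id,
        "access_token": access_token,
        "room": {"platform": "daily"},  # other room fields may be required
    })


def build_interrupt() -> str:
    """Ask the avatar server to stop the current playback."""
    return json.dumps({"type": "agent.interrupt"})
```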
Server → Client (WebSocket)
- connection.established - Session ready
- agent.speak.confirmed - Audio received confirmation
- agent.interrupt.confirmed - Interrupt confirmed
- error - Error occurred
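A client typically routes these server messages through one function. The sketch below assumes each message is a JSON object with a `type` key and that errors carry a `message` key; both key names are assumptions, as is the `classify_server_message` helper.

```python
import json


def classify_server_message(raw: str) -> str:
    """Map a server message to a simple client-side state label."""
    msg = json.loads(raw)
    kind = msg.get("type")  # assumed envelope key
    if kind == "connection.established":
        return "ready"
    if kind == "agent.speak.confirmed":
        return "speak_ack"
    if kind == "agent.interrupt.confirmed":
        return "interrupt_ack"
    if kind == "error":
        raise RuntimeError(msg.get("message", "unknown avatar server error"))
    return "ignored"  # tolerate message types this client does not know
```

Waiting for the "ready" state before sending the first agent.speak keeps audio from being sent to a session that is not yet initialized.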
Common Errors
| Error | Cause | Solution |
|---|---|---|
| First message must be session.init | WebSocket initialized incorrectly | Send session.init as the first message after connecting |
| Invalid room platform | Missing/invalid room.platform | Set room.platform to daily |
| Session not found | Invalid/expired session | Create new session |
| Failed to process audio | Wrong audio format | Use 16-bit mono PCM and set the correct sample_rate |
| No audio data provided | Empty audio field | Include base64-encoded audio |
Complete Examples
We provide full working examples for multiple platforms:
- Browser (HTML + JavaScript)
- Python Client
- Node.js Client
Best Practices
Session Management
- Always send session.init first after connecting to WebSocket
- Reuse sessions for multiple audio requests
- When finished: close the WebSocket, leave the Daily.co room, and call POST /v1/sessions/{session_id}/end
- Implement reconnection logic with exponential backoff
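One way to compute reconnection delays with exponential backoff is sketched below. The base delay, growth factor, cap, and attempt count are illustrative values chosen here, not recommendations from the service.

```python
import random
from typing import Callable, Iterator


def backoff_delays(
    base: float = 0.5,
    factor: float = 2.0,
    cap: float = 30.0,
    attempts: int = 6,
    jitter: float = 0.0,
    rng: Callable[[], float] = random.random,
) -> Iterator[float]:
    """Yield one wait time (in seconds) per reconnection attempt.

    Each delay grows by `factor` until it reaches `cap`. A `jitter` between
    0 and 1 shortens each delay by a random fraction of up to that amount,
    which spreads out clients that reconnect at the same moment.
    """
    for attempt in range(attempts):
        delay = min(cap, base * factor ** attempt)
        yield delay * (1 - jitter * rng())
```

A reconnect loop would sleep for each yielded delay, retry the WebSocket connection, send session.init again on success, and give up once the generator is exhausted.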
Audio Quality
- Use clean audio without background noise
- Normalize audio levels to prevent clipping
- Send 5-10 second chunks for optimal performance
- Validate format before sending (16-bit, mono PCM + correct sample_rate, e.g. 48000)
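To illustrate the chunking guidance, the sketch below splits a longer raw PCM buffer into pieces of a fixed duration on sample boundaries. `chunk_pcm` is a hypothetical helper and the 5 second default is simply the lower end of the range given above.

```python
from typing import List


def chunk_pcm(pcm: bytes, seconds: float = 5.0, sample_rate: int = 48000) -> List[bytes]:
    """Split 16-bit mono PCM into chunks of at most `seconds` each."""
    step = int(seconds * sample_rate) * 2  # 2 bytes per 16-bit sample
    if step <= 0:
        raise ValueError("chunk duration must be positive")
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]
```

For example, 12 seconds of 48000 Hz audio (1,152,000 bytes) yields two 480,000-byte chunks and one final 192,000-byte chunk, each of which would go into its own agent.speak message.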
Daily.co Connection
- Join the Daily.co room before sending audio
- Monitor participant events to know when avatar joins
- Handle network disconnections gracefully
- The avatar server automatically joins as a participant
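The ordering advice above can be enforced with a small gate that holds outgoing audio messages until the client has joined the room and the avatar participant has appeared. This is a sketch with no Daily.co SDK calls; `SpeakGate` and its method names are hypothetical, and your own room event handlers would call `mark`.

```python
from collections import deque
from typing import Callable, Deque, Optional


class SpeakGate:
    """Queue messages until the room is joined and the avatar is present."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send
        self._pending: Deque[str] = deque()
        self.room_joined = False
        self.avatar_present = False

    def _is_open(self) -> bool:
        return self.room_joined and self.avatar_present

    def submit(self, message: str) -> None:
        """Send immediately if the gate is open, otherwise queue."""
        if self._is_open():
            self._send(message)
        else:
            self._pending.append(message)

    def mark(self, *, room_joined: Optional[bool] = None,
             avatar_present: Optional[bool] = None) -> None:
        """Update state from room events and flush queued messages in order."""
        if room_joined is not None:
            self.room_joined = room_joined
        if avatar_present is not None:
            self.avatar_present = avatar_present
        while self._is_open() and self._pending:
            self._send(self._pending.popleft())
```

In practice `send` would be the WebSocket send function, and `mark(avatar_present=False)` on a participant-left event would make later messages queue again.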
Performance
- Keep WebSocket connections alive for multiple requests
- Implement proper error handling
- Monitor network quality
- Use appropriate chunk sizes
Troubleshooting
Daily.co Connection Issues
- Verify Daily.co room URL and token are valid
- Check network connectivity
- Monitor participant events in Daily.co room
Audio/Video Sync Issues
- Check network latency
- Ensure consistent audio format (48kHz, 16-bit, mono PCM)
- Monitor video playback in Daily.co room
Performance Issues
- Verify audio format and chunk size
- Check network bandwidth
- Monitor server response times