Avatar Video Streaming

Overview

Create real-time AI avatar conversations by combining two simple components: a Daily.co room for video delivery and a WebSocket for audio commands.

How It Works

You → Send Audio (WebSocket) → Avatar Server
                                      ↓
                               Generates Video
                                      ↓
                        Streams to Daily.co Room
                                      ↓
You ← Receive Video (Daily.co) ← Daily.co Room

The avatar server does all the heavy lifting:

Receives your audio commands via WebSocket
Generates lifelike talking head video using GPU acceleration
Streams the video to your Daily.co room as a participant

You just need to:

Join the Daily.co room to receive video
Send audio commands via WebSocket

Why This Architecture?

✅ No WebRTC Complexity - Daily.co handles all WebRTC setup (STUN, TURN, ICE, SDP)
✅ Reliable Streaming - Daily.co’s infrastructure ensures smooth video delivery
✅ Simple Integration - Just join a room and send JSON messages
✅ Firewall-Friendly - Works behind corporate firewalls and restrictive NATs
✅ Low Latency - ~500ms end-to-end from audio to video

Key Features

🎭 Real-time talking head video generation from audio
🚀 GPU-accelerated video processing for low latency
🎥 Professional video quality with synchronized audio
🔄 Multi-session support with independent avatars
🌐 Works globally through Daily.co’s infrastructure
📱 Cross-platform (web, mobile, desktop)

Quick Start

1. Create and Start a Session

First, create a session using the Create Session endpoint, then start it:

# Create session
curl -X POST https://api.agenthuman.com/v1/sessions \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "avatar_id": "your-avatar-id"
  }'

# Start session (activates avatar server)
curl -X POST https://api.agenthuman.com/v1/sessions/{session_id}/start \
  -H "x-api-key: YOUR_API_KEY"

You’ll receive:

session_id - Session identifier
access_token - Authentication token
daily_room - Daily.co room details (url, name, token)
ws_uri - WebSocket URI for the avatar server (after starting)
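
The two curl calls above can be chained in client code. The following is a minimal sketch, assuming both endpoints return JSON whose top-level fields match the list above; the function name createAndStartSession and the injectable fetchImpl parameter are illustrative and not part of the API.

```javascript
// Sketch: create a session, start it, and merge both JSON responses.
// Assumption: both endpoints return JSON with the top-level fields listed
// above (session_id, access_token, daily_room, ws_uri).
const API_BASE = 'https://api.agenthuman.com/v1';

async function createAndStartSession(apiKey, avatarId, fetchImpl = fetch) {
  const createRes = await fetchImpl(`${API_BASE}/sessions`, {
    method: 'POST',
    headers: { 'x-api-key': apiKey, 'Content-Type': 'application/json' },
    body: JSON.stringify({ avatar_id: avatarId }),
  });
  if (!createRes.ok) throw new Error(`Create session failed: HTTP ${createRes.status}`);
  const created = await createRes.json();

  const startRes = await fetchImpl(`${API_BASE}/sessions/${created.session_id}/start`, {
    method: 'POST',
    headers: { 'x-api-key': apiKey },
  });
  if (!startRes.ok) throw new Error(`Start session failed: HTTP ${startRes.status}`);
  const started = await startRes.json();

  // Fields from the start response (such as ws_uri) take precedence.
  return { ...created, ...started };
}
```

Passing fetchImpl explicitly makes the helper easy to exercise with a stub; in a browser or a recent Node.js runtime the global fetch is used by default.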

2. Connect to Daily.co Room (Video)

The avatar server streams video into the Daily.co room. Your client connects to receive it:

// Use Daily.co JavaScript SDK
import DailyIframe from '@daily-co/daily-js';

const callFrame = DailyIframe.createFrame({
  showLeaveButton: true,
  iframeStyle: {
    position: 'fixed',
    width: '100%',
    height: '100%',
  }
});

// Join the Daily.co room with the provided credentials
await callFrame.join({
  url: session.daily_room.url,
  token: session.daily_room.token
});

// The avatar joins the room as a remote participant
callFrame.on('participant-joined', (event) => {
  console.log('Participant joined:', event.participant.user_name);
});

callFrame.on('track-started', (event) => {
  console.log('Track started:', event.track.kind);
  // Video and audio tracks from the avatar
});

No WebRTC Setup Required: Daily.co handles all video streaming complexity. You don’t need to manage WebRTC connections - just join the room and send audio commands.

3. Connect to Avatar Server WebSocket (Audio Commands)

Connect to the avatar server WebSocket to send audio commands:

// Connect to avatar server WebSocket
const ws = new WebSocket(ws_uri); // From start session response

ws.onopen = () => {
    console.log('Connected to avatar server');
    
    // Initialize with session credentials + Daily room info
    ws.send(JSON.stringify({
        type: 'session.init',
        config: {
            session_id: sessionId,
            access_token: accessToken,
            room: {
                platform: 'daily',
                url: session.daily_room.url,
                token: session.daily_room.token,
                display_name: 'AI Avatar (AH)'
            },
            video_width: 1280,
            video_height: 720
        }
    }));
};

ws.onmessage = (event) => {
    const message = JSON.parse(event.data);

    if (message.type === 'connection.established') {
        console.log('Session initialized, ready to send audio');
    } else if (message.type === 'agent.speak.confirmed') {
        console.log('Audio received:', message.audio_samples, 'samples');
        // Video will appear in Daily.co room automatically
    } else if (message.type === 'error') {
        // Payload fields are not documented here, so log the whole message
        console.error('Avatar server error:', message);
    }
};

WebSocket Purpose: The WebSocket is only for sending audio commands (agent.speak, agent.interrupt). Video is streamed directly to the Daily.co room by the avatar server.
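
Because audio should only be sent once the session is initialized, it can help to turn the connection.established message into a promise. The helper below is a sketch under stated assumptions: waitForMessage is a hypothetical name, and it only assumes that server messages are JSON objects with a type field, as in the handler above.

```javascript
// Resolve with the first server message of the given type; reject on an
// 'error' message or on timeout. `ws` is any object with the standard
// addEventListener/removeEventListener API (such as a browser WebSocket).
function waitForMessage(ws, type, timeoutMs = 10000) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      cleanup();
      reject(new Error(`Timed out waiting for ${type}`));
    }, timeoutMs);

    function onMessage(event) {
      let message;
      try {
        message = JSON.parse(event.data);
      } catch {
        return; // ignore frames that are not JSON
      }
      if (message.type === type) {
        cleanup();
        resolve(message);
      } else if (message.type === 'error') {
        cleanup();
        reject(new Error(`Server error: ${JSON.stringify(message)}`));
      }
    }

    function cleanup() {
      clearTimeout(timer);
      ws.removeEventListener('message', onMessage);
    }

    ws.addEventListener('message', onMessage);
  });
}
```

A typical use would be to send session.init and then `await waitForMessage(ws, 'connection.established')` before the first agent.speak.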

4. Send Audio for Video Generation

Send audio via WebSocket to the avatar server as 16-bit, mono PCM encoded in base64. If you send 48kHz audio, include sample_rate: 48000.

// Convert audio to required format
const audioContext = new AudioContext({ sampleRate: 48000 });
const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
const channelData = audioBuffer.getChannelData(0);
const samples = new Int16Array(channelData.length);

// Convert to 16-bit PCM
for (let i = 0; i < channelData.length; i++) {
    const s = Math.max(-1, Math.min(1, channelData[i]));
    samples[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
}

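// Encode as base64 in chunks, then send. Spreading a large array into
// String.fromCharCode can exceed the engine's argument limit.
const bytes = new Uint8Array(samples.buffer);
let binary = '';
for (let i = 0; i < bytes.length; i += 0x8000) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + 0x8000));
}
const base64Audio = btoa(binary);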

ws.send(JSON.stringify({
    type: 'agent.speak',
    audio: base64Audio,
    sample_rate: 48000
}));

Connection Flow

Key Points:

Client joins Daily.co room to receive video
Client connects to WebSocket to send audio commands
Avatar server automatically joins Daily.co room as a participant
Client sends audio via WebSocket
Avatar server generates and streams video to Daily.co room
Client receives video from Daily.co room (handled by Daily.co infrastructure)

Audio Requirements

Parameter	Value	Description
Format	Raw PCM	Raw audio bytes (not WAV file)
Sample Rate	48000 Hz	Input sample rate
Channels	Mono	Single channel audio
Bit Depth	16-bit signed	Audio bit depth
Encoding	Base64	Encode raw bytes as base64 string
Max Size	100 MB	Per WebSocket message

Important: Send raw PCM bytes, not WAV file format. The audio will be automatically resampled to 16kHz internally and synchronized with video playback.
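
The requirements in the table translate into simple arithmetic that can be checked before sending. The sketch below is illustrative only: the function names are hypothetical, it assumes "100 MB" means 100 MiB, and it assumes the limit applies to the base64 string rather than to the decoded bytes.

```javascript
const BYTES_PER_SAMPLE = 2;                  // 16-bit signed, mono
const MAX_MESSAGE_BYTES = 100 * 1024 * 1024; // assumption: "100 MB" means 100 MiB

// Duration in seconds of a raw PCM buffer of the given size.
function pcmDurationSeconds(byteLength, sampleRate = 48000) {
  return byteLength / (BYTES_PER_SAMPLE * sampleRate);
}

// Length of the padded base64 string produced from `byteLength` raw bytes.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

// Returns a list of problems; an empty list means the buffer looks sendable.
function checkPcm(byteLength, sampleRate = 48000) {
  const problems = [];
  if (byteLength === 0) problems.push('empty audio');
  if (byteLength % BYTES_PER_SAMPLE !== 0) {
    problems.push('odd byte count: not whole 16-bit samples');
  }
  if (!Number.isInteger(sampleRate) || sampleRate <= 0) {
    problems.push('invalid sample_rate');
  }
  // Assumption: the limit is measured on the base64 text (one byte per character).
  if (base64Length(byteLength) > MAX_MESSAGE_BYTES) {
    problems.push('base64 payload exceeds the per-message limit');
  }
  return problems;
}
```

For example, 5 seconds of 48 kHz audio is 480,000 raw bytes, which becomes a 640,000-character base64 string; the JSON envelope adds a few more bytes on top of that.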

Video Output

Parameter	Value	Description
Format	H.264	Video codec
Frame Rate	25 FPS	Default frame rate
Resolution	Avatar dependent	Landscape (16:9), Portrait (9:16), or Square (1:1)
Delivery	Daily.co Room	Video and audio tracks
Latency	~500ms	End-to-end latency

Message Types Overview

Client → Server (WebSocket)

session.init - Initialize session with credentials
agent.speak - Send audio for video generation
agent.interrupt - Stop current playback

→ Full Client Message Reference

Server → Client (WebSocket)

connection.established - Session ready
agent.speak.confirmed - Audio received confirmation
agent.interrupt.confirmed - Interrupt confirmed
error - Error occurred

→ Full Server Message Reference
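
One way to keep the onmessage handler readable as more message types are handled is a small dispatcher keyed by type. This is a sketch, not part of the API: dispatchServerMessage is a hypothetical helper that only relies on the type field shown in the lists above.

```javascript
// Route a raw server message (JSON text) to a handler keyed by message type.
// Returns true if a handler ran and false for types without a handler.
function dispatchServerMessage(raw, handlers) {
  const message = JSON.parse(raw);
  // Use an own-property check so inherited names such as "constructor"
  // are never treated as handlers.
  if (!Object.prototype.hasOwnProperty.call(handlers, message.type)) return false;
  handlers[message.type](message);
  return true;
}

// Example handler table for the server messages listed above.
const exampleHandlers = {
  'connection.established': () => console.log('Session ready'),
  'agent.speak.confirmed': (m) => console.log('Audio received:', m.audio_samples),
  'agent.interrupt.confirmed': () => console.log('Playback interrupted'),
  'error': (m) => console.error('Avatar server error:', m),
};
```

With a WebSocket named ws, this could be wired up as `ws.onmessage = (event) => dispatchServerMessage(event.data, exampleHandlers);`.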

Common Errors

Error	Cause	Solution
`First message must be session.init`	WebSocket initialized incorrectly	Send `session.init` as the first message after connecting
`Invalid room platform`	Missing/invalid `room.platform`	Set `room.platform` to `daily`
`Session not found`	Invalid/expired session	Create new session
`Failed to process audio`	Wrong audio format	Use 16-bit mono PCM and set the correct `sample_rate`
`No audio data provided`	Empty audio field	Include base64-encoded audio

→ Complete Error Reference
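
The table above can also be embedded in client code so that a known error text maps to its suggested fix. The sketch below is an assumption-laden illustration: suggestFix is a hypothetical helper, and because the field that carries the error text is not specified here, it takes the text as a plain string.

```javascript
// Known error texts and fixes, copied from the Common Errors table.
const KNOWN_ERRORS = [
  { match: 'First message must be session.init', fix: 'Send session.init as the first message after connecting' },
  { match: 'Invalid room platform', fix: 'Set room.platform to daily' },
  { match: 'Session not found', fix: 'Create new session' },
  { match: 'Failed to process audio', fix: 'Use 16-bit mono PCM and set the correct sample_rate' },
  { match: 'No audio data provided', fix: 'Include base64-encoded audio' },
];

// Return the suggested fix for an error text, or null if it is not in the table.
function suggestFix(errorText) {
  const entry = KNOWN_ERRORS.find((e) => String(errorText).includes(e.match));
  return entry ? entry.fix : null;
}
```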

Complete Examples

We provide full working examples for multiple platforms:

Browser (HTML + JavaScript)

<video id="avatar-video" autoplay playsinline></video>
<script>
  // Full implementation with UI controls
  // See complete example →
</script>

Python Client

import asyncio
from avatar_client import AvatarClient

async def main():
    # Get ws_uri from Start Session API
    client = AvatarClient(session_id, access_token, ws_uri)
    await client.connect()
    await client.send_audio("speech.wav")
    # Video appears in Daily.co room automatically

asyncio.run(main())

Node.js Client

const AvatarClient = require('./avatar-client');

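// Wrapped in an async function because top-level await is not
// available in CommonJS modules
async function main() {
  // Get wsUri from Start Session API
  const client = new AvatarClient(sessionId, accessToken, wsUri);
  await client.connect();
  await client.sendAudio('speech.wav');
  // Video appears in Daily.co room automatically
}

main().catch(console.error);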

→ View All Complete Examples

Best Practices

Session Management

Always send session.init first after connecting to WebSocket
Reuse sessions for multiple audio requests
When finished: close the WebSocket, leave the Daily.co room, and call POST /v1/sessions/{session_id}/end
Implement reconnection logic with exponential backoff
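
The reconnection advice above can be sketched as follows. The names backoffDelayMs and connectWithBackoff, the 500 ms base, the 30 s cap, and the attempt limit are all illustrative choices, not values prescribed by the API; jitter is left out to keep the example short.

```javascript
// Delay before reconnect attempt `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Call the async `connect` function until it succeeds or attempts run out.
// `sleep` is injectable so the wait can be replaced in tests.
async function connectWithBackoff(
  connect,
  maxAttempts = 5,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      // No need to wait after the final failed attempt.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

Here connect would be a function that opens the WebSocket, sends session.init, and resolves once the session is ready. In production, adding random jitter to each delay helps avoid many clients reconnecting at the same moment.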

Audio Quality

Use clean audio without background noise
Normalize audio levels to prevent clipping
Send 5-10 second chunks for optimal performance
Validate format before sending (16-bit, mono PCM + correct sample_rate, e.g. 48000)
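
Splitting a long recording into chunks of the recommended length is straightforward once the audio is in an Int16Array. The following is a minimal sketch; splitPcm is a hypothetical helper, and the 5-second default is simply the lower end of the range suggested above.

```javascript
// Split 16-bit mono PCM samples into chunks of at most `chunkSeconds` seconds.
function splitPcm(samples, sampleRate = 48000, chunkSeconds = 5) {
  const chunkLength = Math.floor(sampleRate * chunkSeconds);
  if (chunkLength <= 0) throw new RangeError('chunk length must be positive');
  const chunks = [];
  for (let start = 0; start < samples.length; start += chunkLength) {
    // slice() copies, so each chunk owns its buffer and chunk.buffer
    // contains exactly that chunk's bytes when it is base64-encoded.
    chunks.push(samples.slice(start, start + chunkLength));
  }
  return chunks;
}
```

Each chunk can then be base64-encoded and sent as its own agent.speak message. Using slice rather than subarray matters here: a subarray view shares the original buffer, so encoding its buffer property would send the whole recording.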

Daily.co Connection

Join the Daily.co room before sending audio
Monitor participant events to know when avatar joins
Handle network disconnections gracefully
The avatar server automatically joins as a participant
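
To hold back audio until the avatar is actually in the room, the participant events can be wrapped in a promise. This sketch uses the Daily call object's on, off, and participants methods; waitForAvatar is a hypothetical name, and matching on user_name assumes the display_name passed in session.init is what Daily.co reports for the avatar participant.

```javascript
// Resolve with the avatar's participant object once it is in the room.
// Assumption: the avatar's user_name equals the display_name from session.init.
function waitForAvatar(callFrame, displayName = 'AI Avatar (AH)', timeoutMs = 15000) {
  return new Promise((resolve, reject) => {
    const isAvatar = (p) => Boolean(p) && !p.local && p.user_name === displayName;

    // The avatar may already be present when this is called.
    const present = Object.values(callFrame.participants()).find(isAvatar);
    if (present) {
      resolve(present);
      return;
    }

    const timer = setTimeout(() => {
      callFrame.off('participant-joined', onJoined);
      reject(new Error('Avatar did not join in time'));
    }, timeoutMs);

    function onJoined(event) {
      if (!isAvatar(event.participant)) return;
      clearTimeout(timer);
      callFrame.off('participant-joined', onJoined);
      resolve(event.participant);
    }

    callFrame.on('participant-joined', onJoined);
  });
}
```

Calling `await waitForAvatar(callFrame)` after joining the room and before the first agent.speak keeps early audio from playing to an empty room.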

Performance

Keep WebSocket connections alive for multiple requests
Implement proper error handling
Monitor network quality
Use appropriate chunk sizes

→ Complete Best Practices Guide

Troubleshooting

Daily.co Connection Issues

Verify Daily.co room URL and token are valid
Check network connectivity
Monitor participant events in Daily.co room

Audio/Video Sync Issues

Check network latency
Ensure consistent audio format (48kHz, 16-bit, mono PCM)
Monitor video playback in Daily.co room

Performance Issues

Verify audio format and chunk size
Check network bandwidth
Monitor server response times

→ Full Troubleshooting Guide

Support

Need help? Visit our support portal or check the API status.
