
Overview

Create real-time AI avatar conversations by combining two simple components: a Daily.co room for video delivery and a WebSocket for audio commands.

How It Works

You → Send Audio (WebSocket) → Avatar Server
                                     ↓
                              Generates Video
                                     ↓
                         Streams to Daily.co Room
                                     ↓
You ← Receive Video (Daily.co) ← Daily.co Room

The avatar server does all the heavy lifting:
  1. Receives your audio commands via WebSocket
  2. Generates lifelike talking head video using GPU acceleration
  3. Streams the video to your Daily.co room as a participant
You just need to:
  1. Join the Daily.co room to receive video
  2. Send audio commands via WebSocket

Why This Architecture?

No WebRTC Complexity - Daily.co handles all WebRTC setup (STUN, TURN, ICE, SDP)
Reliable Streaming - Daily.co’s infrastructure ensures smooth video delivery
Simple Integration - Just join a room and send JSON messages
Firewall Friendly - Works behind corporate firewalls and restrictive NATs
Low Latency - ~500ms end-to-end from audio to video

Key Features

  • 🎭 Real-time talking head video generation from audio
  • 🚀 GPU-accelerated video processing for low latency
  • 🎥 Professional video quality with synchronized audio
  • 🔄 Multi-session support with independent avatars
  • 🌐 Works globally with Daily.co’s CDN
  • 📱 Cross-platform (web, mobile, desktop)

Quick Start

1. Create and Start a Session

First, create a session using the Create Session endpoint, then start it:
# Create session
curl -X POST https://api.agenthuman.com/v1/sessions \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "avatar_id": "your-avatar-id"
  }'

# Start session (activates avatar server)
# Replace SESSION_ID with the session_id returned by the create call
curl -X POST "https://api.agenthuman.com/v1/sessions/SESSION_ID/start" \
  -H "x-api-key: YOUR_API_KEY"
You’ll receive:
  • session_id - Session identifier
  • access_token - Authentication token
  • daily_room - Daily.co room details (url, name, token)
  • ws_uri - WebSocket URI for the avatar server (after starting)
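
The two curl calls above can also be issued from JavaScript. The following is a minimal sketch, not an official client: the helper names (`buildCreateSessionRequest`, `buildStartSessionRequest`, `createAndStartSession`) are hypothetical, and it assumes that `session_id` is a top-level field of the create response and that a global `fetch` is available (browsers, Node.js 18+).

```javascript
const API_BASE = 'https://api.agenthuman.com/v1';

// Describes the "create session" call from the curl example.
function buildCreateSessionRequest(apiKey, avatarId) {
  return {
    url: `${API_BASE}/sessions`,
    options: {
      method: 'POST',
      headers: { 'x-api-key': apiKey, 'Content-Type': 'application/json' },
      body: JSON.stringify({ avatar_id: avatarId }),
    },
  };
}

// Describes the "start session" call for an existing session.
function buildStartSessionRequest(apiKey, sessionId) {
  return {
    url: `${API_BASE}/sessions/${encodeURIComponent(sessionId)}/start`,
    options: { method: 'POST', headers: { 'x-api-key': apiKey } },
  };
}

// Sends a prepared request and returns the parsed JSON body.
async function send({ url, options }) {
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status} ${response.statusText}`);
  }
  return response.json();
}

// Creates a session, starts it, and merges both responses into one object.
// Assumption: the create response exposes session_id at the top level.
async function createAndStartSession(apiKey, avatarId) {
  const created = await send(buildCreateSessionRequest(apiKey, avatarId));
  const started = await send(buildStartSessionRequest(apiKey, created.session_id));
  return { ...created, ...started };
}
```

Keeping the request builders separate from the network call makes them easy to inspect or unit-test without contacting the API.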

2. Connect to Daily.co Room (Video)

The avatar server streams video into the Daily.co room. Your client connects to receive it:
// Use Daily.co JavaScript SDK
import DailyIframe from '@daily-co/daily-js';

const callFrame = DailyIframe.createFrame({
  showLeaveButton: true,
  iframeStyle: {
    position: 'fixed',
    width: '100%',
    height: '100%',
  }
});

// Register event handlers before joining so that no early events are missed
callFrame.on('participant-joined', (event) => {
  // Fires for every remote participant; the avatar server is one of them
  console.log('Participant joined:', event.participant.user_name);
});

callFrame.on('track-started', (event) => {
  // Video and audio tracks from the avatar
  console.log('Track started:', event.track.kind);
});

// Join the Daily.co room with the provided credentials
await callFrame.join({
  url: session.daily_room.url,
  token: session.daily_room.token
});

// The avatar video will automatically appear in the room
No WebRTC Setup Required: Daily.co handles all video streaming complexity. You don’t need to manage WebRTC connections - just join the room and send audio commands.

3. Connect to Avatar Server WebSocket (Audio Commands)

Connect to the avatar server WebSocket to send audio commands:
// Connect to avatar server WebSocket
const ws = new WebSocket(ws_uri); // From start session response

ws.onopen = () => {
    console.log('Connected to avatar server');
    
    // Initialize with session credentials + Daily room info
    ws.send(JSON.stringify({
        type: 'session.init',
        config: {
            session_id: sessionId,
            access_token: accessToken,
            room: {
                platform: 'daily',
                url: session.daily_room.url,
                token: session.daily_room.token,
                display_name: 'AI Avatar (AH)'
            },
            video_width: 1280,
            video_height: 720
        }
    }));
};

ws.onmessage = (event) => {
    const message = JSON.parse(event.data);
    
    if (message.type === 'connection.established') {
        console.log('Session initialized, ready to send audio');
    } else if (message.type === 'agent.speak.confirmed') {
        console.log('Audio received:', message.audio_samples, 'samples');
        // Video will appear in Daily.co room automatically
    }
};
WebSocket Purpose: The WebSocket is only for sending audio commands (agent.speak, agent.interrupt). Video is streamed directly to the Daily.co room by the avatar server.

4. Send Audio for Video Generation

Send audio via WebSocket to the avatar server as 16-bit, mono PCM encoded in base64. If you send 48kHz audio, include sample_rate: 48000.
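// Decode an encoded audio file (arrayBuffer, e.g. from fetch or a file input).
// decodeAudioData resamples to the context's rate, so the result is 48 kHz.
const audioContext = new AudioContext({ sampleRate: 48000 });
const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
// Use the first channel only; the server expects mono
const channelData = audioBuffer.getChannelData(0);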
const samples = new Int16Array(channelData.length);

// Convert to 16-bit PCM
for (let i = 0; i < channelData.length; i++) {
    const s = Math.max(-1, Math.min(1, channelData[i]));
    samples[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
}

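// Encode to base64 in slices; spreading the whole array into
// String.fromCharCode can exceed the engine's argument limit on long clips
const bytes = new Uint8Array(samples.buffer);
let binary = '';
for (let i = 0; i < bytes.length; i += 0x8000) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + 0x8000));
}
const base64Audio = btoa(binary);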

ws.send(JSON.stringify({
    type: 'agent.speak',
    audio: base64Audio,
    sample_rate: 48000
}));

Connection Flow

Key Points:
  1. Client joins Daily.co room to receive video
  2. Client connects to WebSocket to send audio commands
  3. Avatar server automatically joins Daily.co room as a participant
  4. Client sends audio via WebSocket
  5. Avatar server generates and streams video to Daily.co room
  6. Client receives video from Daily.co room (handled by Daily.co infrastructure)

Audio Requirements

| Parameter | Value | Description |
|---|---|---|
| Format | Raw PCM | Raw audio bytes (not WAV file) |
| Sample Rate | 48000 Hz | Input sample rate |
| Channels | Mono | Single channel audio |
| Bit Depth | 16-bit signed | Audio bit depth |
| Encoding | Base64 | Encode raw bytes as base64 string |
| Max Size | 100 MB | Per WebSocket message |

Important: Send raw PCM bytes, not WAV file format. The audio will be automatically resampled to 16kHz internally and synchronized with video playback.
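
If your audio is stored as a WAV file, the container header has to be stripped before sending. The sketch below shows one way to pull the raw samples out of a standard RIFF/WAVE buffer and reject anything that is not 16-bit mono integer PCM. `extractPcmFromWav` and `readTag` are hypothetical helper names, not part of the API, and the parser only handles plain PCM files.

```javascript
// Reads a 4-character ASCII chunk tag from a DataView.
function readTag(view, offset) {
  return String.fromCharCode(
    view.getUint8(offset), view.getUint8(offset + 1),
    view.getUint8(offset + 2), view.getUint8(offset + 3)
  );
}

// Returns { sampleRate, pcm } where pcm is a Uint8Array of raw little-endian samples.
function extractPcmFromWav(arrayBuffer) {
  const view = new DataView(arrayBuffer);
  if (view.byteLength < 12 || readTag(view, 0) !== 'RIFF' || readTag(view, 8) !== 'WAVE') {
    throw new Error('Not a RIFF/WAVE file');
  }
  let format = null;
  let offset = 12; // first chunk starts right after the RIFF header
  while (offset + 8 <= view.byteLength) {
    const id = readTag(view, offset);
    const size = view.getUint32(offset + 4, true);
    const body = offset + 8;
    if (id === 'fmt ') {
      format = {
        encoding: view.getUint16(body, true), // 1 = integer PCM
        channels: view.getUint16(body + 2, true),
        sampleRate: view.getUint32(body + 4, true),
        bitsPerSample: view.getUint16(body + 14, true),
      };
    } else if (id === 'data') {
      if (!format) throw new Error('data chunk found before fmt chunk');
      if (format.encoding !== 1 || format.channels !== 1 || format.bitsPerSample !== 16) {
        throw new Error('Expected 16-bit mono integer PCM');
      }
      const end = Math.min(body + size, view.byteLength);
      return { sampleRate: format.sampleRate, pcm: new Uint8Array(arrayBuffer.slice(body, end)) };
    }
    offset = body + size + (size % 2); // chunks are padded to an even length
  }
  throw new Error('No data chunk found');
}
```

The returned `sampleRate` can be passed as `sample_rate` in the `agent.speak` message, and `pcm` is what gets base64-encoded.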

Video Output

| Parameter | Value | Description |
|---|---|---|
| Format | H.264 | Video codec |
| Frame Rate | 25 FPS | Default frame rate |
| Resolution | Avatar dependent | Landscape (16:9), Portrait (9:16), or Square (1:1) |
| Delivery | Daily.co Room | Video and audio tracks |
| Latency | ~500ms | End-to-end latency |

Message Types Overview

Client → Server (WebSocket)

  • session.init - Initialize session with credentials
  • agent.speak - Send audio for video generation
  • agent.interrupt - Stop current playback
→ Full Client Message Reference
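
As an illustration, the three client messages can be produced by small builder functions so that field names live in one place. This is a sketch only: the function names are hypothetical, the `session.init` and `agent.speak` fields mirror the Quick Start snippets, and the shape of `agent.interrupt` (a bare `type` field) is an assumption to check against the full reference.

```javascript
// Builds the session.init payload shown in the Quick Start.
function buildSessionInit({
  sessionId, accessToken, roomUrl, roomToken,
  displayName = 'AI Avatar (AH)', width = 1280, height = 720,
}) {
  return JSON.stringify({
    type: 'session.init',
    config: {
      session_id: sessionId,
      access_token: accessToken,
      room: { platform: 'daily', url: roomUrl, token: roomToken, display_name: displayName },
      video_width: width,
      video_height: height,
    },
  });
}

// Builds an agent.speak payload from base64-encoded 16-bit mono PCM.
function buildSpeak(base64Audio, sampleRate = 48000) {
  if (!base64Audio) throw new Error('No audio data provided');
  return JSON.stringify({ type: 'agent.speak', audio: base64Audio, sample_rate: sampleRate });
}

// Assumption: the interrupt message carries no fields besides its type.
function buildInterrupt() {
  return JSON.stringify({ type: 'agent.interrupt' });
}
```

Usage would look like `ws.send(buildSessionInit({ ... }))` in `ws.onopen`, followed by `ws.send(buildSpeak(base64Audio))` once the session is ready.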

Server → Client (WebSocket)

  • connection.established - Session ready
  • agent.speak.confirmed - Audio received confirmation
  • agent.interrupt.confirmed - Interrupt confirmed
  • error - Error occurred
→ Full Server Message Reference
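
One possible way to handle these four message types is a small router that parses each payload and calls a matching callback. `createMessageRouter` and the handler names (`onReady`, `onSpeakConfirmed`, and so on) are hypothetical; only the `type` strings come from the list above, and the fields inside each message should be taken from the full reference.

```javascript
// Builds a function that parses one raw WebSocket payload and calls the matching handler.
// The returned function yields true when the message type was recognized, false otherwise.
function createMessageRouter(handlers = {}) {
  const routes = {
    'connection.established': handlers.onReady,
    'agent.speak.confirmed': handlers.onSpeakConfirmed,
    'agent.interrupt.confirmed': handlers.onInterruptConfirmed,
    error: handlers.onError,
  };
  return function route(rawData) {
    let message;
    try {
      message = JSON.parse(rawData);
    } catch (err) {
      if (handlers.onInvalid) handlers.onInvalid(rawData, err);
      return false;
    }
    if (!message || !Object.prototype.hasOwnProperty.call(routes, message.type)) {
      if (handlers.onUnknown) handlers.onUnknown(message);
      return false;
    }
    const handler = routes[message.type];
    if (handler) handler(message); // known type, handler is optional
    return true;
  };
}
```

It can be attached with `ws.onmessage = (event) => route(event.data);`, which keeps the socket wiring separate from the per-message logic.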

Common Errors

| Error | Cause | Solution |
|---|---|---|
| First message must be session.init | WebSocket initialized incorrectly | Send session.init as the first message after connecting |
| Invalid room platform | Missing/invalid room.platform | Set room.platform to daily |
| Session not found | Invalid/expired session | Create new session |
| Failed to process audio | Wrong audio format | Use 16-bit mono PCM and set the correct sample_rate |
| No audio data provided | Empty audio field | Include base64-encoded audio |

→ Complete Error Reference
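
For logging or developer tooling, the errors listed above can be mapped to their suggested fixes in code. The sketch below is illustrative: `suggestFix` is a hypothetical helper, it takes the error text as a plain string, and which field of the server's `error` message carries that text is not specified here, so that lookup is left to the caller.

```javascript
// Error texts and remedies copied from the Common Errors list.
const KNOWN_ERRORS = [
  { match: 'First message must be session.init', fix: 'Send session.init as the first message after connecting' },
  { match: 'Invalid room platform', fix: 'Set room.platform to daily' },
  { match: 'Session not found', fix: 'Create new session' },
  { match: 'Failed to process audio', fix: 'Use 16-bit mono PCM and set the correct sample_rate' },
  { match: 'No audio data provided', fix: 'Include base64-encoded audio' },
];

// Returns the suggested fix for a known error text, or null when nothing matches.
function suggestFix(errorText) {
  const text = String(errorText).toLowerCase();
  const hit = KNOWN_ERRORS.find((entry) => text.includes(entry.match.toLowerCase()));
  return hit ? hit.fix : null;
}
```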

Complete Examples

We provide full working examples for multiple platforms:

Browser (HTML + JavaScript)

<video id="avatar-video" autoplay playsinline></video>
<script>
  // Full implementation with UI controls
  // See complete example →
</script>

Python Client

import asyncio
from avatar_client import AvatarClient

async def main():
    # session_id, access_token, and ws_uri come from the Create/Start Session API responses
    client = AvatarClient(session_id, access_token, ws_uri)
    await client.connect()
    await client.send_audio("speech.wav")
    # Video appears in Daily.co room automatically

asyncio.run(main())

Node.js Client

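const AvatarClient = require('./avatar-client');

// Top-level await is not available in CommonJS modules, so wrap the calls
async function main() {
  // Get ws_uri from Start Session API
  const client = new AvatarClient(sessionId, accessToken, wsUri);
  await client.connect();
  await client.sendAudio('speech.wav');
  // Video appears in Daily.co room automatically
}

main().catch(console.error);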
→ View All Complete Examples

Best Practices

Session Management

  • Always send session.init first after connecting to WebSocket
  • Reuse sessions for multiple audio requests
  • When finished: close the WebSocket, leave the Daily.co room, and call POST /v1/sessions/{session_id}/end
  • Implement reconnection logic with exponential backoff
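
A reconnection loop with exponential backoff could look roughly like the sketch below. The names `backoffDelay` and `connectWithRetry`, the 500 ms base, the 30 s cap, and the use of random jitter are all illustrative choices rather than documented requirements; `connect` stands for any function of yours that opens the WebSocket and returns a promise.

```javascript
// Delay before retry number `attempt` (0-based): a random value below
// min(maxMs, baseMs * 2^attempt). The jitter spreads out simultaneous reconnects.
function backoffDelay(attempt, { baseMs = 500, maxMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Calls connect() until it succeeds or maxAttempts is reached, sleeping between attempts.
async function connectWithRetry(connect, options = {}) {
  const {
    maxAttempts = 5,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelay(attempt, options));
      }
    }
  }
  throw lastError;
}
```

After a successful reconnect, `session.init` still has to be the first message sent on the new socket.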

Audio Quality

  • Use clean audio without background noise
  • Normalize audio levels to prevent clipping
  • Send 5-10 second chunks for optimal performance
  • Validate format before sending (16-bit, mono PCM + correct sample_rate, e.g. 48000)
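
To follow the chunking advice, longer recordings can be split before encoding. The sketch below cuts an `Int16Array` of mono samples into fixed-duration pieces and estimates the base64 size of each piece for comparison with the 100 MB per-message limit. `splitPcmIntoChunks` and `base64Size` are hypothetical helper names, and the 5-second default is simply the lower end of the suggested range.

```javascript
// Splits 16-bit mono samples into chunks of at most chunkSeconds each.
// slice() copies, so every chunk owns its buffer and chunk.buffer is safe to encode.
function splitPcmIntoChunks(samples, sampleRate, chunkSeconds = 5) {
  if (!(samples instanceof Int16Array)) {
    throw new TypeError('samples must be an Int16Array');
  }
  const samplesPerChunk = Math.floor(sampleRate * chunkSeconds);
  if (samplesPerChunk <= 0) {
    throw new RangeError('sampleRate * chunkSeconds must be at least 1');
  }
  const chunks = [];
  for (let start = 0; start < samples.length; start += samplesPerChunk) {
    chunks.push(samples.slice(start, start + samplesPerChunk));
  }
  return chunks;
}

// Length in characters of the base64 string for `sampleCount` 16-bit samples.
function base64Size(sampleCount) {
  const rawBytes = sampleCount * 2; // 2 bytes per sample
  return Math.ceil(rawBytes / 3) * 4; // base64 emits 4 characters per 3 bytes
}
```

At 48 kHz, a 5-second chunk is 240,000 samples, or 480,000 raw bytes, which encodes to 640,000 base64 characters.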

Daily.co Connection

  • Join the Daily.co room before sending audio
  • Monitor participant events to know when avatar joins
  • Handle network disconnections gracefully
  • The avatar server automatically joins as a participant
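
One way to hold back audio until the avatar is present is to watch the Daily call object for the avatar's participant. The sketch below is an assumption-laden example: `onAvatarJoined` is a hypothetical helper, and it assumes the `display_name` sent in `session.init` shows up as the participant's `user_name` in Daily. It relies only on the call object's `participants()`, `on()`, and `off()` methods.

```javascript
// Invokes callback once with the avatar's participant object.
// Returns a function that cancels the wait.
function onAvatarJoined(call, displayName, callback) {
  const isAvatar = (p) => Boolean(p) && !p.local && p.user_name === displayName;

  // The avatar may already be in the room when this is called.
  const existing = Object.values(call.participants() || {}).find(isAvatar);
  if (existing) {
    callback(existing);
    return () => {};
  }

  const handler = (event) => {
    if (event && isAvatar(event.participant)) {
      call.off('participant-joined', handler);
      callback(event.participant);
    }
  };
  call.on('participant-joined', handler);
  return () => call.off('participant-joined', handler);
}
```

For example, `onAvatarJoined(callFrame, 'AI Avatar (AH)', () => ws.send(speakMessage))` would defer the first `agent.speak` until the avatar participant is visible in the room.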

Performance

  • Keep WebSocket connections alive for multiple requests
  • Implement proper error handling
  • Monitor network quality
  • Use appropriate chunk sizes
→ Complete Best Practices Guide

Troubleshooting

Daily.co Connection Issues

  • Verify Daily.co room URL and token are valid
  • Check network connectivity
  • Monitor participant events in Daily.co room

Audio/Video Sync Issues

  • Check network latency
  • Ensure consistent audio format (48kHz, 16-bit, mono PCM)
  • Monitor video playback in Daily.co room

Performance Issues

  • Verify audio format and chunk size
  • Check network bandwidth
  • Monitor server response times
→ Full Troubleshooting Guide

Support

Need help? Visit our support portal or check the API status.