Building a Full-Stack LiveKit Voice AI Agent with Twilio SIP and Simli Avatars

Project: LiveKit Voice AI Agent (livekit-1 → agent-debate)

Tech Stack: Python, LiveKit, Deepgram STT, OpenAI LLM, Cartesia TTS, Twilio SIP, Simli, Docker, ngrok, AWS S3, MongoDB

Category: Voice AI, Real-Time Communication, Agentic Systems

Background

The goal was a real-time AI voice agent that answers phone calls over Twilio SIP, appears as a speaking Simli avatar, and supports multi-agent handoff. The project later evolved into "Agentic Wars", a platform where AI agents debate each other in real time.

Challenge 1: Full-Stack LiveKit Setup from Scratch

Problem: The LiveKit server, ngrok tunnels, Docker Compose stack, SIP integration, and JavaScript frontend all depend on one another and had to be set up at the same time.

Solution: Claude implemented the entire system in parallel:

LiveKit server + nginx reverse proxy in Docker Compose
ngrok tunnels for external access (browser + Simli avatar)
Python agent workers: avatar_agent.py with Simli face rendering
Twilio → LiveKit SIP trunk configuration

Architecture:

Twilio Phone → SIP → LiveKit Server → Python Agent Worker
                                    ↓
                            Deepgram STT → OpenAI → Cartesia TTS
                                    ↓
                              Simli Avatar (WebRTC)
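
The STT → LLM → TTS chain in the diagram can be sketched as a single conversational turn. This is a minimal illustration under stated assumptions, not the project's code and not LiveKit's API: the VoicePipeline class and the stage functions (fake_stt, fake_llm, fake_tts) are hypothetical stand-ins for the Deepgram, OpenAI, and Cartesia clients.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; real Deepgram/OpenAI/Cartesia clients
# stream data asynchronously and are not modeled here.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class VoicePipeline:
    """One conversational turn: caller audio in, agent audio out."""
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def handle_turn(self, caller_audio: bytes) -> bytes:
        transcript = self.stt(caller_audio)   # Deepgram's role
        reply_text = self.llm(transcript)     # OpenAI's role
        return self.tts(reply_text)           # Cartesia's role


# Stand-in stages so the sketch runs without network access.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply_audio = pipeline.handle_turn(b"hello")
print(reply_audio)
```

Passing the stages in as callables keeps each provider swappable, which is one way to test agent logic without live API keys.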

Challenge 2: Connecting Twilio SIP to LiveKit

Problem: SIP trunk configuration between Twilio and LiveKit requires specific codec, DTMF, and auth settings that aren't well documented.

Solution: Configured the SIP trunk with:

PCMU/PCMA codecs (standard telephony)
LiveKit SIP dispatch rules for routing incoming calls to agent rooms
ngrok HTTPS endpoint as the Twilio webhook target (dev environment)
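
A dispatch rule decides which room an incoming SIP call lands in. The sketch below models that routing decision in plain Python. It does not call LiveKit's API or mirror its configuration schema; the DispatchRule class, the "call-" prefix, and the phone numbers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DispatchRule:
    """Hypothetical model of a per-call dispatch rule (not LiveKit's schema)."""
    room_prefix: str
    allowed_numbers: frozenset  # dialed numbers this rule accepts

    def room_for(self, dialed_number: str, call_id: str) -> Optional[str]:
        # Reject calls to numbers the trunk is not configured for.
        if dialed_number not in self.allowed_numbers:
            return None
        # One room per call, so each caller gets a dedicated agent session.
        return f"{self.room_prefix}{call_id}"


rule = DispatchRule(room_prefix="call-", allowed_numbers=frozenset({"+15555550100"}))
accepted_room = rule.room_for("+15555550100", "abc123")
rejected_room = rule.room_for("+15555550199", "def456")
print(accepted_room, rejected_room)
```

Giving every call its own room is only one possible policy; a shared room per dialed number would be another.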

Challenge 3: Recording Storage Architecture

Problem: Generated video recordings and agent interaction outputs needed to be stored and browsable. They were first stored locally, then migrated to AWS S3.

Solution:

Phase 1: Local storage with static file serving
Phase 2: Upload to AWS S3 when a recording completes, plus a MongoDB document holding the video metadata (S3 URL, duration, participants, timestamp)
Phase 3: React UI for browsing all generated videos listed in MongoDB

python

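# After recording completes: upload the file, then store its metadata.
# upload_to_s3, db (presumably a pymongo database handle), recording_path,
# and duration_secs are defined elsewhere in the project.
from datetime import datetime, timezone

s3_url = upload_to_s3(recording_path, bucket="agent-wars-recordings")
db.recordings.insert_one({
    "s3_url": s3_url,
    "participants": ["agent_1", "agent_2"],
    "duration": duration_secs,
    # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12.
    "timestamp": datetime.now(timezone.utc),
})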
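
The snippet above calls upload_to_s3 without showing it. One possible shape for that helper is sketched below. The function body, the "recordings/" key prefix, and the us-east-1 region are assumptions; the only real API it relies on is the upload_file(Filename, Bucket, Key) method of a boto3 S3 client, which is passed in as a parameter so the sketch can run against a stub.

```python
from pathlib import Path
from urllib.parse import quote


def upload_to_s3(recording_path, bucket, client, region="us-east-1"):
    """Upload a recording and return its virtual-hosted-style S3 URL.

    `client` is expected to behave like boto3.client("s3").
    """
    key = f"recordings/{Path(recording_path).name}"  # assumed key layout
    client.upload_file(Filename=str(recording_path), Bucket=bucket, Key=key)
    # Percent-encode the key so spaces and similar characters survive in a URL.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


class StubS3Client:
    """Records calls instead of contacting AWS; for local runs only."""

    def __init__(self):
        self.uploads = []

    def upload_file(self, Filename, Bucket, Key):
        self.uploads.append((Filename, Bucket, Key))


stub = StubS3Client()
s3_url = upload_to_s3("/tmp/debate 01.mp4", "agent-wars-recordings", stub)
print(s3_url)
```

With real credentials the stub would be replaced by boto3.client("s3"). The returned URL is only directly fetchable if the object is publicly readable; a browsing UI for a private bucket would more likely need presigned URLs.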

Challenge 4: Multi-Agent Handoff

Problem: A single agent needed to hand off a conversation to a specialized agent (e.g., from a general assistant to a domain expert) without dropping the call.

Solution: Used LiveKit's room-based architecture: agents join and leave rooms, and a dispatch agent routes the conversation based on its context. The avatar_agent.py agent and the specialist agent briefly share the same room during handoff, so the caller stays connected throughout.
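
The overlap during handoff can be modeled as an ordering constraint: the specialist joins before the previous agent leaves, so the room never lacks an agent. The sketch below is plain Python rather than LiveKit SDK calls; Room, pick_specialist, hand_off, the keyword table, and every agent name other than avatar_agent are hypothetical.

```python
from dataclasses import dataclass, field

# Assumed keyword routing table; a real dispatcher might ask an LLM instead.
SPECIALISTS = {
    "invoice": "billing_agent",
    "refund": "billing_agent",
    "error": "support_agent",
}


def pick_specialist(utterance: str):
    """Return the specialist for the first matching keyword, else None."""
    for word in utterance.lower().split():
        agent = SPECIALISTS.get(word.strip(".,!?"))
        if agent:
            return agent
    return None


@dataclass
class Room:
    agents: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def join(self, agent: str) -> None:
        self.agents.add(agent)
        self.log.append(f"join:{agent}")

    def leave(self, agent: str) -> None:
        self.agents.discard(agent)
        self.log.append(f"leave:{agent}")


def hand_off(room: Room, current: str, utterance: str) -> str:
    """Switch agents if the utterance calls for a specialist."""
    target = pick_specialist(utterance)
    if target is None or target == current:
        return current        # no handoff needed
    room.join(target)         # specialist joins first (brief overlap)
    room.leave(current)       # then the previous agent leaves
    return target


room = Room()
room.join("avatar_agent")
active = hand_off(room, "avatar_agent", "I need a refund, please.")
print(active, room.log)
```

The log makes the ordering visible: the join of the specialist is recorded before the departure of the general agent.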

Key Learnings

LiveKit's Python SDK handles the heavy lifting for WebRTC, so effort can go into agent logic rather than media negotiation
ngrok is indispensable for local Twilio SIP dev/testing
MongoDB + S3 is a natural pairing for media metadata + file storage
Real-time avatar systems (Simli) add significant engagement to voice AI demos

Session Date: Early 2026 | Stack: LiveKit + Twilio + Simli + AWS