Building a Full-Stack LiveKit Voice AI Agent with Twilio SIP and Simli Avatars
Project: LiveKit Voice AI Agent (livekit-1 → agent-debate)
Tech Stack: Python, LiveKit, Deepgram STT, OpenAI LLM, Cartesia TTS, Twilio SIP, Simli, Docker, ngrok, AWS S3, MongoDB
Category: Voice AI, Real-Time Communication, Agentic Systems
Background
This was an ambitious project to build a real-time AI voice agent system that handles phone calls via Twilio SIP, renders as a speaking avatar (Simli), and supports multi-agent handoff. It eventually evolved into "Agentic Wars", a platform where AI agents debate each other in real time.
Challenge 1: Full-Stack LiveKit Setup from Scratch
Problem: Needed to simultaneously set up LiveKit server, ngrok tunnels, Docker Compose stack, SIP integration, and a JavaScript frontend — all interdependent.
Solution: Claude implemented the entire system in parallel:
- LiveKit server + nginx reverse proxy in Docker Compose
- ngrok tunnels for external access (browser + Simli avatar)
- Python agent workers: avatar_agent.py with Simli face rendering
- Twilio → LiveKit SIP trunk configuration
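As a rough illustration of what a Python agent worker in this stack does on each conversational turn, the sketch below chains three stages: speech-to-text, the LLM, and text-to-speech. It is not the project's avatar_agent.py and does not use the LiveKit SDK; `VoicePipeline` and the toy lambda stages are hypothetical stand-ins for Deepgram, OpenAI, and Cartesia so that the flow can run without API keys.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins: each stage is just a callable so the flow is visible.
Transcriber = Callable[[bytes], str]   # audio in, text out (Deepgram's role)
Responder = Callable[[str], str]       # user text in, reply text out (OpenAI's role)
Synthesizer = Callable[[str], bytes]   # reply text in, audio out (Cartesia's role)


@dataclass
class VoicePipeline:
    stt: Transcriber
    llm: Responder
    tts: Synthesizer

    def handle_turn(self, caller_audio: bytes) -> bytes:
        """Run one caller utterance through STT, then the LLM, then TTS."""
        transcript = self.stt(caller_audio)
        reply_text = self.llm(transcript)
        return self.tts(reply_text)


# Toy stages so the sketch runs without any network access or API keys.
pipeline = VoicePipeline(
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda text: f"You said: {text}",
    tts=lambda text: text.encode("utf-8"),
)
reply_audio = pipeline.handle_turn(b"hello")
print(reply_audio)  # b'You said: hello'
```

In a real worker these stages stream audio and text incrementally rather than running once per call, and the synthesized audio is what would drive the avatar.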
Architecture:
Twilio Phone → SIP → LiveKit Server → Python Agent Worker
                                               ↓
                             Deepgram STT → OpenAI → Cartesia TTS
                                               ↓
                                     Simli Avatar (WebRTC)
Challenge 2: Connecting Twilio SIP to LiveKit
Problem: SIP trunk configuration between Twilio and LiveKit requires specific codec, DTMF, and auth settings that aren't well-documented.
Solution: Configured the SIP trunk with:
- PCMU/PCMA codecs (standard telephony)
- LiveKit SIP dispatch rules for routing incoming calls to agent rooms
- ngrok HTTPS endpoint as the Twilio webhook target (dev environment)
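To make the trunk and dispatch-rule pieces concrete, here is a minimal sketch that builds the two JSON payloads in Python. The phone number, trunk name, and room prefix are placeholders, and the field names (`trunk`, `numbers`, `dispatch_rule`, `dispatchRuleIndividual`, `roomPrefix`) are assumed to match the JSON shape in LiveKit's SIP documentation; they should be checked against the current docs before use. The project's actual configuration is not shown in the source.

```python
import json

# Placeholder value: substitute the phone number purchased in Twilio.
TWILIO_NUMBER = "+15550100000"  # hypothetical number

# Inbound trunk: tells LiveKit which called numbers to accept calls for.
inbound_trunk = {
    "trunk": {
        "name": "twilio-inbound",  # hypothetical name
        "numbers": [TWILIO_NUMBER],
    }
}

# Dispatch rule: place each caller in their own room, named with this prefix.
dispatch_rule = {
    "dispatch_rule": {
        "rule": {
            "dispatchRuleIndividual": {"roomPrefix": "call-"},
        }
    }
}


def render(payload: dict) -> str:
    """Serialize a config payload as pretty-printed JSON."""
    return json.dumps(payload, indent=2)


print(render(inbound_trunk))
print(render(dispatch_rule))
```

The rendered JSON could be saved to files and registered with the LiveKit server through its SIP tooling. With an individual dispatch rule, each incoming call gets its own room, which is where an agent worker would join.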
Challenge 3: Recording Storage Architecture
Problem: Generated video recordings and agent interaction outputs needed to be stored and browsable — initially stored locally, then migrated to AWS S3.
Solution:
- Phase 1: Local storage with static file serving
- Phase 2: AWS S3 upload on recording completion, MongoDB document with video metadata (S3 URL, duration, participants, timestamp)
- Phase 3: Built a React UI to browse all generated videos from MongoDB
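The Phase 2 upload step depends on a helper whose body the write-up does not show. One plausible shape for it is sketched below, assuming boto3's `upload_file(Filename, Bucket, Key)` method on the S3 client. The `prefix` parameter, the `s3://` URL format, the file path, and the `FakeS3Client` class are assumptions added for illustration, so the example can run without AWS credentials.

```python
from pathlib import Path


def upload_to_s3(recording_path, bucket, client=None, prefix="recordings"):
    """Upload one file and return its s3:// URL.

    `client` is any object with boto3's S3 `upload_file(Filename, Bucket, Key)`
    method; when omitted, a real boto3 client is created.
    """
    if client is None:
        import boto3  # imported lazily so the sketch can run with a fake client

        client = boto3.client("s3")
    path = Path(recording_path)
    key = f"{prefix}/{path.name}"
    client.upload_file(str(path), bucket, key)
    return f"s3://{bucket}/{key}"


class FakeS3Client:
    """Records calls instead of talking to AWS, for local dry runs."""

    def __init__(self):
        self.uploads = []

    def upload_file(self, filename, bucket, key):
        self.uploads.append((filename, bucket, key))


fake = FakeS3Client()
url = upload_to_s3("/tmp/debate-001.mp4", "agent-wars-recordings", client=fake)
print(url)  # s3://agent-wars-recordings/recordings/debate-001.mp4
```

Passing the client in keeps the helper easy to test. In production the default branch would create a real boto3 client from the environment's AWS credentials.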
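from datetime import datetime, timezone
# After recording completes: upload the file, then store its metadata.
# upload_to_s3, db, recording_path, and duration_secs come from the surrounding agent code.
s3_url = upload_to_s3(recording_path, bucket="agent-wars-recordings")
db.recordings.insert_one({
    "s3_url": s3_url,
    "participants": ["agent_1", "agent_2"],
    "duration": duration_secs,
    "timestamp": datetime.now(timezone.utc),  # timezone-aware; utcnow() is deprecated since Python 3.12
})
Challenge 4: Multi-Agent Handoff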
Problem: A single agent needed to hand off a conversation to a specialized agent (e.g., from general assistant to a domain expert) without dropping the call.
Solution: Used LiveKit's room-based architecture: agents join and leave rooms, and a dispatch agent coordinates routing based on conversation context. During a handoff, the avatar_agent.py worker and the specialist agent briefly share the same room, so the caller is never left without an agent.
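The join-before-leave ordering described above can be sketched without the LiveKit SDK. In the example below, `Room`, `pick_agent`, `hand_off`, the agent names, and the keyword table are all hypothetical; simple keyword matching stands in for whatever context-based routing the dispatch agent actually performs.

```python
from dataclasses import dataclass, field

# Hypothetical keyword routing table: topic keyword -> specialist agent name.
SPECIALISTS = {
    "billing": "billing_agent",
    "refund": "billing_agent",
    "error": "support_agent",
}


def pick_agent(utterance: str, default: str = "general_agent") -> str:
    """Return the first specialist whose keyword appears in the utterance."""
    lowered = utterance.lower()
    for keyword, agent in SPECIALISTS.items():
        if keyword in lowered:
            return agent
    return default


@dataclass
class Room:
    """Tracks which agents are present; the caller stays connected throughout."""
    agents: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def join(self, agent: str) -> None:
        self.agents.add(agent)
        self.log.append(("join", agent))

    def leave(self, agent: str) -> None:
        self.agents.discard(agent)
        self.log.append(("leave", agent))


def hand_off(room: Room, current: str, utterance: str) -> str:
    """Bring in the chosen agent before the current one leaves."""
    target = pick_agent(utterance, default=current)
    if target != current:
        room.join(target)    # both agents briefly share the room
        room.leave(current)  # the room, and so the call, stays open
    return target


room = Room()
room.join("general_agent")
active = hand_off(room, "general_agent", "I need a refund for my last order")
print(active, sorted(room.agents))  # billing_agent ['billing_agent']
```

Joining the specialist before removing the current agent is what keeps the room occupied during the switch. Reversing the order would leave a window in which no agent is present to respond.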
Key Learnings
- LiveKit's Python SDK handles the heavy lifting for WebRTC — focus on agent logic, not media negotiation
- ngrok is indispensable for local Twilio SIP dev/testing
- MongoDB + S3 is a natural pairing for media metadata + file storage
- Real-time avatar systems (Simli) add significant engagement to voice AI demos
Session Date: Early 2026 | Stack: LiveKit + Twilio + Simli + AWS
