Building a Full-Stack LiveKit Voice AI Agent with Twilio SIP and Simli Avatars

Project: LiveKit Voice AI Agent (livekit-1 → agent-debate)

Tech Stack: Python, LiveKit, Deepgram STT, OpenAI LLM, Cartesia TTS, Twilio SIP, Simli, Docker, ngrok, AWS S3, MongoDB

Category: Voice AI, Real-Time Communication, Agentic Systems


Background

This project set out to build a real-time AI voice agent that handles phone calls over Twilio SIP, appears as a speaking avatar (Simli), and supports multi-agent handoff. It eventually evolved into "Agentic Wars", a platform where AI agents debate each other in real time.


Challenge 1: Full-Stack LiveKit Setup from Scratch

Problem: The LiveKit server, ngrok tunnels, Docker Compose stack, SIP integration, and JavaScript frontend all depend on one another and had to be set up at the same time.

Solution: Claude implemented the entire system in parallel:

  • LiveKit server + nginx reverse proxy in Docker Compose
  • ngrok tunnels for external access (browser + Simli avatar)
  • Python agent workers: avatar_agent.py with Simli face rendering
  • Twilio → LiveKit SIP trunk configuration

Architecture:

Twilio Phone → SIP → LiveKit Server → Python Agent Worker
                                    ↓
                            Deepgram STT → OpenAI → Cartesia TTS
                                    ↓
                              Simli Avatar (WebRTC)
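
The flow in the diagram can be sketched as a chain of three stages. This is a minimal, library-free sketch: VoicePipeline, fake_stt, fake_llm, and fake_tts are hypothetical stand-ins, not the LiveKit, Deepgram, OpenAI, or Cartesia APIs, and the real agent streams audio rather than passing whole turns.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; the real services stream audio frames
# instead of handling one complete turn at a time.
SpeechToText = Callable[[bytes], str]
ChatModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class VoicePipeline:
    """One conversational turn: caller audio in, agent audio out."""
    stt: SpeechToText
    llm: ChatModel
    tts: TextToSpeech

    def handle_turn(self, caller_audio: bytes) -> bytes:
        transcript = self.stt(caller_audio)  # Deepgram's role in the diagram
        reply_text = self.llm(transcript)    # OpenAI's role
        return self.tts(reply_text)          # Cartesia's role; output feeds the avatar


# Stand-in stages so the sketch runs without any API keys.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
reply_audio = pipeline.handle_turn(b"hello")
```

Keeping each stage behind a plain callable means any one provider can be swapped or stubbed in tests without touching the other two.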

Challenge 2: Connecting Twilio SIP to LiveKit

Problem: SIP trunk configuration between Twilio and LiveKit requires specific codec, DTMF, and auth settings that are not well documented.

Solution: Configured the SIP trunk with:

  • PCMU/PCMA codecs (standard telephony)
  • LiveKit SIP dispatch rules for routing incoming calls to agent rooms
  • ngrok HTTPS endpoint as Twilio webhook target (dev environment)
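
To illustrate the dispatch-rule bullet, the sketch below builds the kind of JSON payload that places each inbound caller in their own room. The field names (dispatch_rule, dispatchRuleIndividual, roomPrefix) are believed to match LiveKit's SIP dispatch rule format but should be verified against the current LiveKit SIP documentation. The "call-" prefix, the rule name, and the helper build_dispatch_rule are assumptions, not the project's actual configuration.

```python
import json


def build_dispatch_rule(room_prefix: str, name: str = "inbound-calls") -> dict:
    """Return a dispatch rule that gives every inbound call its own room.

    Hypothetical helper; verify the field names against the LiveKit SIP docs.
    """
    if not room_prefix:
        raise ValueError("room_prefix must be non-empty")
    return {
        "dispatch_rule": {
            "name": name,
            "rule": {
                "dispatchRuleIndividual": {"roomPrefix": room_prefix},
            },
        }
    }


rule = build_dispatch_rule("call-")
# Serialized form, e.g. for writing to a file that a CLI or API call consumes.
rule_json = json.dumps(rule, indent=2)
```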

Challenge 3: Recording Storage Architecture

Problem: Generated video recordings and agent interaction outputs needed to be stored and browsable. They were initially kept on local disk, then migrated to AWS S3.

Solution:

  1. Phase 1: Local storage with static file serving
  2. Phase 2: Upload to AWS S3 when a recording completes, plus a MongoDB document holding the video metadata (S3 URL, duration, participants, timestamp)
  3. Phase 3: React UI for browsing all generated videos listed in MongoDB

python
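# After recording completes: upload the file, then store its metadata.
# upload_to_s3, db, recording_path, and duration_secs are defined elsewhere in the project.
from datetime import datetime, timezone

s3_url = upload_to_s3(recording_path, bucket="agent-wars-recordings")
db.recordings.insert_one({
    "s3_url": s3_url,
    "participants": ["agent_1", "agent_2"],
    "duration": duration_secs,
    # Timezone-aware timestamp; datetime.utcnow() is deprecated since Python 3.12.
    "timestamp": datetime.now(timezone.utc),
})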
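
The snippet above calls upload_to_s3 without showing it. One possible shape for that helper is sketched below, assuming boto3 is installed and AWS credentials are configured. The key layout (recordings/YYYY/MM/DD/<filename>), the region default, and the build_recording_key helper are assumptions for illustration, not the project's actual code.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_recording_key(recording_path: str, when: datetime) -> str:
    """Build a date-partitioned S3 object key (hypothetical layout)."""
    return f"recordings/{when:%Y/%m/%d}/{Path(recording_path).name}"


def upload_to_s3(recording_path: str, bucket: str, region: str = "us-east-1") -> str:
    """Upload a recording and return its virtual-hosted-style S3 URL."""
    import boto3  # imported lazily so the key helper works without boto3

    key = build_recording_key(recording_path, datetime.now(timezone.utc))
    boto3.client("s3", region_name=region).upload_file(recording_path, bucket, key)
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


# The key helper is pure, so it can be checked without touching AWS.
example_key = build_recording_key(
    "/tmp/debate-42.mp4", datetime(2026, 1, 15, tzinfo=timezone.utc)
)
```

The returned URL is only directly viewable if the bucket policy allows public reads; for a private bucket, the browsing UI would need presigned URLs instead.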

Challenge 4: Multi-Agent Handoff

Problem: A single agent needed to hand off a conversation to a specialized agent (e.g., from general assistant to a domain expert) without dropping the call.

Solution: Used LiveKit's room-based architecture: agents join and leave rooms, and a dispatch agent routes the conversation based on its context. During handoff, avatar_agent.py and the specialist agent briefly share the same room, so the call is never left without an agent.
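
The overlap during handoff can be sketched as a small simulation. Room and hand_off are hypothetical stand-ins rather than LiveKit SDK classes, and "specialist_agent" is an illustrative name. The point is only the ordering: the specialist joins before the current agent leaves.

```python
from dataclasses import dataclass, field


@dataclass
class Room:
    """Toy model of a room that logs its agent roster after every change."""
    agents: set[str] = field(default_factory=set)
    history: list[frozenset[str]] = field(default_factory=list)

    def join(self, agent: str) -> None:
        self.agents.add(agent)
        self.history.append(frozenset(self.agents))

    def leave(self, agent: str) -> None:
        self.agents.discard(agent)
        self.history.append(frozenset(self.agents))


def hand_off(room: Room, current: str, specialist: str) -> None:
    """Join-then-leave ordering keeps at least one agent on the call."""
    room.join(specialist)  # both agents briefly share the room
    room.leave(current)    # only now does the original agent drop out


room = Room()
room.join("avatar_agent")
hand_off(room, current="avatar_agent", specialist="specialist_agent")
```

Reversing the two calls in hand_off would leave a moment with no agent in the room, which is the gap the shared-room overlap avoids.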


Key Learnings

  • LiveKit's Python SDK handles the heavy lifting for WebRTC, so effort can go into agent logic rather than media negotiation
  • ngrok is indispensable for local Twilio SIP dev/testing
  • MongoDB + S3 is a natural pairing for media metadata + file storage
  • Real-time avatar systems (Simli) add significant engagement to voice AI demos

Session Date: Early 2026 | Stack: LiveKit + Twilio + Simli + AWS