Building a Real-Time AI Coaching Platform for Telemarketing Agents
Sales calls are high-pressure, fast-moving, and full of signals that most agents miss in the moment. What if an AI could listen to every call, understand the conversation in real time, and quietly coach the agent — all before the next sentence is spoken?
That's exactly what we built. This post covers the architecture, key engineering decisions, and lessons learned from building a production-grade real-time AI coaching system for telemarketing teams.
The Problem
Telemarketing agents handle dozens of calls a day. Each call is different — different customer profile, different objection, different buying signal. Traditional coaching happens after the call, which is too late. Supervisors can't monitor every call in real time. Script cards don't adapt to context.
We needed a system that could:
- Listen to live call audio from a PBX/dialer
- Transcribe both the agent and the customer in real time
- Generate contextual coaching (objections, next best action, sentiment) every few seconds
- Surface insights on the agent's dashboard while the call is still happening
System Architecture
At a high level, audio flows in, gets transcribed, gets analyzed, and results get pushed to the agent's browser — all within seconds.
PBX / Dialer (Audio Source)
↓
Backend API Server
├─ Dual-channel ASR (Agent + Customer streams)
├─ Rolling transcript buffer
├─ LLM Inference (batched, every 15s)
├─ Business dossier context injection
└─ WebSocket push → Agent Dashboard
↓
React Dashboard (Live UI)
└─ Transcript, Coaching, Dossier, Call Summary

Dual-Channel Audio Ingestion
One of the earliest decisions was how to handle audio. Most simple approaches merge agent and customer audio into one stream, which makes diarization (speaker separation) unreliable.
We took a different route: two independent WebSocket connections per call, one for the agent channel and one for the customer channel. The audio source (a PBX or softphone adapter) pushes raw binary PCM frames to:
/audio/ingest?call_id=XXXX&channel=agent
/audio/ingest?call_id=XXXX&channel=customer

Each channel gets its own ASR instance. This gives us clean, speaker-labeled transcripts without any post-processing diarization heuristics. The tradeoff is double the ASR connections, but the accuracy gain is well worth it.
Audio format: PCM s16le, mono, 8kHz (telephony standard) or 16kHz (wideband). Frames are forwarded directly to the ASR WebSocket — no intermediate buffering.
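To make the routing concrete, here is a minimal sketch of how an ingest endpoint could parse the query string and forward frames to the right per-channel ASR leg. All names (`parseIngestUrl`, `asrSockets`, `routeFrame`) are illustrative, not our actual code:

```typescript
type Channel = "agent" | "customer";

interface IngestTarget {
  callId: string;
  channel: Channel;
}

// Parse /audio/ingest?call_id=XXXX&channel=agent into a routing target.
function parseIngestUrl(url: string): IngestTarget {
  const params = new URL(url, "http://localhost").searchParams;
  const callId = params.get("call_id");
  const channel = params.get("channel");
  if (!callId || (channel !== "agent" && channel !== "customer")) {
    throw new Error(`bad ingest URL: ${url}`);
  }
  return { callId, channel };
}

// One ASR connection per (call, channel); frames are forwarded as-is,
// with no intermediate buffering.
const asrSockets = new Map<string, (frame: Uint8Array) => void>();

function routeFrame(url: string, frame: Uint8Array): void {
  const { callId, channel } = parseIngestUrl(url);
  asrSockets.get(`${callId}:${channel}`)?.(frame); // drop silently if the ASR leg is down
}
```

The key point is that the speaker identity is encoded in the connection itself, so no audio-level diarization is ever needed.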
Speech-to-Text: Streaming ASR
We use a streaming ASR API that supports real-time transcription with VAD (voice activity detection). Key design choices:
- Per-channel ASR connections with exponential backoff reconnection (2s → 4s → 8s, 3 retries)
- Speaker labels injected at the source: transcripts from the agent channel are tagged Agent, those from the customer channel are tagged Customer
- The ASR emits partial and final transcripts; only final transcripts are forwarded to the coaching pipeline
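The reconnection policy is simple enough to show in full. A sketch of the backoff schedule (2s, doubling per attempt, three retries before giving up):

```typescript
// Returns how long to wait before the given reconnect attempt (0-indexed),
// or null once the retry budget is exhausted.
function reconnectDelayMs(attempt: number, baseMs = 2000, maxRetries = 3): number | null {
  if (attempt >= maxRetries) return null; // give up; mark the channel failed
  return baseMs * 2 ** attempt;           // 2000, 4000, 8000
}
```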
The Coaching Pipeline
This is where the interesting engineering lives.
Rolling Transcript Buffer
We maintain a token-limited rolling window of the transcript — capped at ~600 tokens. As new transcript segments arrive, older ones are trimmed from the front. This ensures:
- The LLM always sees the most recent, relevant context
- We never exceed the model's effective context for this task
- At least one segment is always preserved (guards against edge cases)
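A minimal sketch of that buffer, with token counting approximated by word count (production code would use the model's actual tokenizer):

```typescript
interface Segment { speaker: "Agent" | "Customer"; text: string }

class RollingTranscriptBuffer {
  private segments: Segment[] = [];
  constructor(private maxTokens = 600) {}

  private static tokens(s: Segment): number {
    return s.text.split(/\s+/).filter(Boolean).length; // crude approximation
  }

  push(segment: Segment): void {
    this.segments.push(segment);
    // Trim from the front, but always keep at least one segment.
    while (this.segments.length > 1 && this.totalTokens() > this.maxTokens) {
      this.segments.shift();
    }
  }

  totalTokens(): number {
    return this.segments.reduce((n, s) => n + RollingTranscriptBuffer.tokens(s), 0);
  }

  render(): string {
    return this.segments.map((s) => `${s.speaker}: ${s.text}`).join("\n");
  }
}
```

Note the guard in `push`: a single very long utterance can exceed the cap, and we keep it rather than hand the LLM an empty transcript.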
Batching & Throttling
Calling the LLM on every new utterance is wasteful and introduces jitter. Instead, we use a 15-second batch window: transcript segments accumulate, and at the end of each window, a single LLM call is made with the full rolling buffer.
An additional 10-second throttle gate prevents overlapping calls if the previous inference ran long. This keeps the system stable under load without queue buildup.
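A simplified sketch of the window-plus-gate logic. The clock is injected so the behavior is testable; real code drives this from timers, and the throttle would also track in-flight inference rather than just the last fire time:

```typescript
class InferenceGate {
  private pending: string[] = [];
  private windowStart: number;
  private lastFiredAt = -Infinity;

  constructor(
    private now: () => number,
    private batchWindowMs = 15_000,
    private throttleMs = 10_000,
  ) {
    this.windowStart = now();
  }

  add(segment: string): void { this.pending.push(segment); }

  // Returns the batch to send if the window elapsed AND the throttle gate is
  // open; otherwise null and segments keep accumulating.
  poll(): string[] | null {
    const t = this.now();
    if (t - this.windowStart < this.batchWindowMs) return null;
    if (t - this.lastFiredAt < this.throttleMs) return null;
    if (this.pending.length === 0) { this.windowStart = t; return null; }
    const batch = this.pending;
    this.pending = [];
    this.windowStart = t;
    this.lastFiredAt = t;
    return batch;
  }
}
```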
Transcript arrives
↓
TranscriptAggregator (15s window)
↓
[10s throttle gate]
↓
LLM Inference
↓
Push to dashboard via WebSocket

Business Context Injection
Each call is associated with a business dossier — structured data about the lead (product interest, location, previous interactions, etc.). Before inference starts, we fetch this dossier from an internal API, compress it to ~250 tokens, and include it in every LLM prompt.
Dossiers are cached in Redis with a 1-hour TTL. On cache miss, we do a live fetch with a fallback lookup strategy.
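The fetch path is a standard cache-aside pattern. In this sketch the cache and live fetcher are injected to keep it library-agnostic; in production the cache is Redis with the 1-hour TTL and the fetcher hits the internal dossier API:

```typescript
interface Cache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

const DOSSIER_TTL_SECONDS = 3600; // 1-hour TTL

async function getDossier(
  leadId: string,
  cache: Cache,
  fetchLive: (leadId: string) => Promise<object>,
): Promise<object> {
  const key = `dossier:${leadId}`;
  const cached = await cache.get(key);
  if (cached !== null) return JSON.parse(cached); // cache hit
  const dossier = await fetchLive(leadId);        // live fetch on miss
  await cache.set(key, JSON.stringify(dossier), DOSSIER_TTL_SECONDS);
  return dossier;
}
```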
LLM Prompt Structure
The prompt is assembled from:
- A system prompt defining the coaching persona and output schema
- The compressed business dossier
- The rolling transcript buffer (speaker-labeled)
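Assembly is plain string work. A sketch, assuming OpenAI-style role/content messages (which matches the OpenAI-compatible API we use); the exact wording here is illustrative:

```typescript
interface ChatMessage { role: string; content: string }

function buildPrompt(systemPrompt: string, dossier: string, transcript: string): ChatMessage[] {
  return [
    { role: "system", content: systemPrompt }, // coaching persona + output schema
    {
      role: "user",
      content: `Lead dossier:\n${dossier}\n\nLive transcript:\n${transcript}\n\nReturn coaching as JSON.`,
    },
  ];
}
```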
The LLM returns structured JSON:
{
"sentiment": "positive",
"buying_intent_score": 7,
"detected_objections": ["price too high"],
"product_suggestions": ["Premium Plan", "EMI Option"],
"script_hints": "Acknowledge the price concern, pivot to value",
"compliance_flags": [],
"next_best_action": "Offer a limited-time discount"
}

LLM Provider Flexibility
We built a provider manager that supports runtime switching between LLM backends — currently a locally-hosted open-source model (via OpenAI-compatible API) and Google Gemini. Providers can be switched between calls without restarting the server. Active calls remain locked to the provider they started with.
Each provider has its own:
- Circuit breaker (5 consecutive failures → 30s pause)
- Retry logic (1 retry on 5xx, 500ms delay)
- Timeout budget (12s for inference, 10s for post-call summary)
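The circuit breaker is small enough to show. A sketch matching the numbers above (5 consecutive failures open the circuit for 30 seconds), with an injected clock for testability:

```typescript
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  constructor(
    private now: () => number,
    private threshold = 5,
    private cooldownMs = 30_000,
  ) {}

  allowRequest(): boolean {
    if (this.openedAt === null) return true;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      this.openedAt = null;           // half-open: let one request probe
      this.consecutiveFailures = 0;
      return true;
    }
    return false;
  }

  recordSuccess(): void { this.consecutiveFailures = 0; }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) this.openedAt = this.now();
  }
}
```

When a provider's circuit is open, requests for new calls can fail fast or fall back to the other provider instead of queuing behind a dead backend.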
Session Lifecycle
Every call maps to a session, which is a state machine:
CREATED → LOADING_CONTEXT → WAITING_FOR_AUDIO → STREAMING → ENDING → COMPLETED

- CREATED: Session registered, call_id assigned
- LOADING_CONTEXT: Dossier + interaction history fetched
- WAITING_FOR_AUDIO: Ready, waiting for first audio connection
- STREAMING: Both channels connected, ASR active, coaching pipeline running
- ENDING: Graceful shutdown, post-call summary generated, MongoDB record written
- COMPLETED: Resources cleaned up, Redis keys removed
Sessions can be created via REST (POST /api/v1/call/initiate) or lazily on the first WebSocket audio connection — useful for dialers that can't do a pre-call REST handshake.
Orphan cleanup runs periodically: any session stuck in STREAMING for over 2 hours is auto-ended.
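Enforcing the lifecycle as a guarded transition table keeps illegal jumps from ever happening. A sketch (the WAITING_FOR_AUDIO → ENDING edge, for calls abandoned before audio arrives, is an assumption on our part):

```typescript
type SessionState =
  | "CREATED" | "LOADING_CONTEXT" | "WAITING_FOR_AUDIO"
  | "STREAMING" | "ENDING" | "COMPLETED";

// Legal next states per state; anything not listed is rejected.
const TRANSITIONS: Record<SessionState, SessionState[]> = {
  CREATED: ["LOADING_CONTEXT"],
  LOADING_CONTEXT: ["WAITING_FOR_AUDIO"],
  WAITING_FOR_AUDIO: ["STREAMING", "ENDING"],
  STREAMING: ["ENDING"],
  ENDING: ["COMPLETED"],
  COMPLETED: [],
};

function transition(from: SessionState, to: SessionState): SessionState {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```

The orphan cleanup job then only needs one rule: any session still in STREAMING after 2 hours is forced through ENDING.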
Real-Time Push to the Dashboard
The agent dashboard connects to a WebSocket endpoint and receives a stream of typed events:
| Event | Description |
|---|---|
| session_info | Session metadata on connect |
| transcript_final | New speaker-labeled transcript segment |
| ai_response | Full coaching JSON from LLM |
| lead_dossier | Business context for the current lead |
| call_state_change | State machine transitions |
| call_summary | Post-call summary on end |
The frontend accumulates transcript segments and AI responses into a timeline. The latest ai_response is always surfaced prominently; older responses are collapsed into an inference history panel.
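A discriminated union makes this event stream type-safe on the frontend, and the accumulation logic reduces to a small pure function. Payload shapes here are illustrative:

```typescript
type DashboardEvent =
  | { type: "session_info"; callId: string }
  | { type: "transcript_final"; speaker: "Agent" | "Customer"; text: string }
  | { type: "ai_response"; coaching: unknown }
  | { type: "lead_dossier"; dossier: unknown }
  | { type: "call_state_change"; state: string }
  | { type: "call_summary"; summary: string };

interface Timeline {
  transcript: string[];
  latestCoaching: unknown;   // surfaced prominently
  history: unknown[];        // older responses, collapsed
}

function reduceEvent(t: Timeline, e: DashboardEvent): Timeline {
  switch (e.type) {
    case "transcript_final":
      return { ...t, transcript: [...t.transcript, `${e.speaker}: ${e.text}`] };
    case "ai_response":
      return {
        ...t,
        history: t.latestCoaching ? [...t.history, t.latestCoaching] : t.history,
        latestCoaching: e.coaching,
      };
    default:
      return t; // other events are handled outside the timeline
  }
}
```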
Audio Recording (Optional)
For QA and training purposes, we optionally record calls to S3 as MP3 files. This is entirely fire-and-forget — it never blocks the main call flow.
Each call produces three files:
- agent.mp3 — mono agent channel
- customer.mp3 — mono customer channel
- mixed.mp3 — stereo blend
PCM chunks are buffered in memory (capped at 50 MB per channel) during the call, then encoded to MP3 using lamejs and uploaded on call end.
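A sketch of the memory-capped accumulator. Here the cap simply stops further appends, which is an assumption; spilling to a temp file once the cap is hit would be a reasonable alternative:

```typescript
const MAX_BYTES_PER_CHANNEL = 50 * 1024 * 1024; // 50 MB

class RecordingBuffer {
  private chunks: Uint8Array[] = [];
  private bytes = 0;

  constructor(private maxBytes = MAX_BYTES_PER_CHANNEL) {}

  // Returns false once the cap would be exceeded; the chunk is dropped.
  append(chunk: Uint8Array): boolean {
    if (this.bytes + chunk.length > this.maxBytes) return false;
    this.chunks.push(chunk);
    this.bytes += chunk.length;
    return true;
  }

  // Concatenate once at call end, just before MP3 encoding and upload.
  finish(): Uint8Array {
    const out = new Uint8Array(this.bytes);
    let offset = 0;
    for (const c of this.chunks) { out.set(c, offset); offset += c.length; }
    return out;
  }

  get size(): number { return this.bytes; }
}
```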
Frontend: The Agent Dashboard
The React dashboard has two main views:
Calls List — Shows active sessions, a health indicator (DB latency), and a form to start a new call.
Live Call View — The main workspace:
- Left panel: real-time scrolling transcript with speaker labels and timestamps
- Right sidebar: latest AI coaching card (sentiment, intent score, objections, suggestions, compliance)
- Bottom panel: expandable inference history (full timeline of LLM outputs)
- Dossier panel: lead context with pinnable/hideable fields
- Audio relay controls: toggle live audio playback with per-channel volume and mute
State is managed entirely via a single useCallSession hook — no Redux or Zustand. The hook owns the WebSocket connection, transcript accumulation, coaching response history, and call lifecycle.
Key Engineering Lessons
1. Separate audio channels from the start.
Don't merge and diarize — route from the source. The accuracy difference is significant for downstream NLP.
2. Batch LLM calls aggressively.
Per-utterance inference sounds appealing but creates latency spikes and rate-limit pressure. A 15-second window gives the LLM meaningful context and keeps costs predictable.
3. Fire-and-forget non-blocking work.
S3 uploads, Redis persistence, analytics logging — none of these should touch the hot path. Use async queues or detached promises.
4. Build your own primitives before reaching for a framework.
LangChain and similar agentic frameworks add significant overhead (latency + abstraction). For a single-turn structured extraction task with hard latency requirements, a well-typed direct API call is the right choice.
5. Design for provider swappability early.
We abstracted LLM providers behind an interface from day one. Switching between a local model and a cloud model at runtime — without disrupting active calls — proved invaluable for cost and reliability management.
What's Next
- Tool calling for dynamic dossier enrichment — let the LLM request specific lead attributes on demand rather than bulk-injecting everything
- Streaming LLM output — partial coaching tokens streamed to the dashboard as the model generates them
- Offline analytics agent — a separate pipeline to analyze call patterns, coaching quality, and conversion signals across sessions
Real-time AI coaching for sales isn't a moonshot anymore. With the right architecture — dual-channel ASR, tight batching, structured LLM output, and a push-based frontend — you can close the feedback loop from "customer said X" to "agent sees coaching" in under 15 seconds. The calls are happening anyway. You might as well make them smarter.
