Building a Real-Time AI Coaching Platform for Telemarketing Agents

Sales calls are high-pressure, fast-moving, and full of signals that most agents miss in the moment. What if an AI could listen to every call, understand the conversation in real time, and quietly coach the agent — all before the next sentence is spoken?

That's exactly what we built. This post covers the architecture, key engineering decisions, and lessons learned from building a production-grade real-time AI coaching system for telemarketing teams.


The Problem

Telemarketing agents handle dozens of calls a day. Each call is different — different customer profile, different objection, different buying signal. Traditional coaching happens after the call, which is too late. Supervisors can't monitor every call in real time. Script cards don't adapt to context.

We needed a system that could:

  • Listen to live call audio from a PBX/dialer
  • Transcribe both the agent and the customer in real time
  • Generate contextual coaching (objections, next best action, sentiment) every few seconds
  • Surface insights on the agent's dashboard while the call is still happening

System Architecture

At a high level, audio flows in, gets transcribed, gets analyzed, and results get pushed to the agent's browser — all within seconds.

PBX / Dialer (Audio Source)
        ↓
Backend API Server
├─ Dual-channel ASR (Agent + Customer streams)
├─ Rolling transcript buffer
├─ LLM Inference (batched, every 15s)
├─ Business dossier context injection
└─ WebSocket push → Agent Dashboard
        ↓
React Dashboard (Live UI)
└─ Transcript, Coaching, Dossier, Call Summary

Dual-Channel Audio Ingestion

One of the earliest decisions was how to handle audio. Most simple approaches merge agent and customer audio into one stream, which makes diarization (speaker separation) unreliable.

We took a different route: two independent WebSocket connections per call, one for the agent channel and one for the customer channel. The audio source (a PBX or softphone adapter) pushes raw binary PCM frames to:

/audio/ingest?call_id=XXXX&channel=agent
/audio/ingest?call_id=XXXX&channel=customer

Each channel gets its own ASR instance. This gives us clean, speaker-labeled transcripts without any post-processing diarization heuristics. The tradeoff is double the ASR connections, but the accuracy gain is well worth it.

Audio format: PCM s16le, mono, 8kHz (telephony standard) or 16kHz (wideband). Frames are forwarded directly to the ASR WebSocket — no intermediate buffering.
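The ingest-side routing can be sketched as follows. This is a minimal illustration assuming the URL shape above; names like parseIngestTarget are hypothetical, not our actual code:

```typescript
type Channel = "agent" | "customer";

interface IngestTarget {
  callId: string;
  channel: Channel;
}

// Parse and validate the query string of an incoming ingest connection,
// e.g. "/audio/ingest?call_id=XXXX&channel=agent". Frames arriving on the
// resulting (callId, channel) pair go straight to that channel's ASR socket.
function parseIngestTarget(url: string): IngestTarget {
  const params = new URL(url, "http://localhost").searchParams;
  const callId = params.get("call_id");
  const channel = params.get("channel");
  if (!callId) throw new Error("missing call_id");
  if (channel !== "agent" && channel !== "customer") {
    throw new Error(`unknown channel: ${channel}`);
  }
  return { callId, channel };
}
```

Rejecting unknown channels at the door keeps a misconfigured dialer from silently polluting a call's transcript.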


Speech-to-Text: Streaming ASR

We use a streaming ASR API that supports real-time transcription with VAD (voice activity detection). Key design choices:

  • Per-channel ASR connections with exponential backoff reconnection (2s → 4s → 8s, 3 retries)
  • Speaker labels injected at the source: transcripts from the agent channel are tagged Agent, customer channel tagged Customer
  • The ASR emits partial and final transcripts; only final transcripts are forwarded to the coaching pipeline
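The reconnect schedule is simple enough to show in full. A sketch, assuming the numbers above (2s base, doubling, 3 retries); the function name is illustrative:

```typescript
// Exponential backoff for a dropped ASR connection: 2s → 4s → 8s.
// Returns the delay in milliseconds before the next attempt, or null
// when retries are exhausted and the channel should be marked failed.
function reconnectDelayMs(attempt: number, baseMs = 2000, maxRetries = 3): number | null {
  if (attempt >= maxRetries) return null;
  return baseMs * 2 ** attempt; // attempt 0 → 2000, 1 → 4000, 2 → 8000
}
```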

The Coaching Pipeline

This is where the interesting engineering lives.

Rolling Transcript Buffer

We maintain a token-limited rolling window of the transcript — capped at ~600 tokens. As new transcript segments arrive, older ones are trimmed from the front. This ensures:

  • The LLM always sees the most recent, relevant context
  • We never exceed the model's effective context for this task
  • At least one segment is always preserved (guards against edge cases)
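The buffer logic above can be sketched as follows. Token counts are approximated here by word counts; the real system presumably uses the model's tokenizer, and the class and method names are illustrative:

```typescript
interface Segment {
  speaker: "Agent" | "Customer";
  text: string;
}

class RollingTranscriptBuffer {
  private maxTokens: number;
  private segs: Segment[] = [];

  constructor(maxTokens = 600) {
    this.maxTokens = maxTokens;
  }

  // Whitespace word count stands in for real tokenization.
  private tokens(s: Segment): number {
    return s.text.split(/\s+/).filter(Boolean).length;
  }

  push(seg: Segment): void {
    this.segs.push(seg);
    let total = this.segs.reduce((n, s) => n + this.tokens(s), 0);
    // Trim from the front, but always preserve at least one segment.
    while (total > this.maxTokens && this.segs.length > 1) {
      total -= this.tokens(this.segs.shift()!);
    }
  }

  window(): Segment[] {
    return [...this.segs];
  }
}
```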

Batching & Throttling

Calling the LLM on every new utterance is wasteful and introduces jitter. Instead, we use a 15-second batch window: transcript segments accumulate, and at the end of each window, a single LLM call is made with the full rolling buffer.

An additional 10-second throttle gate prevents overlapping calls if the previous inference ran long. This keeps the system stable under load without queue buildup.

Transcript arrives
      ↓
TranscriptAggregator (15s window)
      ↓
[10s throttle gate]
      ↓
LLM Inference
      ↓
Push to dashboard via WebSocket
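The window-plus-gate logic above can be sketched like this, with the clock injected so it is testable. The 15s window and 10s gate come from the design; everything else (names, shape) is illustrative:

```typescript
class InferenceGate {
  private windowMs: number;
  private throttleMs: number;
  private lastFiredAt = -Infinity;
  private pending: string[] = [];
  private windowOpenedAt: number | null = null;

  constructor(windowMs = 15_000, throttleMs = 10_000) {
    this.windowMs = windowMs;
    this.throttleMs = throttleMs;
  }

  add(segment: string, now: number): void {
    if (this.windowOpenedAt === null) this.windowOpenedAt = now;
    this.pending.push(segment);
  }

  // Returns the batch to send to the LLM, or null if we should keep waiting.
  tryFlush(now: number): string[] | null {
    if (this.windowOpenedAt === null) return null;
    if (now - this.windowOpenedAt < this.windowMs) return null; // window still open
    if (now - this.lastFiredAt < this.throttleMs) return null;  // previous call too recent
    const batch = this.pending;
    this.pending = [];
    this.windowOpenedAt = null;
    this.lastFiredAt = now;
    return batch;
  }
}
```

The throttle check is what prevents a slow inference from stacking calls: even when a window has closed, nothing fires until the gate since the last inference has elapsed.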

Business Context Injection

Each call is associated with a business dossier — structured data about the lead (product interest, location, previous interactions, etc.). Before inference starts, we fetch this dossier from an internal API, compress it to ~250 tokens, and include it in every LLM prompt.

Dossiers are cached in Redis with a 1-hour TTL. On cache miss, we do a live fetch with a fallback lookup strategy.

LLM Prompt Structure

The prompt is assembled from:

  1. A system prompt defining the coaching persona and output schema
  2. The compressed business dossier
  3. The rolling transcript buffer (speaker-labeled)

The LLM returns structured JSON:

{
  "sentiment": "positive",
  "buying_intent_score": 7,
  "detected_objections": ["price too high"],
  "product_suggestions": ["Premium Plan", "EMI Option"],
  "script_hints": "Acknowledge the price concern, pivot to value",
  "compliance_flags": [],
  "next_best_action": "Offer a limited-time discount"
}
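Never trust model output blindly: before anything reaches the dashboard, the JSON should be validated against the schema. A sketch of that check, mirroring the field names above (the validation style is illustrative):

```typescript
interface CoachingResult {
  sentiment: string;
  buying_intent_score: number;
  detected_objections: string[];
  product_suggestions: string[];
  script_hints: string;
  compliance_flags: string[];
  next_best_action: string;
}

// Parse the raw model output and reject anything that doesn't match the
// schema, so a malformed response never reaches the agent dashboard.
function parseCoaching(raw: string): CoachingResult {
  const data = JSON.parse(raw);
  const isStrArr = (v: unknown) =>
    Array.isArray(v) && v.every((x) => typeof x === "string");
  if (
    typeof data.sentiment !== "string" ||
    typeof data.buying_intent_score !== "number" ||
    !isStrArr(data.detected_objections) ||
    !isStrArr(data.product_suggestions) ||
    typeof data.script_hints !== "string" ||
    !isStrArr(data.compliance_flags) ||
    typeof data.next_best_action !== "string"
  ) {
    throw new Error("malformed coaching payload");
  }
  return data as CoachingResult;
}
```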

LLM Provider Flexibility

We built a provider manager that supports runtime switching between LLM backends — currently a locally hosted open-source model (via an OpenAI-compatible API) and Google Gemini. Providers can be switched between calls without restarting the server. Active calls remain locked to the provider they started with.

Each provider has its own:

  • Circuit breaker (5 consecutive failures → 30s pause)
  • Retry logic (1 retry on 5xx, 500ms delay)
  • Timeout budget (12s for inference, 10s for post-call summary)
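The circuit breaker from the list above can be sketched as follows, with the clock injected for testability (5 consecutive failures open the circuit for 30s; class and method names are illustrative):

```typescript
class CircuitBreaker {
  private threshold: number;
  private cooldownMs: number;
  private failures = 0;
  private openedAt: number | null = null;

  constructor(threshold = 5, cooldownMs = 30_000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // May this provider take a request right now?
  allows(now: number): boolean {
    if (this.openedAt === null) return true;
    if (now - this.openedAt >= this.cooldownMs) {
      this.openedAt = null; // cooldown elapsed → allow a probe request
      this.failures = 0;
      return true;
    }
    return false;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(now: number): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = now;
  }
}
```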

Session Lifecycle

Every call maps to a session, which is a state machine:

CREATED → LOADING_CONTEXT → WAITING_FOR_AUDIO → STREAMING → ENDING → COMPLETED
  • CREATED: Session registered, call_id assigned
  • LOADING_CONTEXT: Dossier + interaction history fetched
  • WAITING_FOR_AUDIO: Ready, waiting for first audio connection
  • STREAMING: Both channels connected, ASR active, coaching pipeline running
  • ENDING: Graceful shutdown, post-call summary generated, MongoDB record written
  • COMPLETED: Resources cleaned up, Redis keys removed

Sessions can be created via REST (POST /api/v1/call/initiate) or lazily on the first WebSocket audio connection — useful for dialers that can't do a pre-call REST handshake.

Orphan cleanup runs periodically: any session stuck in STREAMING for over 2 hours is auto-ended.
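The lifecycle above amounts to a linear transition table, which makes illegal jumps easy to reject. A sketch, assuming transitions are strictly forward (the guard function is illustrative):

```typescript
type SessionState =
  | "CREATED" | "LOADING_CONTEXT" | "WAITING_FOR_AUDIO"
  | "STREAMING" | "ENDING" | "COMPLETED";

// Allowed next states for each state in the lifecycle.
const TRANSITIONS: Record<SessionState, SessionState[]> = {
  CREATED: ["LOADING_CONTEXT"],
  LOADING_CONTEXT: ["WAITING_FOR_AUDIO"],
  WAITING_FOR_AUDIO: ["STREAMING"],
  STREAMING: ["ENDING"],
  ENDING: ["COMPLETED"],
  COMPLETED: [],
};

function transition(from: SessionState, to: SessionState): SessionState {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```

Orphan cleanup then reduces to calling `transition(state, "ENDING")` on any session that has sat in STREAMING too long.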


Real-Time Push to the Dashboard

The agent dashboard connects to a WebSocket endpoint and receives a stream of typed events:

  • session_info — Session metadata on connect
  • transcript_final — New speaker-labeled transcript segment
  • ai_response — Full coaching JSON from LLM
  • lead_dossier — Business context for the current lead
  • call_state_change — State machine transitions
  • call_summary — Post-call summary on end

The frontend accumulates transcript segments and AI responses into a timeline. The latest ai_response is always surfaced prominently; older responses are collapsed into an inference history panel.
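Typing these events as a discriminated union keeps the frontend handlers exhaustive and type-safe. A sketch, using the event names above; the payload shapes are assumptions:

```typescript
type DashboardEvent =
  | { type: "session_info"; callId: string }
  | { type: "transcript_final"; speaker: "Agent" | "Customer"; text: string }
  | { type: "ai_response"; payload: unknown }
  | { type: "lead_dossier"; payload: unknown }
  | { type: "call_state_change"; from: string; to: string }
  | { type: "call_summary"; summary: string };

// Switching on the discriminant narrows the payload type in each branch.
function describe(ev: DashboardEvent): string {
  switch (ev.type) {
    case "transcript_final": return `${ev.speaker}: ${ev.text}`;
    case "call_state_change": return `${ev.from} -> ${ev.to}`;
    default: return ev.type;
  }
}
```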


Audio Recording (Optional)

For QA and training purposes, we optionally record calls to S3 as MP3 files. This is entirely fire-and-forget — it never blocks the main call flow.

Each call produces three files:

  • agent.mp3 — mono agent channel
  • customer.mp3 — mono customer channel
  • mixed.mp3 — stereo blend

PCM chunks are buffered in memory (capped at 50 MB per channel) during the call, then encoded to MP3 using lamejs and uploaded on call end.
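The per-channel cap can be sketched as a bounded buffer that drops rather than grows once the limit is reached (the drop-on-full policy and the names here are illustrative):

```typescript
class CappedPcmBuffer {
  private capBytes: number;
  private chunks: Uint8Array[] = [];
  private bytes = 0;

  constructor(capBytes = 50 * 1024 * 1024) { // 50 MB per channel
    this.capBytes = capBytes;
  }

  // Returns false when the chunk was dropped because the cap was reached,
  // so recording degrades gracefully instead of exhausting memory.
  push(chunk: Uint8Array): boolean {
    if (this.bytes + chunk.length > this.capBytes) return false;
    this.chunks.push(chunk);
    this.bytes += chunk.length;
    return true;
  }

  size(): number {
    return this.bytes;
  }
}
```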


Frontend: The Agent Dashboard

The React dashboard has two main views:

Calls List — Shows active sessions, a health indicator (DB latency), and a form to start a new call.

Live Call View — The main workspace:

  • Left panel: real-time scrolling transcript with speaker labels and timestamps
  • Right sidebar: latest AI coaching card (sentiment, intent score, objections, suggestions, compliance)
  • Bottom panel: expandable inference history (full timeline of LLM outputs)
  • Dossier panel: lead context with pinnable/hideable fields
  • Audio relay controls: toggle live audio playback with per-channel volume and mute

State is managed entirely via a single useCallSession hook — no Redux or Zustand. The hook owns the WebSocket connection, transcript accumulation, coaching response history, and call lifecycle.
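The heart of a hook like that is a pure reducer: the hook owns the WebSocket, the reducer owns the state. A sketch of what that reducer might look like — the shapes below are illustrative, not the actual hook's types:

```typescript
interface CallUiState {
  transcript: string[];
  coachingHistory: unknown[];
  callState: string;
}

type UiEvent =
  | { type: "transcript_final"; text: string }
  | { type: "ai_response"; payload: unknown }
  | { type: "call_state_change"; to: string };

// Each incoming WebSocket event folds into the UI state immutably,
// so React re-renders exactly the panels that changed.
function reduceCall(state: CallUiState, ev: UiEvent): CallUiState {
  switch (ev.type) {
    case "transcript_final":
      return { ...state, transcript: [...state.transcript, ev.text] };
    case "ai_response":
      return { ...state, coachingHistory: [...state.coachingHistory, ev.payload] };
    case "call_state_change":
      return { ...state, callState: ev.to };
  }
}
```

Keeping the reducer pure also makes the hook trivial to unit-test without mounting a component or mocking a socket.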


Key Engineering Lessons

1. Separate audio channels from the start.

Don't merge and diarize — route from the source. The accuracy difference is significant for downstream NLP.

2. Batch LLM calls aggressively.

Per-utterance inference sounds appealing but creates latency spikes and rate-limit pressure. A 15-second window gives the LLM meaningful context and keeps costs predictable.

3. Fire-and-forget non-blocking work.

S3 uploads, Redis persistence, analytics logging — none of these should touch the hot path. Use async queues or detached promises.

4. Build your own primitives before reaching for a framework.

LangChain and similar agentic frameworks add significant overhead (latency + abstraction). For a single-turn structured extraction task with hard latency requirements, a well-typed direct API call is the right choice.

5. Design for provider swappability early.

We abstracted LLM providers behind an interface from day one. Switching between a local model and a cloud model at runtime — without disrupting active calls — proved invaluable for cost and reliability management.


What's Next

  • Tool calling for dynamic dossier enrichment — let the LLM request specific lead attributes on demand rather than bulk-injecting everything
  • Streaming LLM output — partial coaching tokens streamed to the dashboard as the model generates them
  • Offline analytics agent — a separate pipeline to analyze call patterns, coaching quality, and conversion signals across sessions

Real-time AI coaching for sales isn't a moonshot anymore. With the right architecture — dual-channel ASR, tight batching, structured LLM output, and a push-based frontend — you can close the feedback loop from "customer said X" to "agent sees coaching" in under 15 seconds. The calls are happening anyway. You might as well make them smarter.