Building a Multi-Endpoint LLM Call Analysis API with Llama 3.1
Project: Sentiment Analysis / Call Analysis API
Tech Stack: Python, FastAPI, Meta Llama 3.1 8B Instruct, Docker, MongoDB
Category: LLM Applications, Call Center AI, API Design
Background
JustDial's telemarketing team needed automated AI analysis of call transcripts. The system had to evaluate agent quality across multiple dimensions: sentiment, disposition accuracy, and pitched product identification — all using an on-premise LLM (Llama 3.1 8B).
Challenge 1: Single Endpoint Monolith to Clean Multi-Endpoint API
Problem: The original code had all analysis logic jammed into one /analyze-sentiment endpoint. Adding disposition and product analysis would make it unmaintainable.
Solution: Claude refactored into a clean modular structure:
sentiment_analysis/
├── main.py
├── config.py
├── models/schemas.py
├── services/
│   ├── llm.py          # shared LLM connection
│   ├── sentiment.py
│   ├── disposition.py
│   └── product.py
└── routers/
    ├── sentiment.py
    ├── disposition.py
    └── product.py

Three distinct endpoints:
- POST /analyze-sentiment: positive/negative/neutral with confidence
- POST /analyze-disposition: checks whether the dialer disposition matches what actually happened
- POST /analyze-pitched-product: identifies which product was pitched
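Keeping the service layer as pure functions around the LLM call makes each endpoint easy to unit-test without a model. A minimal sketch of the sentiment service's helpers (function and field names here are illustrative, not the project's actual code):

```python
# Illustrative sketch of services/sentiment.py helpers (names are hypothetical).

def build_sentiment_prompt(transcription: str) -> str:
    """Build the prompt sent to Llama 3.1 for sentiment classification."""
    return (
        f"Transcription: {transcription}\n"
        "Classify the overall customer sentiment.\n"
        "Respond with:\n"
        "- sentiment: positive, negative, or neutral\n"
        "- confidence: a number between 0 and 1\n"
    )


def parse_sentiment_reply(reply: str) -> dict:
    """Parse the model's bullet-style reply into a structured result."""
    result = {"sentiment": "neutral", "confidence": 0.0}
    for line in reply.splitlines():
        line = line.strip().lstrip("- ")
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "sentiment" and value.lower() in ("positive", "negative", "neutral"):
            result["sentiment"] = value.lower()
        elif key == "confidence":
            try:
                result["confidence"] = float(value)
            except ValueError:
                pass  # keep the default if the model returns a non-numeric value
    return result
```

The router then only validates the request, calls these two functions around the shared LLM client, and returns the dict, so prompt wording and parsing stay testable in isolation.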
Challenge 2: Disposition Accuracy Verification
Problem: Agents manually mark call dispositions (e.g., "Interested", "Not Interested", "Callback") in the dialer. Management needed AI verification of whether these were marked correctly.
Solution: LLM prompt engineering to compare transcript against disposition:
prompt = f"""
Transcription: {transcription}
Dialer Disposition: {dialer_disposition}
Based on the transcription, was the disposition marked correctly?
Respond with:
- is_disposition_marked_correctly: "yes" or "no"
- correct_disposition: <what it should be>
- reason: <brief explanation>
"""Challenge 3: Extending to Call Analysis & Scoring
Problem: Management wanted more than per-dimension results: a single call quality score per agent per call, combining all three analyses. The sentiment API became the base for this broader scoring system.
Solution: The call-analysis-combined project built on top of sentiment analysis by:
- Running all three analyses in parallel
- Applying weighted scoring rubrics per analysis dimension
- Generating a final TME (Telemarketing Executive) quality score
- Storing results in MongoDB for trend analysis
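The fan-out plus weighted rubric can be sketched with asyncio.gather. The weights and stub analysis functions below are illustrative assumptions, not the production rubric:

```python
import asyncio

# Hypothetical weights per analysis dimension (the real rubric is project-specific).
WEIGHTS = {"sentiment": 0.4, "disposition": 0.35, "product": 0.25}


async def analyze_sentiment(transcription: str) -> float:
    return 0.8  # stub: would call the sentiment service / LLM


async def analyze_disposition(transcription: str) -> float:
    return 1.0  # stub: 1.0 when the disposition was marked correctly


async def analyze_product(transcription: str) -> float:
    return 0.5  # stub: 1.0 when the expected product was pitched


async def tme_quality_score(transcription: str) -> float:
    """Run all three analyses concurrently and combine them into one weighted score."""
    s, d, p = await asyncio.gather(
        analyze_sentiment(transcription),
        analyze_disposition(transcription),
        analyze_product(transcription),
    )
    score = (
        WEIGHTS["sentiment"] * s
        + WEIGHTS["disposition"] * d
        + WEIGHTS["product"] * p
    )
    return round(score, 3)


score = asyncio.run(tme_quality_score("example call transcript"))
```

Because each analysis is an independent LLM call, running them concurrently keeps per-call latency close to the slowest single analysis rather than the sum of all three.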
Challenge 4: OOM Killer Taking Down the Flask Server
Problem: The production ML server (a Flask process on port 10101) was being killed by Linux's OOM killer. This happened silently: the container would just stop responding.
Solution: Migrated from raw python app.py to Gunicorn with memory controls:
CMD ["gunicorn", "main_app:app", "-b", "0.0.0.0:10101", "-w", "4", "--timeout", "120"]Added to docker-compose.yml:
restart: always
mem_limit: 8g
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:10101/health"]
  interval: 30s
  retries: 3

With Gunicorn as PID 1, an OOM-killed worker is respawned by the Gunicorn master instead of the whole container dying; and if the master itself exits, Docker's restart: always brings the container back up.
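Gunicorn's settings can also live in a gunicorn.conf.py instead of the CMD line, which makes room for worker recycling as an extra memory control. A sketch with illustrative values (max_requests and max_requests_jitter are standard Gunicorn settings; the numbers are assumptions to tune per workload):

```python
# gunicorn.conf.py -- illustrative values, tune for your workload.
bind = "0.0.0.0:10101"
workers = 4
timeout = 120

# Recycle each worker after a bounded number of requests so slow memory
# leaks are reclaimed before the OOM killer gets involved.
max_requests = 500
max_requests_jitter = 50  # stagger restarts so all workers don't recycle at once
```

Started with `gunicorn main_app:app -c gunicorn.conf.py`, this bounds each worker's lifetime, turning a slow leak into a periodic, graceful worker restart rather than an OOM kill.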
Key Learnings
- LLM-based disposition verification can dramatically improve call center quality assurance
- Modular router/service architecture is essential for multi-endpoint ML APIs
- Gunicorn is mandatory for production Python ML services; never run with bare python app.py
- Memory limits + health checks are non-negotiable for ML containers
Session Date: Early 2026 | LLM: Meta Llama 3.1 8B Instruct
