Building a Multi-Endpoint LLM Call Analysis API with Llama 3.1

Project: Sentiment Analysis / Call Analysis API

Tech Stack: Python, FastAPI, Meta Llama 3.1 8B Instruct, Docker, MongoDB

Category: LLM Applications, Call Center AI, API Design


Background

JustDial's telemarketing team needed automated AI analysis of call transcripts. The system had to evaluate agent quality across multiple dimensions: sentiment, disposition accuracy, and pitched product identification — all using an on-premise LLM (Llama 3.1 8B).


Challenge 1: Single Endpoint Monolith to Clean Multi-Endpoint API

Problem: The original code had all analysis logic jammed into one /analyze-sentiment endpoint. Adding disposition and product analysis would make it unmaintainable.

Solution: Claude refactored it into a clean modular structure:

sentiment_analysis/
├── main.py
├── config.py
├── models/schemas.py
├── services/
│   ├── llm.py          # shared LLM connection
│   ├── sentiment.py
│   ├── disposition.py
│   └── product.py
└── routers/
    ├── sentiment.py
    ├── disposition.py
    └── product.py

Three distinct endpoints:

  • POST /analyze-sentiment — positive/negative/neutral with confidence
  • POST /analyze-disposition — checks if dialer disposition matches what actually happened
  • POST /analyze-pitched-product — identifies which product was pitched

Challenge 2: Disposition Accuracy Verification

Problem: Agents manually mark call dispositions (e.g., "Interested", "Not Interested", "Callback") in the dialer. Management needed AI verification of whether these were marked correctly.

Solution: LLM prompt engineering to compare transcript against disposition:

```python
prompt = f"""
Transcription: {transcription}
Dialer Disposition: {dialer_disposition}

Based on the transcription, was the disposition marked correctly?
Respond with:
- is_disposition_marked_correctly: "yes" or "no"
- correct_disposition: <what it should be>
- reason: <brief explanation>
"""
```

Challenge 3: Extending to Call Analysis & Scoring

Problem: The three endpoints returned independent results, but management wanted a single quality score per agent per call, with the sentiment API serving as the base of a broader call quality scoring system.

Solution: The call-analysis-combined project built on top of sentiment analysis by:

  1. Running all three analyses in parallel
  2. Applying weighted scoring rubrics per analysis dimension
  3. Generating a final TME (Telemarketing Executive) quality score
  4. Storing results in MongoDB for trend analysis
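Steps 1 and 2 can be sketched as follows. The weights and the thread-pool fan-out are illustrative assumptions; the writeup does not give the actual rubric values:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative weights per analysis dimension (the real rubric values
# are not stated in the writeup); they should sum to 1.0.
WEIGHTS = {"sentiment": 0.4, "disposition": 0.4, "product": 0.2}

def run_analyses(transcription: str, analyses: dict) -> dict:
    """Run each analysis function concurrently (step 1)."""
    with ThreadPoolExecutor(max_workers=len(analyses)) as pool:
        futures = {dim: pool.submit(fn, transcription)
                   for dim, fn in analyses.items()}
        return {dim: f.result() for dim, f in futures.items()}

def tme_score(dimension_scores: dict) -> float:
    """Weighted average over 0-100 dimension scores (steps 2-3)."""
    return sum(WEIGHTS[dim] * dimension_scores[dim] for dim in WEIGHTS)
```

Each value in analyses would be one of the three service-layer functions; persisting the resulting document to MongoDB (step 4) is then a straightforward insert.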

Challenge 4: OOM Killer Taking Down the Flask Server

Problem: The production ML server (a Flask process on port 10101) was killed by Linux's OOM killer. This happened silently: the container would just stop responding.

Solution: Migrated from raw python app.py to Gunicorn with memory controls:

```dockerfile
CMD ["gunicorn", "main_app:app", "-b", "0.0.0.0:10101", "-w", "4", "--timeout", "120"]
```

Added to docker-compose.yml:

```yaml
restart: always
mem_limit: 8g
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:10101/health"]
  interval: 30s
  retries: 3
```

Running Gunicorn as PID 1 also separates worker failures from container failures: when a single worker is OOM-killed, the Gunicorn master spawns a replacement instead of the whole process exiting, and if the master itself dies, Docker's restart: always brings the container back.
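If worker memory creeps upward over time, which is common with ML workloads, Gunicorn can also recycle workers before they ever hit the limit. A sketch of a gunicorn.conf.py with illustrative values (not the project's actual config):

```python
# gunicorn.conf.py -- illustrative values, not the project's actual config.
bind = "0.0.0.0:10101"
workers = 4
timeout = 120

# Recycle each worker after ~500 requests so slow memory leaks never
# accumulate to the point of an OOM kill; the jitter staggers restarts
# so all workers don't recycle at once.
max_requests = 500
max_requests_jitter = 50
```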


Key Learnings

  • LLM-based disposition verification can dramatically improve call center quality assurance
  • Modular router/service architecture is essential for multi-endpoint ML APIs
  • Gunicorn is mandatory for production Python ML services — never run with bare python
  • Memory limits + health checks are non-negotiable for ML containers

Session Date: Early 2026 | LLM: Meta Llama 3.1 8B Instruct