Deploying Llama 8B as a FastAPI Inference Server
Project: Sarvam Inference — Llama 8B
Tech Stack: Python, FastAPI, uvicorn, HuggingFace Transformers, CUDA
Category: LLM Deployment, API Development
Background
Deployed Meta's Llama 3.1 8B Instruct as a self-hosted FastAPI inference server to serve as a backend for various internal tools. Documented the full API and got it working end-to-end.
Challenge 1: FastAPI App Exits Silently When Run with `python`
Problem:
python server.py
# (exits immediately, no output)
Root cause: FastAPI apps need an ASGI server such as uvicorn to serve requests. Running the file with `python` executes the module, which only defines the app; nothing starts a server unless the file has a `__main__` block that does so.
Solution: Added the main block:
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Now both work:
python server.py        # uses __main__ block
uvicorn server:app ...  # direct uvicorn invocation
Challenge 2: API Documentation and Postman-Ready Samples
Problem: Team members needed to call the API from Postman but had no documentation.
Solution: Generated API_DOCS.md with full curl examples:
# Health Check
curl -X GET http://localhost:8000/health
# Basic inference
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "What is the capital of France?"}]
}'
# With system prompt
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a Python function to reverse a string."}
]
}'
Challenge 3: Server Running on Different IP than Docs
Problem: The server was running on 10.10.0.10:5001 (internal network IP) but all curl examples in the docs used localhost:8000.
Solution: Updated API_DOCS.md with the correct host/port after the team confirmed the production endpoint.
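One way to avoid this kind of drift in client code is to read the host and port from configuration instead of hardcoding them. The following is a minimal sketch using only the Python standard library. The environment variable name `LLAMA_API_BASE` and the helper names are hypothetical and not part of the original project; only the `/generate` path and the `messages` request body come from the curl examples.

```python
import json
import os
import urllib.request
from typing import Dict, List, Optional

# Hypothetical environment variable name; the original docs do not define one.
BASE_URL_ENV = "LLAMA_API_BASE"
DEFAULT_BASE_URL = "http://localhost:8000"


def get_base_url() -> str:
    """Return the server base URL from the environment, without a trailing slash."""
    return os.environ.get(BASE_URL_ENV, DEFAULT_BASE_URL).rstrip("/")


def build_generate_request(
    messages: List[Dict[str, str]], base_url: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /generate endpoint."""
    root = (base_url or get_base_url()).rstrip("/")
    body = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{root}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(messages: List[Dict[str, str]], timeout: float = 60.0):
    """Send the request and return the decoded JSON body.

    The response schema is not documented here, so it is returned as-is.
    """
    request = build_generate_request(messages)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With this shape, pointing a client at the internal server would only require setting `LLAMA_API_BASE=http://10.10.0.10:5001` before running it, with no change to the code.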
Validation
Successful API response confirmed in Postman:
- Status: 200 OK
- Latency: 3.04 seconds
- Response: The model correctly added a docstring to a `reverse_string` function, confirming that multi-turn conversation context was working
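A multi-turn check like the one above can be reproduced by resending the earlier turns along with the new question. The sketch below only builds the request body. It assumes the server is stateless and accepts `assistant` messages in the `messages` list, which the original docs do not confirm; the assistant text is a placeholder rather than real model output, and `build_follow_up` is a hypothetical helper.

```python
import json


def build_follow_up(history, user_message):
    """Return a new message list: the prior turns plus one new user turn."""
    return [*history, {"role": "user", "content": user_message}]


# Hypothetical earlier exchange; the assistant content is a placeholder.
history = [
    {"role": "user", "content": "Write a Python function to reverse a string."},
    {"role": "assistant", "content": "def reverse_string(s):\n    return s[::-1]"},
]

messages = build_follow_up(history, "Add a docstring to that function.")

# JSON body suitable for pasting into Postman or passing to curl -d.
payload = json.dumps({"messages": messages}, indent=2)
print(payload)
```

Because the helper returns a new list, the stored `history` is left unchanged until the caller decides to append the model's reply.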
Key Learnings
- Always add `if __name__ == "__main__"` to FastAPI apps to support both direct and uvicorn execution
- Postman-ready curl examples in docs save significant onboarding time
- A separate health check endpoint (`/health`) is essential for Docker healthchecks and load balancers
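As an illustration of the health check point, here is a minimal probe sketch using only the Python standard library. It assumes that `/health` answers a plain GET with HTTP 200 when the server is up, as the curl example suggests; the function names, the two-second timeout, and the default URL are illustrative choices, not part of the original project.

```python
import urllib.error
import urllib.request


def check_health(url: str, timeout: float = 2.0) -> bool:
    """Return True only if GET <url> answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP error responses
        # (urllib raises HTTPError, a URLError subclass, for 4xx/5xx).
        return False


def main(url: str = "http://localhost:8000/health") -> int:
    """Map the probe result to a process exit code: 0 healthy, 1 unhealthy."""
    return 0 if check_health(url) else 1
```

A container healthcheck or a monitoring job could run a small script that ends with `sys.exit(main())`, so that a non-zero exit code marks the server as unhealthy.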
Session Date: Early 2026 | Model: Llama 3.1 8B Instruct | Host: 10.10.0.10:5001
