Deploying Llama 8B as a FastAPI Inference Server

Project: Sarvam Inference — Llama 8B

Tech Stack: Python, FastAPI, uvicorn, HuggingFace Transformers, CUDA

Category: LLM Deployment, API Development


Background

Deployed Meta's Llama 3.1 8B Instruct as a self-hosted FastAPI inference server to back several internal tools, documented the full API, and verified it end-to-end.


Challenge 1: FastAPI App Exits Silently When Run with `python`

Problem:

bash
python server.py
# (exits immediately, no output)

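Root cause: A FastAPI app is only an ASGI application object and needs an ASGI server such as uvicorn to serve it. Running the file directly with python defines the app and its routes, reaches the end of the file, and exits, because nothing starts a server unless an if __name__ == "__main__" block does.

Solution: Added a __main__ block to server.py: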

python
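if __name__ == "__main__":
    # Runs only for `python server.py`; skipped when uvicorn imports server:app.
    import uvicorn

    # 0.0.0.0 listens on all network interfaces, so other machines can connect.
    uvicorn.run(app, host="0.0.0.0", port=8000)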

Now both work:

bash
python server.py          # uses __main__ block
uvicorn server:app ...    # direct uvicorn invocation

Challenge 2: API Documentation and Postman-Ready Samples

Problem: Team members needed to call the API from Postman but had no documentation.

Solution: Generated API_DOCS.md with full curl examples:

bash
# Health Check
curl -X GET http://localhost:8000/health

# Basic inference
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

# With system prompt
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a Python function to reverse a string."}
    ]
  }'
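
For callers who prefer a script to curl or Postman, the same requests can be sketched in Python using only the standard library. This is a minimal sketch and not part of API_DOCS.md. The helper names build_generate_request and generate are hypothetical, the base URL is copied from the curl examples above, and the response is only assumed to be JSON, since its schema is not shown in this write-up.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host/port as the curl examples


def build_generate_request(messages, base_url=BASE_URL):
    """Build the POST /generate request without sending it."""
    body = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(messages, base_url=BASE_URL, timeout=60):
    """Send the request and return the parsed JSON response.

    Assumption: the server replies with a JSON body. Its fields are not
    documented here, so the parsed object is returned unchanged.
    """
    request = build_generate_request(messages, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Passing a list that starts with a system message and ends with a user message to generate() mirrors the second curl example.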

Challenge 3: Server Running on a Different Host and Port than the Docs

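Problem: The server was running on 10.10.0.10:5001 (an internal network address), but every curl example in the docs used localhost:8000. On a teammate's machine, localhost refers to that machine rather than the server, so examples copied from the docs would not reach it.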

Solution: Updated API_DOCS.md with the correct host/port after the team confirmed the production endpoint.
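
One way to keep doc examples from drifting away from the real endpoint is to render them from a single configurable base URL. The sketch below illustrates that idea and is not how API_DOCS.md was actually produced. The environment variable INFERENCE_BASE_URL and the helpers base_url and curl_example are hypothetical names chosen for this example.

```python
import json
import os
import shlex

DEFAULT_BASE_URL = "http://localhost:8000"


def base_url(env=None):
    """Return the server base URL, preferring INFERENCE_BASE_URL if it is set.

    INFERENCE_BASE_URL is a hypothetical variable name used for this sketch.
    """
    env = os.environ if env is None else env
    return env.get("INFERENCE_BASE_URL", DEFAULT_BASE_URL).rstrip("/")


def curl_example(path, payload=None, env=None):
    """Render a one-line curl command against the configured base URL."""
    url = f"{base_url(env)}{path}"
    if payload is None:
        return f"curl -X GET {url}"
    body = json.dumps(payload)
    return (
        f"curl -X POST {url} "
        f'-H "Content-Type: application/json" '
        f"-d {shlex.quote(body)}"  # shell-safe quoting of the JSON body
    )
```

With INFERENCE_BASE_URL set to http://10.10.0.10:5001, curl_example("/health") renders the health check against the internal endpoint instead of localhost.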


Validation

Successful API response confirmed in Postman:

  • Status: 200 OK
  • Latency: 3.04 seconds
  • Response: The model correctly added a docstring to the reverse_string function from an earlier turn, which confirms that multi-turn conversation context was working

Key Learnings

  • Add an if __name__ == "__main__" block that calls uvicorn.run() to FastAPI apps so they start with both python server.py and uvicorn server:app
  • Postman-ready curl examples in docs save significant onboarding time
  • A separate health check endpoint (/health) is essential for Docker healthchecks and load balancers
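
The health check point can be illustrated with a small probe that turns the /health response into a pass/fail result. This is a sketch under assumptions: is_healthy and main are hypothetical helpers, and the only behavior assumed of /health is that it returns HTTP 200 when the server is up, as in the curl example.

```python
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True if GET <url> answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP 4xx/5xx status.
        return False


def main(argv):
    """Return 0 when healthy and 1 otherwise, the usual probe convention."""
    url = argv[1] if len(argv) > 1 else "http://localhost:8000/health"
    return 0 if is_healthy(url) else 1
```

Calling sys.exit(main(sys.argv)) from a __main__ block would turn this into a command whose exit code a container healthcheck or monitoring script can read.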

Session Date: Early 2026 | Model: Llama 3.1 8B Instruct | Host: 10.10.0.10:5001