Background

We needed to deploy Sarvam-M, an Indian multilingual LLM, on an NVIDIA H100 GPU for production inference.

Challenge 1: H100-Specific Optimizations

The baseline Hugging Face code ran, but it left H100 performance on the table: the model loaded in the default dtype with the default (eager) attention implementation.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,               # H100 tensor cores run bf16 natively
    device_map="cuda:0",                      # pin the model to the single H100
    attn_implementation="flash_attention_2",  # use FlashAttention 2 kernels
)
```
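As a side note on why `torch.bfloat16` (rather than `float16`) is the safer dtype here: bf16 keeps fp32's 8-bit exponent, so activations that would overflow fp16's maximum finite value (65504) stay finite. A quick CPU-only illustration:

```python
import torch

x = torch.tensor([70000.0])   # beyond fp16's max finite value (65504)
print(x.to(torch.float16))    # overflows to inf
print(x.to(torch.bfloat16))   # stays finite, at reduced precision
```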

Challenge 2: Silent Exit with `python server.py`

Running bare `python server.py` imports the module, defines the app, and exits. A FastAPI app never serves requests unless you either launch it externally with `uvicorn server:app` or add a `__main__` block that calls `uvicorn.run`.
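A minimal sketch of the `__main__`-block fix (hypothetical file layout; assumes the FastAPI instance in `server.py` is named `app`):

```python
# server.py
import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    return {"status": "ok"}

# Without this block, `python server.py` merely defines `app` and exits.
# The alternative is launching externally: `uvicorn server:app --port 8000`.
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```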

Challenge 3: flash-attn Build Failure

flash-attn's setup script imports torch at build time, so torch must already be installed, and `--no-build-isolation` makes the installer build against the existing environment rather than an isolated build environment (which would not contain torch):

```bash
uv pip install "torch>=2.3.0"   # quote so the shell doesn't treat >= as a redirect
uv pip install flash-attn --no-build-isolation
```

Key Learnings

  • H100s require explicit optimization flags (`torch.bfloat16`, `flash_attention_2`); the defaults leave performance on the table
  • Install packages that compile against their dependencies (e.g. flash-attn against torch) only after those dependencies are in place
  • uv is significantly faster than pip for ML environments