Background
We needed to deploy Sarvam-M, an Indian multilingual LLM, on an NVIDIA H100 GPU for production inference.
Challenge 1: H100-Specific Optimizations
The baseline Hugging Face code ran, but it was not fully leveraging the H100: bfloat16 weights and FlashAttention-2 have to be enabled explicitly.
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
)
```
Challenge 2: Silent Exit with `python server.py`
A FastAPI app must be launched through uvicorn (or a `__main__` block that calls it); running bare `python server.py` just defines the app object and exits without serving anything.
Challenge 3: flash-attn Build Failure
Install torch first, then flash-attn:
```bash
uv pip install "torch>=2.3.0"   # quote the specifier so ">" is not treated as a shell redirect
uv pip install flash-attn --no-build-isolation
```
Key Learnings
- H100s require explicit optimization flags
- Always install compilation-dependent packages after their dependencies
- `uv` is significantly faster than pip for ML environments
