Background
We needed to deploy Sarvam-M, an Indian multilingual LLM, on an NVIDIA H100 GPU for production inference.
Challenge 1: H100-Specific Optimizations
The baseline Hugging Face code ran, but it was not fully leveraging the H100: bfloat16 weights and FlashAttention-2 have to be enabled explicitly.
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
)
```
Challenge 2: Silent Exit with `python server.py`
A FastAPI app must be launched through uvicorn (or a `__main__` block that calls it); running bare `python server.py` just defines the app object and exits without serving anything.
Challenge 3: flash-attn Build Failure
Install torch first, then flash-attn:
```bash
uv pip install "torch>=2.3.0"   # quote the specifier so ">" is not treated as a shell redirect
uv pip install flash-attn --no-build-isolation
```
Key Learnings
- H100s require explicit optimization flags
- Always install compilation-dependent packages after their dependencies
- `uv` is significantly faster than pip for ML environments
