Running Sarvam-M LLM on NVIDIA H100: From Zero to Inference
Project: Sarvam Inference
Tech Stack: Python, PyTorch, HuggingFace Transformers, FlashAttention-2, uv, FastAPI
Category: LLM Deployment, GPU Optimization
Background
We needed to deploy Sarvam-M, an Indian multilingual LLM from SarvamAI, on an NVIDIA H100 GPU for production inference. What seemed like a simple model-loading task turned into a series of environment and optimization challenges.
Challenge 1: H100-Specific Optimizations Were Not Obvious
Problem: The base HuggingFace code ran, but it wasn't leveraging the H100's full capabilities — no FlashAttention-2, wrong dtype, suboptimal memory usage.
Solution: Claude identified the exact H100 optimizations:
- Switched to `torch.bfloat16` — the H100's native dtype, better than `float16`
- Added `attn_implementation="flash_attention_2"` — the H100's HBM3 bandwidth makes FlashAttention-2 ~50% faster
- Pinned `device_map="cuda:0"` instead of an automatic split across devices
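The flags above can be collected into a small helper so every loading path uses the same settings. This is a sketch of our own, not part of the project's code; the function name is ours, and the string dtype relies on Transformers accepting dtype names as strings (`torch.bfloat16` works equally well).

```python
def h100_load_kwargs(use_flash_attn: bool = True) -> dict:
    """Hypothetical helper: kwargs for AutoModelForCausalLM.from_pretrained tuned for an H100."""
    kwargs = {
        # Transformers accepts dtype names as strings, so torch need not be
        # imported just to build this dict; torch.bfloat16 works equally well.
        "torch_dtype": "bfloat16",
        # Pin to a single GPU rather than letting device_map="auto" shard the model.
        "device_map": "cuda:0",
    }
    if use_flash_attn:
        # Requires the flash-attn package to be installed (see Challenge 3).
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs
```

Usage would then be `AutoModelForCausalLM.from_pretrained(model_name, **h100_load_kwargs())`, with a single place to toggle FlashAttention-2 off on hardware that lacks it.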
```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
)
```

Challenge 2: `python` vs `python3` and Missing `__main__` Block
Problem: Running `python server.py` exited silently — no output, no error. The server just didn't start.
Root cause: Two issues combined:
- The system had `python3`, not `python`
- The FastAPI app had no `if __name__ == "__main__":` block, so running it as a script did nothing
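The second issue comes down to Python's `__name__` semantics, which a minimal standalone sketch (not the real server) makes concrete:

```python
# server_demo.py — illustrative only, not the actual FastAPI server.
# Run as a script (`python3 server_demo.py`), __name__ is "__main__" and the
# guarded block executes; imported by a runner such as `uvicorn server:app`,
# __name__ is the module name instead, and the block is skipped.

def main() -> str:
    """Stand-in for the code that would start the server."""
    return "starting uvicorn"

if __name__ == "__main__":
    print(main())
```

Without the guard, executing the file merely defines the app object at module level and then exits, which is exactly the silent failure we saw.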
Solution:
```python
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

Challenge 3: `flash-attn` Build Failure
Problem: Installing `flash-attn` failed with a cryptic build error when using `uv pip install -r requirements.txt`.
Root cause: `flash-attn` compiles C++ CUDA kernels at install time and requires `torch` to already be present in the environment.
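A quick pre-flight check makes this failure mode visible before (and after) the build. The helper below is our own sketch; it uses `importlib.util.find_spec`, which reports whether a package is importable without actually importing it, so it needs no GPU:

```python
import importlib.util

def build_prereqs_present() -> dict:
    """Hypothetical check: is torch importable before building flash-attn,
    and is flash_attn importable after installation?"""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in ("torch", "flash_attn")
    }
```

If `torch` comes back `False`, installing `flash-attn` will fail at build time no matter what flags are passed.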
Solution: Install in two steps:
```shell
uv pip install "torch>=2.3.0" "transformers>=4.45.0" "accelerate>=0.34.0"
uv pip install flash-attn --no-build-isolation
```

Note the quotes around the version specifiers: unquoted, the shell treats `>` as output redirection.

Challenge 4: `uv` Not Available Without Admin Rights
Problem: `snap install astral-uv` required administrator privileges on the JustDial VM.
Solution: Install to home directory without sudo:
```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```

Key Learnings
- H100s require explicit optimization flags — defaults leave ~50% performance on the table
- Always install compilation-dependent packages after their dependencies
- `uv` is a significantly faster alternative to pip for managing ML environments
- FlashAttention-2 is a must-have for any H100 LLM deployment
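The dtype learning is easy to quantify: halving bytes per parameter halves the weight footprint. A back-of-the-envelope sketch (the 7B parameter count below is illustrative only, not Sarvam-M's actual size):

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate GiB needed just for model weights at a given precision."""
    return num_params * bytes_per_param / 1024**3

# Placeholder model size — not a claim about Sarvam-M.
params = 7e9
fp32 = weight_memory_gib(params, 4)  # float32: 4 bytes per parameter
bf16 = weight_memory_gib(params, 2)  # bfloat16: 2 bytes per parameter
print(f"fp32: {fp32:.1f} GiB, bf16: {bf16:.1f} GiB")
```

Activations, KV cache, and CUDA context add to this, but weights dominate, which is why `bfloat16` (rather than `float32`) is the baseline choice on an 80 GB H100.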
Session Date: Early 2026 | Model: Sarvam-M (sarvamai/sarvam-m)
