Running Sarvam-M LLM on NVIDIA H100: From Zero to Inference
Project: Sarvam Inference
Tech Stack: Python, PyTorch, HuggingFace Transformers, FlashAttention-2, uv, FastAPI
Category: LLM Deployment, GPU Optimization
Background
We needed to deploy Sarvam-M, an Indian multilingual LLM from SarvamAI, on an NVIDIA H100 GPU for production inference. What seemed like a simple model-loading task turned into a series of environment and optimization challenges.
Challenge 1: H100-Specific Optimizations Were Not Obvious
Problem: The base HuggingFace code ran, but it wasn't leveraging the H100's full capabilities — no FlashAttention-2, wrong dtype, suboptimal memory usage.
Solution: Claude identified the exact H100 optimizations:
- Switched to `torch.bfloat16` — the H100's native dtype, better than `float16`
- Added `attn_implementation="flash_attention_2"` — the H100's HBM3 bandwidth makes FlashAttention-2 ~50% faster
- Pinned `device_map="cuda:0"` instead of an automatic split across devices
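The flags above can be collected into a small helper so every loading path uses the same settings. This is a sketch of our own, not part of the project's code; the function name is ours, and the string dtype relies on Transformers accepting dtype names as strings (`torch.bfloat16` works equally well).

```python
def h100_load_kwargs(use_flash_attn: bool = True) -> dict:
    """Hypothetical helper: kwargs for AutoModelForCausalLM.from_pretrained tuned for an H100."""
    kwargs = {
        # Transformers accepts dtype names as strings, so torch need not be
        # imported just to build this dict; torch.bfloat16 works equally well.
        "torch_dtype": "bfloat16",
        # Pin to a single GPU rather than letting device_map="auto" shard the model.
        "device_map": "cuda:0",
    }
    if use_flash_attn:
        # Requires the flash-attn package to be installed (see Challenge 3).
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs
```

Usage would then be `AutoModelForCausalLM.from_pretrained(model_name, **h100_load_kwargs())`, with a single place to toggle FlashAttention-2 off on hardware that lacks it.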
```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
)
```

Challenge 2: `python` vs `python3` and Missing `__main__` Block
Problem: Running `python server.py` exited silently — no output, no error. The server just didn't start.
Root cause: Two issues combined:
- The system had `python3`, not `python`
- The FastAPI app had no `if __name__ == "__main__":` block, so running it as a script did nothing
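The second issue comes down to Python's `__name__` semantics, which a minimal standalone sketch (not the real server) makes concrete:

```python
# server_demo.py — illustrative only, not the actual FastAPI server.
# Run as a script (`python3 server_demo.py`), __name__ is "__main__" and the
# guarded block executes; imported by a runner such as `uvicorn server:app`,
# __name__ is the module name instead, and the block is skipped.

def main() -> str:
    """Stand-in for the code that would start the server."""
    return "starting uvicorn"

if __name__ == "__main__":
    print(main())
```

Without the guard, executing the file merely defines the app object at module level and then exits, which is exactly the silent failure we saw.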
Solution:
```python
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

Challenge 3: `flash-attn` Build Failure
Problem: Installing `flash-attn` failed with a cryptic build error when using `uv pip install -r requirements.txt`.
Root cause: `flash-attn` compiles C++ CUDA kernels at install time and requires `torch` to already be present in the environment.
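A quick pre-flight check makes this failure mode visible before (and after) the build. The helper below is our own sketch; it uses `importlib.util.find_spec`, which reports whether a package is importable without actually importing it, so it needs no GPU:

```python
import importlib.util

def build_prereqs_present() -> dict:
    """Hypothetical check: is torch importable before building flash-attn,
    and is flash_attn importable after installation?"""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in ("torch", "flash_attn")
    }
```

If `torch` comes back `False`, installing `flash-attn` will fail at build time no matter what flags are passed.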
Solution: Install in two steps:
```shell
uv pip install "torch>=2.3.0" "transformers>=4.45.0" "accelerate>=0.34.0"
uv pip install flash-attn --no-build-isolation
```

Note the quotes around the version specifiers: unquoted, the shell treats `>` as output redirection.

Challenge 4: `uv` Not Available Without Admin Rights
Problem: `snap install astral-uv` required administrator privileges on the JustDial VM.
Solution: Install to home directory without sudo:
```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```

Key Learnings
- H100s require explicit optimization flags — defaults leave ~50% performance on the table
- Always install compilation-dependent packages after their dependencies
- `uv` is a significantly faster alternative to pip for managing ML environments
- FlashAttention-2 is a must-have for any H100 LLM deployment
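The dtype learning is easy to quantify: halving bytes per parameter halves the weight footprint. A back-of-the-envelope sketch (the 7B parameter count below is illustrative only, not Sarvam-M's actual size):

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate GiB needed just for model weights at a given precision."""
    return num_params * bytes_per_param / 1024**3

# Placeholder model size — not a claim about Sarvam-M.
params = 7e9
fp32 = weight_memory_gib(params, 4)  # float32: 4 bytes per parameter
bf16 = weight_memory_gib(params, 2)  # bfloat16: 2 bytes per parameter
print(f"fp32: {fp32:.1f} GiB, bf16: {bf16:.1f} GiB")
```

Activations, KV cache, and CUDA context add to this, but weights dominate, which is why `bfloat16` (rather than `float32`) is the baseline choice on an 80 GB H100.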
Session Date: Early 2026 | Model: Sarvam-M (sarvamai/sarvam-m)
