Transcription at Scale: H100 Optimizations and Switching from AWS to Local Storage
Project: Transcription All
Tech Stack: Python, FastAPI, Whisper/HuggingFace, FlashAttention-2, Docker, ffmpeg
Category: ASR, MLOps, Infrastructure
Background
Running transcription at scale on an H100 GPU required both model optimizations and infrastructure decisions around where to store output audio files.
Challenge 1: Understanding What Makes H100 Transcription Fast
Problem: Two versions of the transcription code existed. The newer transcription_h100_1 version was significantly faster but the team didn't know why.
Solution: Claude analyzed both files and identified 4 key differences:
| Optimization | Baseline | H100 Version |
|---|---|---|
| Attention | Default | flash_attention_2 |
| Model loading | pipeline() | Direct AutoModel |
| Dtype | float32 | torch.float16 |
| Batch size | 1 | Dynamic batching |
FlashAttention-2 was the biggest win: combined with the H100's HBM3 memory bandwidth, it cuts attention memory use by roughly half and significantly boosts throughput compared to standard attention.
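The table's dynamic-batching row is about grouping audio chunks so each forward pass fills the GPU instead of running one chunk at a time. A minimal sketch of the idea (the `batch_by_duration` helper and the 30-second cap are illustrative assumptions, not the project's actual code):

```python
def batch_by_duration(chunks, max_seconds=30.0):
    """Group (chunk_id, duration) pairs so each batch stays under max_seconds.

    Illustrative sketch only — the real service's batching logic may differ.
    """
    batches, current, total = [], [], 0.0
    for chunk_id, duration in chunks:
        # Start a new batch when adding this chunk would exceed the cap
        if current and total + duration > max_seconds:
            batches.append(current)
            current, total = [], 0.0
        current.append(chunk_id)
        total += duration
    if current:
        batches.append(current)
    return batches
```

Each returned batch can then be padded to a common length and run through the model in a single call, which is where the throughput gain over `batch_size=1` comes from.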
Direct model loading vs. pipeline():
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, pipeline

# Slower — pipeline() adds overhead
pipe = pipeline("automatic-speech-recognition", model=model_name)

# Faster — direct model control with fp16 and FlashAttention-2
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
```

Challenge 2: AWS Dependency in Development Environment
Problem: The diarize_only folder code was storing output audio files in AWS S3, which created a dependency on AWS credentials, network latency, and cost during development.
Solution: Switched to local storage with static file serving:
- Removed `boto3` from `requirements.txt`
- Replaced the S3 upload with a local file write:

```python
from pathlib import Path

OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(exist_ok=True)
output_path = OUTPUT_DIR / f"{job_id}.wav"
```

- Mounted `outputs/` as a Docker volume for persistence
- Updated `.env` to replace the `AWS_*` vars with `SERVER_IP`
- Response now returns `http://{SERVER_IP}/outputs/{job_id}.wav`
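Putting those pieces together, the local-storage path can be sketched as one small helper (the name `save_and_link` and the fallback IP are assumptions for illustration, not the project's actual code):

```python
import os
from pathlib import Path

OUTPUT_DIR = Path("outputs")

def save_and_link(job_id: str, wav_bytes: bytes) -> str:
    """Write the WAV locally and return the URL a client can fetch it from."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    output_path = OUTPUT_DIR / f"{job_id}.wav"
    output_path.write_bytes(wav_bytes)
    server_ip = os.environ.get("SERVER_IP", "127.0.0.1")  # from .env
    return f"http://{server_ip}/outputs/{job_id}.wav"
```

With FastAPI, the `outputs/` directory could then be exposed via `app.mount("/outputs", StaticFiles(directory="outputs"))` so the returned URL resolves without S3.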
Challenge 3: Docker Volume for Output Persistence
Problem: Transcription outputs were lost on container restart because they were written inside the container filesystem.
Solution: Added volume mount to docker-compose.yml:
```yaml
services:
  transcription:
    volumes:
      - ./outputs:/app/outputs
```

Key Learnings
- FlashAttention-2 is consistently the highest-impact optimization for H100 transformer inference
- Local storage + static serving is simpler and faster than S3 for development and intranet deployments
- Docker volume mounts for ML outputs are essential — containers are ephemeral, outputs are not
Session Date: Early 2026 | Hardware: NVIDIA H100 | Model: Whisper/HuggingFace ASR
