Transcription at Scale: H100 Optimizations and Switching from AWS to Local Storage

Project: Transcription All

Tech Stack: Python, FastAPI, Whisper/HuggingFace, FlashAttention-2, Docker, ffmpeg

Category: ASR, MLOps, Infrastructure


Background

Running transcription at scale on an H100 GPU required both model optimizations and infrastructure decisions around where to store output audio files.


Challenge 1: Understanding What Makes H100 Transcription Fast

Problem: Two versions of the transcription code existed. The newer transcription_h100_1 version was significantly faster, but the team didn't know why.

Solution: Claude analyzed both files and identified 4 key differences:

Optimization     Baseline      H100 Version
Attention        Default       flash_attention_2
Model loading    pipeline()    Direct AutoModel
Dtype            float32       torch.float16
Batch size       1             Dynamic batching

FlashAttention-2 was the biggest win. On the H100's HBM3 memory bandwidth, FlashAttention-2 cuts attention memory use by roughly half and significantly boosts throughput compared to standard attention.

Direct model loading vs. pipeline():

```python
# Slower — pipeline() adds overhead
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model=model_name)

# Faster — direct model control
import torch
from transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
```
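The "dynamic batching" row in the table can be approximated by grouping audio chunks under both a count limit and a total-duration budget before each GPU pass. A minimal sketch follows; the `max_batch` and `max_seconds` limits are illustrative assumptions, not the project's actual heuristic:

```python
def dynamic_batches(durations, max_batch=8, max_seconds=120.0):
    """Group audio chunk durations (in seconds) into batches bounded by
    both a chunk-count limit and a total-duration budget."""
    batch, total = [], 0.0
    for d in durations:
        # Start a new batch when adding this chunk would exceed a limit.
        if batch and (len(batch) >= max_batch or total + d > max_seconds):
            yield batch
            batch, total = [], 0.0
        batch.append(d)
        total += d
    if batch:
        yield batch
```

For example, five 30-second chunks with `max_batch=3` group into a batch of three and a batch of two, keeping the GPU fed with full batches instead of single files.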

Challenge 2: AWS Dependency in Development Environment

Problem: The diarize_only folder code was storing output audio files in AWS S3, which created a dependency on AWS credentials, network latency, and cost during development.

Solution: Switched to local storage with static file serving:

  1. Removed boto3 from requirements.txt
  2. Replaced the S3 upload with a local file write:

```python
from pathlib import Path

OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(exist_ok=True)
output_path = OUTPUT_DIR / f"{job_id}.wav"
```

  3. Mounted outputs/ as a Docker volume for persistence
  4. Updated .env to replace the AWS_* vars with SERVER_IP
  5. The response now returns http://{SERVER_IP}/outputs/{job_id}.wav
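Putting the steps above together, the write-and-return path can be sketched as a single helper. This is a stdlib-only sketch, assuming the outputs/ directory is served statically (e.g. via FastAPI's StaticFiles mount) and that SERVER_IP comes from the environment; the `write_output` name is hypothetical:

```python
import os
from pathlib import Path

OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(exist_ok=True)

def write_output(job_id: str, audio_bytes: bytes) -> str:
    """Write the audio locally and return the public URL.

    The outputs/ directory is served statically (e.g. FastAPI's
    StaticFiles mounted at /outputs), so the URL resolves without S3.
    """
    output_path = OUTPUT_DIR / f"{job_id}.wav"
    output_path.write_bytes(audio_bytes)
    server_ip = os.environ.get("SERVER_IP", "localhost")
    return f"http://{server_ip}/outputs/{job_id}.wav"
```

With this in place, swapping storage backends later only means changing this one helper, not every endpoint.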

Challenge 3: Docker Volume for Output Persistence

Problem: Transcription outputs were lost on container restart because they were written inside the container filesystem.

Solution: Added volume mount to docker-compose.yml:

```yaml
services:
  transcription:
    volumes:
      - ./outputs:/app/outputs
```

Key Learnings

  • FlashAttention-2 is consistently the highest-impact optimization for H100 transformer inference
  • Local storage + static serving is simpler and faster than S3 for development and intranet deployments
  • Docker volume mounts for ML outputs are essential — containers are ephemeral, outputs are not

Session Date: Early 2026 | Hardware: NVIDIA H100 | Model: Whisper/HuggingFace ASR