Transcription at Scale: H100 Optimizations and Switching from AWS to Local Storage

Project: Transcription All

Tech Stack: Python, FastAPI, Whisper/HuggingFace, FlashAttention-2, Docker, ffmpeg

Category: ASR, MLOps, Infrastructure


Background

Running transcription at scale on an H100 GPU required both model optimizations and infrastructure decisions around where to store output audio files.


Challenge 1: Understanding What Makes H100 Transcription Fast

Problem: Two versions of the transcription code existed. The newer transcription_h100_1 version was significantly faster, but the team didn't know why.

Solution: Claude analyzed both files and identified 4 key differences:

Optimization     Baseline      H100 Version
Attention        Default       flash_attention_2
Model loading    pipeline()    Direct AutoModel
Dtype            float32       torch.float16
Batch size       1             Dynamic batching

FlashAttention-2 was the biggest win. On the H100's HBM3 memory bandwidth, FlashAttention-2 cuts attention memory use by roughly half and significantly boosts throughput compared to standard attention.

Direct model loading vs. pipeline():

```python
# Slower — pipeline() adds overhead
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model=model_name)

# Faster — direct model control
import torch
from transformers import AutoModelForSpeechSeq2Seq

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
```
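The "dynamic batching" row in the table can be approximated by grouping audio chunks under both a count limit and a total-duration budget before each GPU pass. A minimal sketch follows; the `max_batch` and `max_seconds` limits are illustrative assumptions, not the project's actual heuristic:

```python
def dynamic_batches(durations, max_batch=8, max_seconds=120.0):
    """Group audio chunk durations (in seconds) into batches bounded by
    both a chunk-count limit and a total-duration budget."""
    batch, total = [], 0.0
    for d in durations:
        # Start a new batch when adding this chunk would exceed a limit.
        if batch and (len(batch) >= max_batch or total + d > max_seconds):
            yield batch
            batch, total = [], 0.0
        batch.append(d)
        total += d
    if batch:
        yield batch
```

For example, five 30-second chunks with `max_batch=3` group into a batch of three and a batch of two, keeping the GPU fed with full batches instead of single files.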

Challenge 2: AWS Dependency in Development Environment

Problem: The diarize_only folder code was storing output audio files in AWS S3, which created a dependency on AWS credentials, network latency, and cost during development.

Solution: Switched to local storage with static file serving:

  1. Removed boto3 from requirements.txt
  2. Replaced the S3 upload with a local file write:

```python
from pathlib import Path

OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(exist_ok=True)
output_path = OUTPUT_DIR / f"{job_id}.wav"
```

  3. Mounted outputs/ as a Docker volume for persistence
  4. Updated .env to replace the AWS_* vars with SERVER_IP
  5. The response now returns http://{SERVER_IP}/outputs/{job_id}.wav
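Putting the steps above together, the write-and-return path can be sketched as a single helper. This is a stdlib-only sketch, assuming the outputs/ directory is served statically (e.g. via FastAPI's StaticFiles mount) and that SERVER_IP comes from the environment; the `write_output` name is hypothetical:

```python
import os
from pathlib import Path

OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(exist_ok=True)

def write_output(job_id: str, audio_bytes: bytes) -> str:
    """Write the audio locally and return the public URL.

    The outputs/ directory is served statically (e.g. FastAPI's
    StaticFiles mounted at /outputs), so the URL resolves without S3.
    """
    output_path = OUTPUT_DIR / f"{job_id}.wav"
    output_path.write_bytes(audio_bytes)
    server_ip = os.environ.get("SERVER_IP", "localhost")
    return f"http://{server_ip}/outputs/{job_id}.wav"
```

With this in place, swapping storage backends later only means changing this one helper, not every endpoint.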

Challenge 3: Docker Volume for Output Persistence

Problem: Transcription outputs were lost on container restart because they were written inside the container filesystem.

Solution: Added volume mount to docker-compose.yml:

```yaml
services:
  transcription:
    volumes:
      - ./outputs:/app/outputs
```

Key Learnings

  • FlashAttention-2 is consistently the highest-impact optimization for H100 transformer inference
  • Local storage + static serving is simpler and faster than S3 for development and intranet deployments
  • Docker volume mounts for ML outputs are essential — containers are ephemeral, outputs are not

Session Date: Early 2026 | Hardware: NVIDIA H100 | Model: Whisper/HuggingFace ASR