Building a Real-Time STT Pipeline from Remote Audio URLs Using Sarvam AI

Project: Sarvam STT API

Tech Stack: Python, asyncio, ffmpeg, WebSockets, SarvamAI SDK

Category: Speech-to-Text, Real-Time Audio Processing

Background

The goal was to take a remote MP3 audio URL (such as a call recording from an internal PBX system) and transcribe it in real time, simulating live call transcription even though the audio is pre-recorded. The challenge was feeding pre-recorded audio to an API that expects live audio chunks over a WebSocket.

Challenge 1: Designing the Simulated Real-Time Pipeline

Problem: Sarvam's streaming STT API expects chunks of 16 kHz PCM audio sent live over a WebSocket, but we had remote MP3 files rather than live microphone input.

Solution: Claude designed a pipeline using ffmpeg as the bridge:

Remote URL → ffmpeg (decode + resample to 16kHz PCM) → 100ms chunks → Sarvam WebSocket → transcripts

Key implementation details:

ffmpeg subprocess streams raw PCM to stdout asynchronously
Chunks are sized at 3200 bytes, which is exactly 100 ms of 16 kHz, 16-bit mono audio (16,000 samples/s × 2 bytes/sample × 0.1 s)
A sender coroutine and receiver coroutine run concurrently via asyncio

python

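# Runs inside an async function (requires `import asyncio`).
# ffmpeg fetches and decodes the remote file, then writes raw PCM to stdout.
proc = await asyncio.create_subprocess_exec(
    "ffmpeg", "-i", audio_url,
    # Resample to 16 kHz, downmix to mono, emit signed 16-bit little-endian PCM.
    "-ar", "16000", "-ac", "1", "-f", "s16le", "pipe:1",
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.DEVNULL,  # discard ffmpeg's log output
)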
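
The chunking and the concurrent sender/receiver described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names chunk_size_bytes, read_chunks, sender, receiver and run_demo are hypothetical, and an asyncio.Queue stands in for the Sarvam WebSocket so the example runs without network access or an API key. In the real pipeline, the stream would be proc.stdout and the queue operations would be WebSocket send and receive calls.

```python
import asyncio

SAMPLE_RATE = 16_000  # Hz
SAMPLE_WIDTH = 2      # bytes per sample (16-bit)
CHANNELS = 1          # mono
CHUNK_MS = 100        # chunk duration in milliseconds


def chunk_size_bytes(sample_rate=SAMPLE_RATE, sample_width=SAMPLE_WIDTH,
                     channels=CHANNELS, chunk_ms=CHUNK_MS):
    """Bytes of raw PCM needed to cover chunk_ms of audio."""
    return sample_rate * sample_width * channels * chunk_ms // 1000


async def read_chunks(stream, size):
    """Yield fixed-size chunks from an asyncio.StreamReader.

    The final chunk may be shorter than `size` when the stream ends.
    """
    while True:
        try:
            yield await stream.readexactly(size)
        except asyncio.IncompleteReadError as exc:
            if exc.partial:
                yield exc.partial
            return


async def sender(stream, outbox, size, pace_s):
    """Forward chunks at roughly real-time pace, then signal end of audio."""
    async for chunk in read_chunks(stream, size):
        await outbox.put(chunk)      # stand-in for a WebSocket send
        await asyncio.sleep(pace_s)  # simulate live capture timing
    await outbox.put(None)           # sentinel: no more audio


async def receiver(inbox, results):
    """Consume messages until the sentinel arrives.

    A real receiver would read transcript messages from the WebSocket;
    here it only records the size of each chunk it sees.
    """
    while (message := await inbox.get()) is not None:
        results.append(len(message))


async def run_demo(pcm, pace_s=0.0):
    """Run sender and receiver concurrently over an in-memory PCM buffer."""
    stream = asyncio.StreamReader()
    stream.feed_data(pcm)
    stream.feed_eof()
    queue = asyncio.Queue()
    results = []
    await asyncio.gather(
        sender(stream, queue, chunk_size_bytes(), pace_s),
        receiver(queue, results),
    )
    return results
```

Calling asyncio.run(run_demo(pcm_bytes, pace_s=0.1)) would replay a buffer at approximately real-time speed; pace_s=0.0 is convenient for quick checks.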

Challenge 2: Missing `sarvamai` Module

Problem:

ModuleNotFoundError: No module named 'sarvamai'

Solution: The package name on PyPI is different from the import name:

bash

pip install sarvam-ai
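
If the error persists after installing, one common cause is that pip installed the package into a different interpreter or virtual environment than the one running the script. The check below is a small sketch using only the standard library; is_importable is a hypothetical helper name.

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if the current interpreter can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Shows which Python is running and whether it can see the SDK.
    print("interpreter:", sys.executable)
    print("sarvamai importable:", is_importable("sarvamai"))
```

Running the install as python -m pip install with the same python used to launch the script avoids that mismatch.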

Challenge 3: API Key Security

Problem: The user pasted the raw API key into the chat to test.

Solution: Claude refused to write the key to any file. The key was injected only as an environment variable at runtime:

bash

export SARVAM_API_KEY=your_key_here
python realtime_transcribe.py http://your-server/call.mp3 en-IN
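
On the Python side, the script can read the key from the environment and fail fast if it is missing. The following is a sketch under assumptions: load_api_key and redact are hypothetical helper names, and only the variable name SARVAM_API_KEY comes from the command above.

```python
import os


def load_api_key(env=None, var_name="SARVAM_API_KEY"):
    """Read the API key from the environment; raise if it is absent or blank."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before running the script"
        )
    return key


def redact(key, visible=4):
    """Mask all but the last few characters so the key is safe to log."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]
```

Logging redact(key) instead of the key itself keeps the secret out of terminal history and log files.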

Challenge 4: Expanding to a Full FastAPI Service

Problem: The script worked but needed to become a deployable API that handles multiple transcription requests, language selection, and proper error handling.

Solution: Built a FastAPI wrapper with:

Endpoint accepting audio_url and language_code
Background task for streaming
Server-Sent Events (SSE) for real-time transcript delivery to clients
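
The SSE delivery step can be illustrated with the standard library alone. This is a sketch, not the service's actual code: format_sse, sse_stream and demo are hypothetical names, the event names "transcript" and "done" are assumptions, and an asyncio.Queue stands in for whatever the background task uses to hand transcripts to the HTTP layer.

```python
import asyncio
import json
from typing import AsyncIterator, Optional


def format_sse(payload: dict, event: Optional[str] = None) -> str:
    """Serialize one Server-Sent Events frame.

    A frame is an optional "event:" line, a "data:" line, and a blank line.
    """
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append("data: " + json.dumps(payload, ensure_ascii=False))
    return "\n".join(lines) + "\n\n"


async def sse_stream(transcripts: asyncio.Queue) -> AsyncIterator[str]:
    """Yield SSE frames for each transcript until a None sentinel arrives."""
    while True:
        item = await transcripts.get()
        if item is None:
            yield format_sse({}, event="done")
            return
        yield format_sse({"text": item}, event="transcript")


async def demo():
    """Feed two transcripts and the sentinel, then collect the frames."""
    queue = asyncio.Queue()
    for item in ("hello", "world", None):
        queue.put_nowait(item)
    return [frame async for frame in sse_stream(queue)]
```

In a FastAPI endpoint, a generator like sse_stream could be returned through StreamingResponse with media_type="text/event-stream", so clients receive each transcript as soon as it is produced.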

Key Learnings

ffmpeg handles decoding and resampling for practically any input format, so the Python code only ever deals with raw PCM
Async subprocess + async WebSocket is the right pattern for real-time audio streaming
Always supply API keys via environment variables; never hardcode them
Pre-recorded audio can simulate a live stream by sending fixed-duration chunks with a matching sleep between sends
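
One refinement to the sleep-based pacing is worth sketching. Sleeping a fixed 100 ms after every chunk lets the time spent reading and sending accumulate as drift, so playback gradually falls behind real time. Scheduling each chunk against a monotonic clock avoids that. The names delay_until_slot and paced are hypothetical, and this is an illustration rather than the code used in the session.

```python
import asyncio
import time

CHUNK_SECONDS = 0.1  # duration of audio carried by each chunk


def delay_until_slot(start, index, now, chunk_seconds=CHUNK_SECONDS):
    """Seconds to wait before sending chunk `index`.

    Chunk i is due at start + i * chunk_seconds. If that moment has already
    passed, the result is 0.0 so the sender catches up instead of drifting.
    """
    deadline = start + index * chunk_seconds
    return max(0.0, deadline - now)


async def paced(chunks, chunk_seconds=CHUNK_SECONDS):
    """Yield chunks no faster than real time, correcting for processing delay."""
    start = time.monotonic()
    for index, chunk in enumerate(chunks):
        wait = delay_until_slot(start, index, time.monotonic(), chunk_seconds)
        await asyncio.sleep(wait)
        yield chunk
```

A sender loop can then iterate with async for chunk in paced(chunks) and forward each chunk immediately, without a separate sleep call.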

Session Date: Early 2026 | API: SarvamAI Streaming STT