Building a Real-Time STT Pipeline from Remote Audio URLs Using Sarvam AI
Project: Sarvam STT API
Tech Stack: Python, asyncio, ffmpeg, WebSockets, SarvamAI SDK
Category: Speech-to-Text, Real-Time Audio Processing
Background
The goal was to take a remote MP3 audio URL (such as a call recording from an internal PBX system) and transcribe it in real time, simulating live call transcription even though the audio is pre-recorded. The challenge was streaming pre-recorded audio through an API that expects live WebSocket audio chunks.
Challenge 1: Designing the Simulated Real-Time Pipeline
Problem: Sarvam's streaming STT API expects live WebSocket chunks of PCM audio at 16kHz. But we had remote MP3 files, not live microphone input.
Solution: Claude designed a pipeline using ffmpeg as the bridge:
Remote URL → ffmpeg (decode + resample to 16kHz PCM) → 100ms chunks → Sarvam WebSocket → transcripts
Key implementation details:
- `ffmpeg` subprocess streams raw PCM to stdout asynchronously
- Chunks are sized at 3200 bytes = exactly 100ms at 16kHz 16-bit mono
- A sender coroutine and receiver coroutine run concurrently via `asyncio`
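The chunk size and the sender/receiver split described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the `asyncio.Queue` named `link` is a hypothetical stand-in for the Sarvam WebSocket, and `iter_chunks`, `sender`, `receiver`, and `run` are names invented for this example.

```python
import asyncio

SAMPLE_RATE_HZ = 16_000   # sample rate requested from ffmpeg
BYTES_PER_SAMPLE = 2      # signed 16-bit PCM (s16le)
CHANNELS = 1              # mono
CHUNK_MS = 100            # audio duration carried by each chunk

# 16000 samples/s * 2 bytes * 1 channel * 0.1 s = 3200 bytes
CHUNK_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * CHUNK_MS // 1000


async def iter_chunks(reader: asyncio.StreamReader, size: int = CHUNK_BYTES):
    """Yield `size`-byte chunks from `reader`; the last one may be shorter."""
    while True:
        try:
            yield await reader.readexactly(size)
        except asyncio.IncompleteReadError as exc:
            # EOF arrived before a full chunk: emit the remainder, if any.
            if exc.partial:
                yield exc.partial
            return


async def sender(reader: asyncio.StreamReader, link: asyncio.Queue) -> None:
    """Push PCM chunks into `link` (hypothetical stand-in for the WebSocket)."""
    async for chunk in iter_chunks(reader):
        await link.put(chunk)
    await link.put(None)  # sentinel: no more audio


async def receiver(link: asyncio.Queue, seen: list) -> None:
    """Drain `link` until the sentinel; a real receiver would read transcripts."""
    while (chunk := await link.get()) is not None:
        seen.append(len(chunk))


async def run(pcm: bytes) -> list:
    """Feed `pcm` through a sender and a receiver running concurrently."""
    reader = asyncio.StreamReader()
    reader.feed_data(pcm)
    reader.feed_eof()
    link: asyncio.Queue = asyncio.Queue()
    seen: list = []
    await asyncio.gather(sender(reader, link), receiver(link, seen))
    return seen
```

Calling `asyncio.run(run(bytes(7000)))` splits 7000 bytes of silence into two full chunks and one short tail. In the pipeline itself, `reader` would be `proc.stdout` from the ffmpeg subprocess started as follows: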
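
    # Decode the remote audio and write raw 16 kHz, 16-bit, mono PCM to stdout.
    # Must run inside a coroutine; requires `import asyncio`.
    proc = await asyncio.create_subprocess_exec(
        "ffmpeg", "-i", audio_url,
        "-ar", "16000", "-ac", "1", "-f", "s16le", "pipe:1",
        stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.DEVNULL
    )

Challenge 2: Missing `sarvamai` Module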
Problem:

    ModuleNotFoundError: No module named 'sarvamai'

Solution: The package name on PyPI is different from the import name:

    pip install sarvam-ai

Challenge 3: API Key Security
Problem: The user pasted the raw API key into the chat to test.
Solution: Claude refused to write the key to any file. The key was injected only as an environment variable at runtime:

    export SARVAM_API_KEY=your_key_here
    python realtime_transcribe.py http://your-server/call.mp3 en-IN

Challenge 4: Expanding to a Full FastAPI Service
Problem: The script worked but needed to become a deployable API that handles multiple transcription requests, language selection, and proper error handling.
Solution: Built a FastAPI wrapper with:
- Endpoint accepting `audio_url` and `language_code`
- Background task for streaming
- Server-Sent Events (SSE) for real-time transcript delivery to clients
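The SSE delivery step can be illustrated with a small standard-library sketch. It only shows how transcripts might be framed as Server-Sent Events; the queue of transcript strings, the `transcript` and `done` event names, and the helper names `sse_event`, `transcript_events`, and `collect` are assumptions made for this example, not details of the service that was built.

```python
import asyncio
import json


def sse_event(payload: dict, event: str = "transcript") -> str:
    """Format one SSE frame: an event line, a data line, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


async def transcript_events(queue: asyncio.Queue):
    """Yield an SSE frame per transcript on `queue` until a None sentinel."""
    while (text := await queue.get()) is not None:
        yield sse_event({"text": text})
    # Tell the client that the stream finished normally.
    yield sse_event({}, event="done")


async def collect(texts: list) -> list:
    """Helper for trying the generator out without a web server."""
    queue: asyncio.Queue = asyncio.Queue()
    for text in texts:
        queue.put_nowait(text)
    queue.put_nowait(None)
    return [frame async for frame in transcript_events(queue)]
```

In a FastAPI route, a generator like `transcript_events(queue)` could be returned through `StreamingResponse(..., media_type="text/event-stream")`, with the background streaming task putting transcripts on the queue.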
Key Learnings
- `ffmpeg` is the Swiss Army knife for any audio format conversion in Python pipelines
- Async subprocess + async WebSocket is the right pattern for real-time audio streaming
- Always pass API keys via environment variables, never hardcode them
- Pre-recorded audio can simulate real-time by chunking with appropriate sleep intervals
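The last two learnings can be sketched together. This is a hedged illustration rather than the project's code: `require_api_key` and `paced` are hypothetical helper names, and the pacing uses a fixed schedule based on `time.monotonic()` so that small sleep inaccuracies do not accumulate, which is one possible way to pick "appropriate sleep intervals".

```python
import asyncio
import os
import time


def require_api_key(var: str = "SARVAM_API_KEY") -> str:
    """Read the API key from the environment and fail loudly if it is missing."""
    key = os.environ.get(var, "").strip()
    if not key:
        raise RuntimeError(f"{var} is not set")
    return key


async def paced(chunks, interval_s: float = 0.1):
    """Yield each chunk no earlier than its slot on a fixed schedule.

    Chunk i is released at start + i * interval_s, so 100 ms chunks sent at
    interval_s=0.1 approximate the rate of a live audio source.
    """
    start = time.monotonic()
    for i, chunk in enumerate(chunks):
        delay = start + i * interval_s - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        yield chunk
```

A sender coroutine could then iterate with `async for chunk in paced(chunks)` and forward each chunk to the streaming connection, using the key returned by `require_api_key()` when opening it.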
Session Date: Early 2026 | API: SarvamAI Streaming STT
