12 articles
JSON Constrained Decoding in Transformers (Without SGLang)
02
LLM Deployment · Python · Transformers

A practical guide to enforcing structured JSON output from HuggingFace models using prefix_allowed_tokens_fn — covering lm-format-enforcer, outlines, and a manual LogitsProcessor approach.
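The callback contract behind `prefix_allowed_tokens_fn` can be sketched without loading a model. In this toy version the token ids are hypothetical placeholders (a real enforcer derives them from the tokenizer and tracks JSON parser state per step):

```python
# Sketch of the prefix_allowed_tokens_fn contract used by model.generate().
# Token ids below are hypothetical, not real tokenizer ids.

STRUCTURAL = {123, 125, 91, 93, 58, 44}   # pretend ids for { } [ ] : ,
STRING_TOKENS = {1001, 1002, 1003}         # pretend ids for string pieces

def prefix_allowed_tokens_fn(batch_id, input_ids):
    """Return the token ids allowed at the next decoding step.

    transformers calls this once per step; logits of every other
    token are masked to -inf before sampling. (In real usage
    input_ids is a torch tensor; a list stands in for the sketch.)
    """
    # lm-format-enforcer / outlines would inspect input_ids and a JSON
    # grammar state here; this toy always allows the same fixed set.
    return sorted(STRUCTURAL | STRING_TOKENS)

# Usage (not run here):
# output = model.generate(
#     **inputs,
#     prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
#     max_new_tokens=256,
# )
```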

Deploying Llama 8B as a FastAPI Inference Server
03
LLM Deployment · Python · FastAPI

Deployed Meta's Llama 3.1 8B Instruct as a self-hosted FastAPI inference server backing various internal tools, with the full API documented and working end-to-end.

Learning Modern Agentic Frameworks: A Practical Hands-On Guide
04
AI Agents · Python · Anthropic Claude API

With agentic AI becoming central to production systems, we embarked on a structured learning journey through all major agentic frameworks available in the market. The approach: build working demos with each framework usi…

Running OpenClaw AI Agent in an Isolated Docker Environment
05
DevOps · Docker · Docker Compose

OpenClaw is an AI agent framework. The goal was to run it in an isolated Docker environment with the current working directory bound as a volume — so agent-generated outputs would persist on the host filesystem.
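The volume-bind pattern can be sketched as a docker-compose fragment. Service name, image tag, and the `network_mode` line are assumptions for illustration, not the project's actual config:

```yaml
# docker-compose.yml sketch — service and image names are hypothetical
services:
  openclaw:
    image: openclaw/agent:latest   # hypothetical tag
    working_dir: /workspace
    volumes:
      - .:/workspace               # bind the current dir so agent outputs persist on the host
    network_mode: "none"           # optional: also isolate from the network
```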

Transcription at Scale: H100 Optimizations and Switching from AWS to Local Storage
06
ASR · Python · FastAPI

Running transcription at scale on an H100 GPU required both model optimizations and infrastructure decisions around where to store output audio files.

Real-Time AI Sales Coaching for Telemarketing Agents
07
Real-Time AI · Node.js · TypeScript

This is a real-time AI-powered sales coaching system built for JustDial's telemarketing team. It listens to live phone calls and delivers instant AI-generated coaching insights to agent dashboards — while the call is st…

Building an H100 GPU Monitoring Dashboard with Streamlit
08
MLOps · Python · Streamlit

With an NVIDIA H100 GPU running production inference workloads, we needed visibility into GPU utilization, memory usage, and power draw — in real time, with historical logging.
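A minimal sketch of the polling side of such a dashboard, assuming `nvidia-smi` is the data source (the field names follow its `--query-gpu` options; only the CSV parsing runs without a GPU):

```python
# Poll nvidia-smi for the metrics the dashboard charts:
# GPU utilization, memory usage, and power draw.
import csv
import io
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total,power.draw"

def parse_gpu_csv(line: str) -> dict:
    """Parse one row of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`."""
    util, mem_used, mem_total, power = next(csv.reader(io.StringIO(line)))
    return {
        "util_pct": float(util),
        "mem_used_mib": float(mem_used),
        "mem_total_mib": float(mem_total),
        "power_w": float(power),
    }

def sample_gpu() -> dict:
    """Take one live sample (requires an NVIDIA driver; not run here)."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_csv(out.splitlines()[0])
```

A Streamlit loop would call `sample_gpu()` on an interval and append each dict to a history for the charts.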

Building a Multi-Endpoint LLM Call Analysis API with Llama 3.1
09
LLM Applications · Python · FastAPI

JustDial's telemarketing team needed automated AI analysis of call transcripts. The system had to evaluate agent quality across multiple dimensions: sentiment, disposition accuracy, and pitched product identification — a…

Building a Full-Stack LiveKit Voice AI Agent with Twilio SIP and Simli Avatars
10
Voice AI · Python · LiveKit

This was an ambitious project to build a real-time AI voice agent system capable of handling phone calls via Twilio SIP, displaying as a speaking avatar (Simli), supporting multi-agent handoff, and eventually evolving in…

Building a Real-Time STT Pipeline from Remote Audio URLs Using Sarvam AI
11
Speech-to-Text · Python · asyncio

The goal was to take a remote MP3 audio URL (like a call recording from an internal PBX system) and transcribe it in real-time — simulating live call transcription even though the audio is pre-recorded. The challenge was…
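The pacing trick can be sketched in plain asyncio: chunk the downloaded PCM bytes and push each chunk to the STT sender with a real-time sleep, so a pre-recorded file arrives like a live call. The sample rate and chunk size are assumptions for illustration:

```python
# Simulate "live" audio by feeding pre-recorded PCM to an async sender
# (e.g. a Sarvam streaming websocket) in fixed-size, real-time-paced chunks.
import asyncio

BYTES_PER_SEC = 32000  # assumed: 16 kHz, 16-bit mono PCM
CHUNK_MS = 100

def chunk_audio(data: bytes, chunk_ms: int = CHUNK_MS) -> list:
    """Split raw PCM into chunk_ms-sized pieces."""
    step = BYTES_PER_SEC * chunk_ms // 1000
    return [data[i:i + step] for i in range(0, len(data), step)]

async def stream_chunks(data: bytes, send, realtime: bool = True):
    """Push chunks to the async `send` callable, paced like a live call."""
    for chunk in chunk_audio(data):
        await send(chunk)
        if realtime:
            # Sleep one chunk's duration so audio arrives at playback speed.
            await asyncio.sleep(CHUNK_MS / 1000)
```

Setting `realtime=False` replays the file as fast as possible, which is handy for batch back-testing the same pipeline.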

Running Sarvam-M LLM on NVIDIA H100: From Zero to Inference
12
LLM Deployment · Python · PyTorch

We needed to deploy Sarvam-M, an Indian multilingual LLM from SarvamAI, on an NVIDIA H100 GPU for production inference. What seemed like a simple model-loading task turned into a series of environment and optimization ch…

Running Sarvam-M LLM on NVIDIA H100: From Zero to Inference
13
LLM Deployment · H100 · FlashAttention-2

How we optimized LLM inference on H100 GPUs with FlashAttention-2 and bfloat16 — and the environment pitfalls we hit along the way.
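The H100 load path can be sketched as a dtype guard plus the `from_pretrained` call. The guard is the runnable part; the hub id and commented call are assumptions for illustration:

```python
# bfloat16 needs compute capability >= 8.0 (Ampere/Hopper);
# older GPUs should fall back to float16.

def pick_dtype(compute_capability: tuple) -> str:
    """Return the dtype name to pass as torch_dtype (string form for the sketch)."""
    return "bfloat16" if compute_capability >= (8, 0) else "float16"

# Usage with transformers (not run here; hub id is an assumption):
# model = AutoModelForCausalLM.from_pretrained(
#     "sarvamai/sarvam-m",
#     torch_dtype=torch.bfloat16,               # pick_dtype((9, 0)) on an H100
#     attn_implementation="flash_attention_2",  # requires the flash-attn package
#     device_map="cuda",
# )
```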