
A practical guide to enforcing structured JSON output from HuggingFace models using prefix_allowed_tokens_fn — covering lm-format-enforcer, outlines, and a manual LogitsProcessor approach.
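The core of the manual approach is a callback that `model.generate()` consults at every decoding step. A minimal sketch, with fake token IDs standing in for what a real tokenizer would supply (e.g. the IDs for `"{"`):

```python
# Illustrative sketch of the manual approach: a prefix_allowed_tokens_fn that
# forces the first generated token to open a JSON object. Token IDs are fake;
# in practice a real tokenizer (e.g. tokenizer.encode("{")) supplies them.

def make_prefix_fn(prompt_len, json_open_ids, vocab_ids):
    """Build the callback that model.generate() expects.

    HF calls it as fn(batch_id, input_ids) before each new token and keeps
    only the returned IDs; returning the full vocabulary means "no constraint".
    """
    def prefix_allowed_tokens_fn(batch_id, input_ids):
        if len(input_ids) == prompt_len:   # about to emit the first new token
            return json_open_ids           # only "{" is allowed here
        return vocab_ids                   # unconstrained afterwards
    return prefix_allowed_tokens_fn
```

The callback is passed as `model.generate(**inputs, prefix_allowed_tokens_fn=fn)`; lm-format-enforcer and outlines effectively build far more sophisticated versions of this same function, tracking a JSON-schema state machine instead of a single hard-coded position.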

Deployed Meta's Llama 3.1 8B Instruct behind a self-hosted FastAPI inference server backing several internal tools, with the full API documented and verified end-to-end.
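The shape of such a server, as a hedged sketch — the route name and payload fields here are assumptions, not the documented API, and the model call is injected so the skeleton stays self-contained:

```python
# Hedged sketch of a FastAPI inference endpoint. Route and payload shape are
# illustrative; the real generate_fn would wrap the loaded Llama 3.1 8B
# Instruct model's generate() call.

def build_app(generate_fn):
    """Create the FastAPI app around an injected generate_fn(prompt, max_new_tokens)."""
    from fastapi import FastAPI          # deferred so the sketch imports without fastapi
    from pydantic import BaseModel

    app = FastAPI(title="Llama 3.1 8B Instruct server")

    class GenerateRequest(BaseModel):
        prompt: str
        max_new_tokens: int = 256

    @app.post("/generate")
    def generate(req: GenerateRequest):
        return {"completion": generate_fn(req.prompt, req.max_new_tokens)}

    return app
```

Injecting `generate_fn` keeps the HTTP layer testable without a GPU: unit tests can pass a stub, while production passes a closure over the loaded model.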

With agentic AI becoming central to production systems, we set out on a structured survey of the major agentic frameworks on the market. The approach: build working demos with each framework usi…

OpenClaw is an AI agent framework. The goal was to run it in an isolated Docker environment with the current working directory bound as a volume — so agent-generated outputs would persist on the host filesystem.
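The isolation setup boils down to one `docker run` invocation. A sketch of how that command can be assembled — the image name and the `--network none` flag are assumptions about what "isolated" means here, not taken from the post:

```python
# Hedged sketch: assemble the docker run command that binds the current
# working directory into the container so agent outputs persist on the host.
# Image name and --network none are illustrative assumptions.
import os

def openclaw_run_command(image="openclaw:latest", workdir="/workspace"):
    host_dir = os.getcwd()
    return [
        "docker", "run", "--rm", "-it",
        "--network", "none",              # one way to isolate the agent
        "-v", f"{host_dir}:{workdir}",    # bind mount: outputs survive the container
        "-w", workdir,
        image,
    ]
```

Because the bind mount maps the host directory read-write, anything the agent writes under `/workspace` lands directly in the host's current directory.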

Running transcription at scale on an H100 GPU required both model optimizations and infrastructure decisions around where to store output audio files.

This is a real-time AI-powered sales coaching system built for JustDial's telemarketing team. It listens to live phone calls and delivers instant AI-generated coaching insights to agent dashboards — while the call is st…

With an NVIDIA H100 GPU running production inference workloads, we needed visibility into GPU utilization, memory usage, and power draw — in real time, with historical logging.
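One way to get both real-time visibility and a historical record is to poll `nvidia-smi`'s CSV query interface and append rows to a log file. A sketch under those assumptions (query fields are real `nvidia-smi` flags; the file path and interval are illustrative):

```python
# Sketch of a GPU metrics logger built on nvidia-smi's CSV query interface.
# Log path and polling interval are illustrative choices.
import csv
import subprocess
import time

QUERY = "timestamp,utilization.gpu,memory.used,power.draw"
FIELDS = ["timestamp", "util_pct", "mem_mib", "power_w"]

def parse_row(line):
    """Parse one nvidia-smi CSV row (with nounits) into typed fields."""
    timestamp, util, mem, power = [f.strip() for f in line.strip().split(",")]
    return {"timestamp": timestamp, "util_pct": float(util),
            "mem_mib": float(mem), "power_w": float(power)}

def sample_gpu():
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_row(out)

def log_forever(path="gpu_metrics.csv", interval_s=5):
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:
            writer.writeheader()
        while True:
            writer.writerow(sample_gpu())
            f.flush()                 # survive crashes mid-run
            time.sleep(interval_s)
```

Appending with a flush per sample keeps the history intact even if the logger dies, and the CSV loads straight into pandas for plotting utilization over time.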

JustDial's telemarketing team needed automated AI analysis of call transcripts. The system had to evaluate agent quality across multiple dimensions: sentiment, disposition accuracy, and pitched product identification — a…

This was an ambitious project to build a real-time AI voice agent system capable of handling phone calls via Twilio SIP, displaying as a speaking avatar (Simli), supporting multi-agent handoff, and eventually evolving in…

The goal was to take a remote MP3 audio URL (like a call recording from an internal PBX system) and transcribe it in real time — simulating live call transcription even though the audio is pre-recorded. The challenge was…

We needed to deploy Sarvam-M, an Indian multilingual LLM from SarvamAI, on an NVIDIA H100 GPU for production inference. What seemed like a simple model-loading task turned into a series of environment and optimization ch…

How we optimized LLM inference on H100 GPUs with FlashAttention-2 and bfloat16 — and the environment pitfalls we hit along the way.
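In `transformers`, both optimizations come down to two keyword arguments at load time. A hedged sketch of that configuration — the model ID is left as a parameter, and the imports are deferred so the config stays inspectable without torch installed:

```python
# Hedged sketch of an H100-oriented load configuration. FlashAttention-2
# requires the flash-attn package and an Ampere-or-newer GPU; bfloat16 halves
# memory versus float32 and is natively supported on H100.

LOAD_KWARGS = {
    "torch_dtype": "bfloat16",                    # resolved to torch.bfloat16 below
    "attn_implementation": "flash_attention_2",   # falls back loudly if flash-attn is absent
    "device_map": "cuda",
}

def load_model(model_id: str):
    # Imports deferred so the kwargs above are usable without torch/transformers.
    import torch
    from transformers import AutoModelForCausalLM
    kwargs = dict(LOAD_KWARGS,
                  torch_dtype=getattr(torch, LOAD_KWARGS["torch_dtype"]))
    return AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
```

If `flash-attn` is missing or the GPU is unsupported, `from_pretrained` raises rather than silently degrading, which is exactly the kind of environment pitfall the post's title refers to.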