Building an H100 GPU Monitoring Dashboard with Streamlit
Project: GPU Utilization Dashboard
Tech Stack: Python, Streamlit, pynvml, pandas, CSV logging
Category: MLOps, Infrastructure Monitoring
Background
With an NVIDIA H100 GPU running production inference workloads, we needed visibility into GPU utilization, memory usage, and power draw — in real time, with historical logging.
Challenge 1: Structuring Logs for Long-Term Retention
Problem: Logs were being written to a single gpu_log.csv file that would grow indefinitely. No organization by date, no rotation, no cleanup of old data.
Solution: Implemented a date-based folder structure with 30-day retention:
logs/
├── 2026_03_23/
│   └── gpu_metrics.csv
├── 2026_03_24/
│   └── gpu_metrics.csv
└── ... (30 days kept, older auto-deleted)
Implementation:
import shutil
from datetime import datetime, timedelta
from pathlib import Path

def get_log_path():
    """Return today's CSV path, creating its folder and pruning old ones."""
    today = datetime.now().strftime("%Y_%m_%d")
    log_dir = Path("logs") / today
    log_dir.mkdir(parents=True, exist_ok=True)
    # Clean up logs older than 30 days
    cutoff = datetime.now() - timedelta(days=30)
    for d in Path("logs").iterdir():
        if d.is_dir():
            try:
                dir_date = datetime.strptime(d.name, "%Y_%m_%d")
            except ValueError:
                continue  # skip folders that are not named by date
            if dir_date < cutoff:
                shutil.rmtree(d)
    return log_dir / "gpu_metrics.csv"
Challenge 2: Removing Dead Code
Problem: The project had a main.py alongside gpu_dashboard.py. main.py was the old version, with no threaded logging, and its presence created confusion about which file to run.
Solution: Deleted main.py entirely. gpu_dashboard.py, with threaded background logging and the date-based log structure, is the canonical version.
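The threaded background logging mentioned above could look roughly like the sketch below. It is not the project's actual code: start_background_logger, sample_fn, and path_fn are hypothetical names, and the sampler is assumed to return a flat dict of metric values.

```python
import csv
import threading
import time
from pathlib import Path


def start_background_logger(sample_fn, path_fn, interval_s=10.0):
    """Start a daemon thread that appends one CSV row per interval.

    sample_fn: hypothetical callable returning a dict of metric name -> value.
    path_fn:   callable returning the CSV path to write to (e.g. get_log_path).
    Returns (thread, stop_event); set the event to stop the loop.
    """
    stop_event = threading.Event()

    def loop():
        while not stop_event.is_set():
            row = {"timestamp": time.strftime("%Y-%m-%d %H:%M:%S"), **sample_fn()}
            path = Path(path_fn())
            # Write the header only when starting a new (or empty) file.
            write_header = not path.exists() or path.stat().st_size == 0
            with path.open("a", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=list(row))
                if write_header:
                    writer.writeheader()
                writer.writerow(row)
            # Sleeps for interval_s but wakes early if stop_event is set.
            stop_event.wait(interval_s)

    thread = threading.Thread(target=loop, name="gpu-logger", daemon=True)
    thread.start()
    return thread, stop_event
```

Because path_fn is called on every iteration, the logger would roll over to a new date folder at midnight without a restart. In a Streamlit app, one option is to call this from a function decorated with st.cache_resource so the thread is started once per server process and not on every rerun.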
What the Dashboard Monitors
| Metric | Update Interval | Retention |
|---|---|---|
| GPU Utilization (%) | 10 seconds | 30 days |
| Memory Used (GB / 81GB) | 10 seconds | 30 days |
| Power Draw (W) | 10 seconds | 30 days |
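As a rough illustration of how the three metrics in the table can be read with pynvml, here is a minimal sketch. The function names and column names are hypothetical, and reporting memory in decimal GB (bytes / 1e9) is an assumption about how the dashboard converts units.

```python
def to_row(util_pct, mem_used_bytes, power_mw):
    """Convert raw NVML readings to dashboard units (column names are hypothetical)."""
    return {
        "gpu_util_pct": util_pct,
        "mem_used_gb": round(mem_used_bytes / 1e9, 2),  # NVML reports bytes
        "power_w": round(power_mw / 1000, 1),           # NVML reports milliwatts
    }


def read_gpu_metrics(index=0):
    """Read one sample from the GPU at `index` via pynvml."""
    import pynvml  # imported lazily so to_row() works on machines without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        mem_used = pynvml.nvmlDeviceGetMemoryInfo(handle).used
        power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)
        return to_row(util, mem_used, power_mw)
    finally:
        pynvml.nvmlShutdown()
```

Initializing and shutting down NVML on every call keeps the sketch self-contained. A long-running logger would more likely call nvmlInit once at startup and reuse the handle.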
Live charts are rendered in Streamlit, with a 7-day history view.
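One way the 7-day history view could be assembled from the date-based folders is sketched below. recent_log_files and load_history are hypothetical helpers, not the dashboard's actual functions, and the sketch assumes each daily CSV has the same columns.

```python
from datetime import date, timedelta
from pathlib import Path


def recent_log_files(root="logs", days=7, today=None):
    """Return existing gpu_metrics.csv paths for the last `days` days, oldest first."""
    today = today or date.today()
    paths = []
    for offset in range(days - 1, -1, -1):
        day = today - timedelta(days=offset)
        candidate = Path(root) / day.strftime("%Y_%m_%d") / "gpu_metrics.csv"
        if candidate.exists():
            paths.append(candidate)
    return paths


def load_history(root="logs", days=7):
    """Concatenate recent daily CSVs into one DataFrame (empty if none exist)."""
    import pandas as pd  # imported lazily; only needed when building chart data

    files = recent_log_files(root, days)
    if not files:
        return pd.DataFrame()
    return pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
```

The resulting DataFrame could then be passed to a Streamlit chart call such as st.line_chart. Days with no log folder are skipped, so gaps in collection do not raise errors.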
Key Learnings
- Date-based log folders with auto-cleanup prevent disk bloat in long-running monitoring systems
- Threaded background logging keeps data collection independent of Streamlit's rerun model, which re-executes the script on every interaction
- pynvml is the go-to library for programmatic NVIDIA GPU metrics
Session Date: March 2026 | Hardware: NVIDIA H100 80GB
