Building an H100 GPU Monitoring Dashboard with Streamlit

Project: GPU Utilization Dashboard

Tech Stack: Python, Streamlit, pynvml, pandas, CSV logging

Category: MLOps, Infrastructure Monitoring


Background

With an NVIDIA H100 GPU running production inference workloads, we needed real-time visibility into GPU utilization, memory usage, and power draw, along with historical logging.


Challenge 1: Structuring Logs for Long-Term Retention

Problem: Logs were being written to a single gpu_log.csv file that would grow indefinitely. No organization by date, no rotation, no cleanup of old data.

Solution: Implemented a date-based folder structure with 30-day retention:

logs/
├── 2026_03_23/
│   └── gpu_metrics.csv
├── 2026_03_24/
│   └── gpu_metrics.csv
└── ...  (30 days kept, older auto-deleted)

Implementation:

python
import shutil
from datetime import datetime, timedelta
from pathlib import Path


def get_log_path():
    """Return today's CSV path, creating its folder and pruning old ones."""
    today = datetime.now().strftime("%Y_%m_%d")
    log_dir = Path("logs") / today
    log_dir.mkdir(parents=True, exist_ok=True)
    # Clean up log folders older than 30 days
    cutoff = datetime.now() - timedelta(days=30)
    for d in Path("logs").iterdir():
        if d.is_dir():
            try:
                dir_date = datetime.strptime(d.name, "%Y_%m_%d")
            except ValueError:
                continue  # not a date-named folder; leave it alone
            if dir_date < cutoff:
                shutil.rmtree(d)
    return log_dir / "gpu_metrics.csv"
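
One way the returned path might be used is to append a row per sample, writing the header only when the day's file is new. This is a minimal sketch: append_sample and the column names are hypothetical, since the project's actual CSV schema is not shown here.

```python
import csv
from datetime import datetime
from pathlib import Path

# Hypothetical column names; the project's real schema may differ.
FIELDS = ["timestamp", "gpu_util_pct", "mem_used_gb", "power_w"]


def append_sample(csv_path, gpu_util_pct, mem_used_gb, power_w):
    """Append one row, writing the header first if the file is new or empty."""
    csv_path = Path(csv_path)
    is_new = not csv_path.exists() or csv_path.stat().st_size == 0
    with csv_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now().isoformat(timespec="seconds"),
            "gpu_util_pct": gpu_util_pct,
            "mem_used_gb": mem_used_gb,
            "power_w": power_w,
        })
```

Calling append_sample(get_log_path(), ...) on each tick would roll over to a new folder at midnight, because the path is recomputed on every call.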

Challenge 2: Removing Dead Code

Problem: The project had two entry points: gpu_dashboard.py and main.py, an older version with no threaded logging. It was unclear which file to run.

Solution: Deleted main.py entirely. gpu_dashboard.py, which has threaded background logging and the date-based log structure, is the canonical version.
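
The contents of gpu_dashboard.py are not shown here, so the following is only a sketch of what threaded background logging can look like, not the project's code. start_logger and sample_fn are hypothetical names; the 10-second default matches the update interval listed further down.

```python
import threading


def start_logger(sample_fn, interval_s=10.0):
    """Run sample_fn every interval_s seconds on a daemon thread.

    Returns (thread, stop_event); call stop_event.set() to end the loop.
    """
    stop_event = threading.Event()

    def loop():
        while not stop_event.is_set():
            try:
                sample_fn()
            except Exception as exc:  # keep logging alive after a bad sample
                print(f"sample failed: {exc}")
            # wait() doubles as the sleep and returns early once stop is set
            stop_event.wait(interval_s)

    thread = threading.Thread(target=loop, name="gpu-logger", daemon=True)
    thread.start()
    return thread, stop_event
```

Because Streamlit re-executes the script on each interaction, the start call would need a guard so only one thread is created. Wrapping it in a function decorated with st.cache_resource is one plausible option.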


What the Dashboard Monitors

Metric                     Update Interval   Retention
GPU Utilization (%)        10 seconds        30 days
Memory Used (GB / 81GB)    10 seconds        30 days
Power Draw (W)             10 seconds        30 days
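
As a rough illustration of where these three metrics come from, the sketch below reads them through pynvml and converts NVML's raw units (bytes, milliwatts). The function names are hypothetical, and the decimal-GB divisor is an assumption, since the write-up does not say which unit convention the dashboard uses.

```python
def to_row(util_pct, mem_used_bytes, power_mw):
    """Convert raw NVML units into display units (assumed: decimal GB, watts)."""
    return {
        "gpu_util_pct": util_pct,
        "mem_used_gb": round(mem_used_bytes / 1e9, 2),  # bytes -> decimal GB
        "power_w": round(power_mw / 1000, 1),           # milliwatts -> watts
    }


def read_gpu(index=0):
    """Read one sample from NVML; requires an NVIDIA driver and pynvml."""
    import pynvml  # imported lazily so to_row() is usable without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # percent
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle).used        # bytes
        power = pynvml.nvmlDeviceGetPowerUsage(handle)           # milliwatts
        return to_row(util, mem, power)
    finally:
        pynvml.nvmlShutdown()
```

Initializing and shutting down NVML on every read keeps the sketch simple. A long-running logger might prefer to initialize once and reuse the handle.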

Live charts via Streamlit with 7-day history view.
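
With the date-based layout, a 7-day view only needs the last seven dated folders. The sketch below collects those files under that assumption; recent_log_files is a hypothetical helper, not taken from the project.

```python
from datetime import date, timedelta
from pathlib import Path


def recent_log_files(log_root="logs", days=7, today=None):
    """Return existing gpu_metrics.csv paths for the last `days` days, oldest first."""
    today = today or date.today()
    paths = []
    for offset in range(days - 1, -1, -1):
        day = today - timedelta(days=offset)
        candidate = Path(log_root) / day.strftime("%Y_%m_%d") / "gpu_metrics.csv"
        if candidate.exists():  # skip days with no log (e.g. host was down)
            paths.append(candidate)
    return paths
```

When the list is non-empty, the files could be combined with pd.concat(pd.read_csv(p) for p in paths) and the result passed to st.line_chart.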


Key Learnings

  • Date-based log folders with auto-cleanup prevent disk bloat in long-running monitoring systems
  • Threaded background logging keeps data collection independent of Streamlit's rerun model, which re-executes the script on every user interaction
  • pynvml is the go-to library for programmatic NVIDIA GPU metrics

Session Date: March 2026 | Hardware: NVIDIA H100 80GB