Prerequisites
- Python 3.11 or later
- Docker Engine 24+ with the NVIDIA Container Toolkit installed
- An NVIDIA GPU (A10G, A100, or H100 recommended) or a GPU cloud instance (Lambda Labs, RunPod, CoreWeave)
- A Hugging Face account and token with access to the model you intend to serve
- Familiarity with bash and basic HTTP APIs
- curl and jq available on your host machine
Setup
Install the Python dependencies used by the monitoring client. These run locally against the vLLM server’s HTTP API, so no GPU is required on the machine running the client.
uv pip install requests matplotlib numpy rich
Export the environment variables the scripts will reference. Replace the placeholder values with your own.
export HF_TOKEN="hf_your_token_here"
export VLLM_HOST="http://localhost:8000"
export MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
Step 1: Write the vLLM Server Launch Script
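vLLM exposes KV cache statistics through its /metrics Prometheus endpoint [1]. Per-request reporting is more limited: the usage field of OpenAI-compatible responses always carries prompt_tokens and completion_tokens, but recent vLLM releases add a cached-token breakdown only when the server is also started with --enable-prompt-tokens-details, so this tutorial relies on /metrics for cache statistics. The flags below enable prefix caching and set a generous GPU memory utilization so the cache has room to grow across turns.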
Save the following as launch_vllm.sh:
#!/usr/bin/env bash
# filename: launch_vllm.sh
# Start a vLLM server configured for agentic prefix caching.
set -euo pipefail

MODEL="${MODEL_ID:-mistralai/Mistral-7B-Instruct-v0.3}"
HF_TOKEN="${HF_TOKEN:-}"
GPU_MEM_UTIL="${GPU_MEM_UTIL:-0.90}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
PORT="${PORT:-8000}"

# Fail early: gated models cannot be downloaded without a token.
if [[ -z "$HF_TOKEN" ]]; then
    echo "ERROR: HF_TOKEN is not set." >&2
    exit 1
fi

echo "Launching vLLM server for model: $MODEL"
echo "  GPU memory utilization : $GPU_MEM_UTIL"
echo "  Max model length       : $MAX_MODEL_LEN tokens"
echo "  Prefix caching         : enabled"
echo "  Port                   : $PORT"

# Everything after the image name is passed to the vLLM server itself.
docker run --rm --gpus all \
    --name vllm-agentic \
    -p "${PORT}:8000" \
    -e "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
    -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
    vllm/vllm-openai:latest \
    --model "$MODEL" \
    --gpu-memory-utilization "$GPU_MEM_UTIL" \
    --max-model-len "$MAX_MODEL_LEN" \
    --enable-prefix-caching \
    --disable-log-requests \
    --port 8000
Key flags explained:
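- --enable-prefix-caching: activates vLLM’s automatic prefix caching, which hashes each KV cache block together with the tokens that precede it. Tokens whose block hash matches a cached block are served from VRAM without recomputation.
- --gpu-memory-utilization 0.90: lets the vLLM process use up to 90% of VRAM for model weights, activations, and the KV cache pool. Whatever is left after weights and activation workspace becomes cache blocks, so a higher value leaves more blocks available for multi-turn sessions.
- --max-model-len 8192: caps context length so the cache block table stays within budget on smaller GPUs.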
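The interaction between these flags can be estimated with simple arithmetic. The sketch below is an illustration only, not vLLM's actual allocator: the Mistral-7B shape (32 layers, 8 KV heads, head dimension 128, 2-byte values), the 14.5 GB weight footprint, and the 1.5 GB allowance for activations are assumptions, and both helper names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def estimate_cache_tokens(total_vram_gb: float, gpu_mem_util: float,
                          weights_gb: float, overhead_gb: float,
                          bytes_per_token: int) -> int:
    """Rough count of tokens that fit in the KV cache pool (GB = 1e9 bytes)."""
    budget_gb = total_vram_gb * gpu_mem_util - weights_gb - overhead_gb
    return max(0, int(budget_gb * 1e9) // bytes_per_token)


if __name__ == "__main__":
    # Assumed Mistral-7B attention shape with fp16/bf16 KV values.
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    print(f"KV bytes per token: {per_token}")  # 131072, i.e. 128 KiB
    for util in (0.80, 0.90):
        tokens = estimate_cache_tokens(24, util, 14.5, 1.5, per_token)
        print(f"util={util:.2f}: roughly {tokens} cacheable tokens, "
              f"about {tokens // 8192} full 8192-token contexts")
```

Treat the result as an order-of-magnitude check. The server's startup log reports the number of GPU blocks it actually allocated, and that figure is authoritative.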
To start the server on your GPU machine, run:
bash launch_vllm.sh
Step 2: Write the Prometheus Metrics Scraper
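vLLM exposes a /metrics endpoint in Prometheus text format. The metric vllm:gpu_prefix_cache_hit_rate gives the rolling cache hit rate across all requests. Metric names have changed between vLLM releases, and newer engines may expose prefix-cache query and hit counters instead of a hit-rate gauge, so check the names your server reports with curl "$VLLM_HOST/metrics" and adjust the patterns below if they differ. Save the following as metrics_scraper.py: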
# filename: metrics_scraper.py
"""metrics_scraper.py: parse vLLM Prometheus metrics for KV cache stats."""
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Optional

import requests


@dataclass
class CacheMetrics:
    gpu_prefix_cache_hit_rate: Optional[float] = None
    gpu_cache_usage_perc: Optional[float] = None
    num_running_requests: Optional[int] = None
    num_waiting_requests: Optional[int] = None


# One pattern per metric: name, label set in braces, then the sample value.
_PATTERNS = {
    "gpu_prefix_cache_hit_rate": re.compile(
        r'^vllm:gpu_prefix_cache_hit_rate\{[^}]*\}\s+([\d.eE+\-]+)', re.M
    ),
    "gpu_cache_usage_perc": re.compile(
        r'^vllm:gpu_cache_usage_perc\{[^}]*\}\s+([\d.eE+\-]+)', re.M
    ),
    "num_running": re.compile(
        r'^vllm:num_requests_running\{[^}]*\}\s+([\d.eE+\-]+)', re.M
    ),
    "num_waiting": re.compile(
        r'^vllm:num_requests_waiting\{[^}]*\}\s+([\d.eE+\-]+)', re.M
    ),
}


def scrape(host: str, timeout: float = 5.0) -> CacheMetrics:
    """Fetch /metrics from a running vLLM server and parse KV cache fields."""
    url = f"{host.rstrip('/')}/metrics"
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    text = resp.text

    def _float(key: str) -> Optional[float]:
        # Returns None when the metric is absent from this server's output.
        m = _PATTERNS[key].search(text)
        return float(m.group(1)) if m else None

    return CacheMetrics(
        gpu_prefix_cache_hit_rate=_float("gpu_prefix_cache_hit_rate"),
        gpu_cache_usage_perc=_float("gpu_cache_usage_perc"),
        num_running_requests=int(_float("num_running") or 0),
        num_waiting_requests=int(_float("num_waiting") or 0),
    )
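Because the scraper silently returns None for a metric it cannot find, it can help to list what a server actually exposes before trusting the patterns. The following is a minimal sketch, assuming the standard Prometheus text layout of name, optional labels, then value; SAMPLE_METRICS is a made-up payload and the function names are hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical sample payload; real output has many more series and labels.
SAMPLE_METRICS = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.25
vllm:num_requests_running{model_name="demo"} 2.0
process_cpu_seconds_total 12.5
"""

# Metric name, optional {labels}, then a numeric sample value.
_SAMPLE_LINE = re.compile(
    r'^(vllm:[A-Za-z0-9_:]+)(?:\{[^}]*\})?\s+([0-9.eE+\-]+|NaN|[+-]?Inf)\s*$',
    re.M,
)


def parse_vllm_metrics(text: str) -> Dict[str, float]:
    """Map each vllm:* metric name to the first sample value found for it."""
    values: Dict[str, float] = {}
    for name, raw in _SAMPLE_LINE.findall(text):
        values.setdefault(name, float(raw))
    return values


def cache_related(values: Dict[str, float]) -> List[str]:
    """Names worth checking against the scraper's patterns."""
    return sorted(name for name in values if "cache" in name)


if __name__ == "__main__":
    metrics = parse_vllm_metrics(SAMPLE_METRICS)
    for name in cache_related(metrics):
        print(name, metrics[name])
```

Against a live server, pass the text of a GET request to /metrics in place of SAMPLE_METRICS and compare the printed names with the keys in _PATTERNS.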
Step 3: Write the Multi-Turn Session Simulator
Agentic workloads replay long shared prefixes on every turn, which is exactly where prefix caching pays off and where cache misses cause the TTFT regressions documented in [1]. The simulator below sends a series of chat completions that share a fixed system prompt and a growing conversation history, records the per-request prompt token count (an upper bound on how many tokens the cache could reuse), and scrapes the Prometheus hit rate after each turn. Save the following as session_simulator.py:
# filename: session_simulator.py
"""session_simulator.py: simulate a multi-turn agentic session and collect KV cache metrics."""
from __future__ import annotations

import os
import time
import json
import requests
from dataclasses import dataclass, field
from typing import List, Dict, Any

from metrics_scraper import scrape, CacheMetrics


@dataclass
class TurnRecord:
    turn: int
    prompt_tokens: int
    completion_tokens: int
    latency_s: float
    cache_hit_rate: float
    cache_usage_perc: float


# A realistic agentic system prompt that stays constant across turns.
# In production this would be a tool schema + retrieved context block.
SYSTEM_PROMPT = """
You are a precise coding assistant. You have access to the following tools:
1. search_codebase(query: str) -> List[str]: Returns file paths matching query.
2. read_file(path: str) -> str: Returns file contents.
3. write_file(path: str, content: str) -> bool: Writes content to path.
4. run_tests(test_path: str) -> dict: Runs tests and returns results.
5. git_diff() -> str: Returns current working-tree diff.
Always reason step by step before calling a tool. Prefer minimal diffs.
When uncertain, ask a clarifying question rather than guessing.
""".strip()

# Simulated user turns that build on each other (agentic pattern).
USER_TURNS = [
    "List all Python files in the repository.",
    "Read the contents of src/main.py and summarize what it does.",
    "Find all functions in src/main.py that lack docstrings.",
    "Add a docstring to the `process_batch` function. Show me the diff.",
    "Run the unit tests for src/main.py and report any failures.",
    "Fix the first failing test and show the corrected code.",
    "Commit the changes with an appropriate message. What would you write?",
    "Now check if there are similar undocumented functions in src/utils.py.",
]


def chat_completion(
    host: str,
    model: str,
    messages: List[Dict[str, str]],
    max_tokens: int = 256,
) -> Dict[str, Any]:
    url = f"{host.rstrip('/')}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    resp = requests.post(url, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()


def run_session(
    host: str,
    model: str,
    turns: List[str] = USER_TURNS,
    inter_turn_delay: float = 0.5,
) -> List[TurnRecord]:
    messages: List[Dict[str, str]] = [
        {"role": "system", "content": SYSTEM_PROMPT}
    ]
    records: List[TurnRecord] = []
    for i, user_msg in enumerate(turns):
        messages.append({"role": "user", "content": user_msg})
        t0 = time.perf_counter()
        result = chat_completion(host, model, messages, max_tokens=256)
        latency = time.perf_counter() - t0
        usage = result.get("usage", {})
        prompt_tokens = usage.get("prompt_tokens", 0)
        completion_tokens = usage.get("completion_tokens", 0)
        # Append assistant reply so next turn sees full history.
        assistant_content = result["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": assistant_content})
        # Scrape Prometheus metrics right after the request completes.
        # A failed scrape or a missing metric is recorded as 0.0.
        try:
            cm: CacheMetrics = scrape(host)
            hit_rate = cm.gpu_prefix_cache_hit_rate or 0.0
            usage_perc = cm.gpu_cache_usage_perc or 0.0
        except Exception:
            hit_rate, usage_perc = 0.0, 0.0
        rec = TurnRecord(
            turn=i + 1,
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            latency_s=latency,
            cache_hit_rate=hit_rate,
            cache_usage_perc=usage_perc,
        )
        records.append(rec)
        print(
            f"Turn {rec.turn:2d} | prompt_tokens={rec.prompt_tokens:5d} "
            f"| latency={rec.latency_s:.2f}s "
            f"| cache_hit_rate={rec.cache_hit_rate:.3f} "
            f"| cache_usage={rec.cache_usage_perc:.3f}"
        )
        time.sleep(inter_turn_delay)
    return records
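The latency recorded by run_session covers prefill and decode together, while prefix caching mainly shortens prefill. One way to separate the two is to stream the response and time the first content chunk. The sketch below assumes the server emits OpenAI-style server-sent events ("data: {...}" lines ending with "data: [DONE]"); time_to_first_token and measure_ttft are hypothetical helpers, not part of the tutorial's files.

```python
import json
import time
from typing import Callable, Dict, Iterable, List, Optional


def time_to_first_token(
    sse_lines: Iterable[bytes],
    start: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Seconds from `start` until the first chunk carrying non-empty content."""
    for raw in sse_lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if choices and choices[0].get("delta", {}).get("content"):
            return clock() - start
    return None  # the stream ended without any content


def measure_ttft(host: str, model: str, messages: List[Dict[str, str]],
                 max_tokens: int = 64) -> Optional[float]:
    """Send one streaming chat completion and time its first token."""
    import requests  # imported here so the parser above has no dependencies

    url = f"{host.rstrip('/')}/v1/chat/completions"
    payload = {"model": model, "messages": messages,
               "max_tokens": max_tokens, "temperature": 0.0, "stream": True}
    start = time.perf_counter()
    with requests.post(url, json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        return time_to_first_token(resp.iter_lines(), start)
```

Calling measure_ttft with the same messages list that run_session builds would give a per-turn prefill estimate to plot alongside end-to-end latency.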
Step 4: Write the Plotting and Reporting Module
This module takes the list of TurnRecord objects and produces a single four-panel chart: cache hit rate per turn, prompt token growth (which shows how much prefix is available for reuse), end-to-end latency, and the estimated number of prompt tokens served from cache. It also prints a Rich summary table. Save the following as reporter.py:
# filename: reporter.py
"""reporter.py: plot KV cache efficiency metrics from a simulated session."""
from __future__ import annotations

import os
from typing import List

import matplotlib
matplotlib.use("Agg")  # headless backend for servers
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np

try:
    from rich.console import Console
    from rich.table import Table
    _RICH = True
except ImportError:
    _RICH = False

from session_simulator import TurnRecord


def print_summary_table(records: List[TurnRecord]) -> None:
    # Fall back to plain printing when Rich is not installed.
    if not _RICH:
        for r in records:
            print(r)
        return
    console = Console()
    table = Table(title="KV Cache Efficiency - Multi-Turn Session", show_lines=True)
    table.add_column("Turn", justify="right")
    table.add_column("Prompt Tokens", justify="right")
    table.add_column("Completion Tokens", justify="right")
    table.add_column("Latency (s)", justify="right")
    table.add_column("Cache Hit Rate", justify="right")
    table.add_column("Cache Usage", justify="right")
    for r in records:
        hit_color = "green" if r.cache_hit_rate > 0.5 else "yellow" if r.cache_hit_rate > 0.2 else "red"
        table.add_row(
            str(r.turn),
            str(r.prompt_tokens),
            str(r.completion_tokens),
            f"{r.latency_s:.2f}",
            f"[{hit_color}]{r.cache_hit_rate:.3f}[/{hit_color}]",
            f"{r.cache_usage_perc:.3f}",
        )
    console.print(table)


def plot_session(records: List[TurnRecord], output_path: str = "cache_efficiency.png") -> str:
    turns = [r.turn for r in records]
    hit_rates = [r.cache_hit_rate for r in records]
    prompt_tokens = [r.prompt_tokens for r in records]
    latencies = [r.latency_s for r in records]

    fig = plt.figure(figsize=(12, 8))
    fig.suptitle("vLLM KV Cache Efficiency - Agentic Multi-Turn Session", fontsize=14, fontweight="bold")
    gs = gridspec.GridSpec(2, 2, figure=fig, hspace=0.4, wspace=0.35)

    # Panel 1: Cache hit rate over turns
    ax1 = fig.add_subplot(gs[0, 0])
    ax1.plot(turns, hit_rates, marker="o", color="steelblue", linewidth=2)
    ax1.axhline(0.5, color="orange", linestyle="--", linewidth=1, label="50% threshold")
    ax1.set_xlabel("Turn")
    ax1.set_ylabel("Cache Hit Rate")
    ax1.set_title("Prefix Cache Hit Rate per Turn")
    ax1.set_ylim(0, 1.05)
    ax1.legend(fontsize=8)
    ax1.grid(True, alpha=0.3)

    # Panel 2: Prompt token growth
    ax2 = fig.add_subplot(gs[0, 1])
    ax2.bar(turns, prompt_tokens, color="mediumseagreen", alpha=0.8)
    ax2.set_xlabel("Turn")
    ax2.set_ylabel("Prompt Tokens")
    ax2.set_title("Prompt Token Count (prefix growth)")
    ax2.grid(True, alpha=0.3, axis="y")

    # Panel 3: Latency over turns
    ax3 = fig.add_subplot(gs[1, 0])
    ax3.plot(turns, latencies, marker="s", color="tomato", linewidth=2)
    ax3.set_xlabel("Turn")
    ax3.set_ylabel("Latency (s)")
    ax3.set_title("End-to-End Request Latency")
    ax3.grid(True, alpha=0.3)

    # Panel 4: Estimated tokens saved by cache
    # (prompt tokens scaled by the server-wide hit rate, so an approximation)
    ax4 = fig.add_subplot(gs[1, 1])
    tokens_saved = [int(r.prompt_tokens * r.cache_hit_rate) for r in records]
    ax4.bar(turns, tokens_saved, color="mediumpurple", alpha=0.8)
    ax4.set_xlabel("Turn")
    ax4.set_ylabel("Tokens Saved")
    ax4.set_title("Estimated Prompt Tokens Served from Cache")
    ax4.grid(True, alpha=0.3, axis="y")

    plt.savefig(output_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return output_path
Step 5: Write the Main Entry Point
This script ties everything together. When a live vLLM server is not reachable, it falls back to synthetic data so you can verify the plotting pipeline locally without a GPU. Save the following as monitor_cache.py:
# filename: monitor_cache.py
"""monitor_cache.py: entry point for KV cache monitoring."""
from __future__ import annotations

import os
import sys
import random
import math

VLLM_HOST = os.environ.get("VLLM_HOST", "http://localhost:8000")
MODEL_ID = os.environ.get("MODEL_ID", "mistralai/Mistral-7B-Instruct-v0.3")


def _synthetic_records(n_turns: int = 8):
    """Generate plausible synthetic TurnRecords for offline testing."""
    from session_simulator import TurnRecord

    records = []
    base_tokens = 180  # system prompt tokens
    for i in range(1, n_turns + 1):
        prompt_tokens = base_tokens + i * 95 + random.randint(-10, 10)
        # Hit rate rises as the shared prefix grows, mirroring real agentic behaviour.
        hit_rate = min(0.95, 0.05 + 0.12 * i + random.uniform(-0.03, 0.03))
        # Latency drops as cache warms up (fewer prefill FLOPs).
        latency = max(0.4, 3.5 - hit_rate * 2.8 + random.uniform(-0.1, 0.1))
        records.append(TurnRecord(
            turn=i,
            prompt_tokens=prompt_tokens,
            completion_tokens=random.randint(60, 200),
            latency_s=round(latency, 3),
            cache_hit_rate=round(hit_rate, 4),
            cache_usage_perc=round(min(0.85, 0.05 * i + random.uniform(0, 0.02)), 4),
        ))
    return records


def _server_reachable(host: str) -> bool:
    import requests

    try:
        r = requests.get(f"{host}/health", timeout=3)
        return r.status_code == 200
    except Exception:
        return False


def main():
    from reporter import print_summary_table, plot_session

    if _server_reachable(VLLM_HOST):
        print(f"vLLM server reachable at {VLLM_HOST}. Running live session...")
        from session_simulator import run_session
        records = run_session(VLLM_HOST, MODEL_ID)
    else:
        print(
            f"vLLM server not reachable at {VLLM_HOST}. "
            "Using synthetic data for demonstration."
        )
        records = _synthetic_records(n_turns=8)
    print_summary_table(records)
    out = plot_session(records)
    print(f"Chart saved to: {out}")
    print("monitoring_complete")


if __name__ == "__main__":
    main()
Verify it Works
Run the monitoring script from the directory where you saved the five files. Because no vLLM server is running yet, it automatically uses synthetic data that mirrors the cache warm-up curve you would observe in a real agentic session. The chart is written to cache_efficiency.png in the same directory.
python monitor_cache.py
Verify the chart file was written:
ls -lh cache_efficiency.png
When you connect to a live GPU instance, start the server and run the monitor in two terminals:
# Terminal 1 — start vLLM
export HF_TOKEN=hf_...
export MODEL_ID=mistralai/Mistral-7B-Instruct-v0.3
bash launch_vllm.sh
# Terminal 2 — run the monitor once the server is healthy
python monitor_cache.py
Reading the Output
The four-panel chart shows:
- Prefix Cache Hit Rate per Turn: should climb from near 0 on turn 1 (cold cache) toward 0.7-0.9 by turn 5+ as the shared system prompt and conversation history fill the prefix cache. The metric is aggregated across all requests the server has handled, so it lags the true per-turn rate and is diluted by any other traffic on the same server. The Irminsul paper [1] reports recovery of up to 83% of prompt tokens above exact-prefix on agentic traffic, so values in this range are expected on a warm cache.
- Prompt Token Count: grows roughly linearly because each turn appends a user message and the assistant reply. A flat or slow-growing curve here means the conversation history is being truncated, and truncating early messages changes the prefix, so the cached blocks after that point no longer match.
- End-to-End Latency: should decrease as the hit rate rises. Cache misses force full prefill recomputation, which is the source of the 10-16 second TTFT spikes documented in [1].
- Tokens Served from Cache: the product of prompt tokens and hit rate. Because the hit rate is server-wide rather than per-request, this is an estimate of the prefill work the GPU skipped. At 63% prefill energy savings per cache hit [1], this panel translates directly to cost reduction.
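Since the hit-rate value is aggregated over the server's history, a per-turn figure has to be derived from differences between scrapes. The sketch below assumes your vLLM version exposes cumulative counters for prefix-cache queried tokens and hit tokens (an assumption; the names and availability depend on the release) and that one snapshot is taken before the first turn and one after every turn. per_turn_hit_rates is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def per_turn_hit_rates(
    snapshots: List[Tuple[float, float]],
) -> List[Optional[float]]:
    """Convert cumulative (queried_tokens, hit_tokens) pairs into per-turn rates.

    snapshots[0] is the baseline taken before turn 1; snapshots[i] is taken
    after turn i. A turn with no new queried tokens yields None.
    """
    rates: List[Optional[float]] = []
    for (q_prev, h_prev), (q_now, h_now) in zip(snapshots, snapshots[1:]):
        queried = q_now - q_prev
        hits = h_now - h_prev
        rates.append(hits / queried if queried > 0 else None)
    return rates


if __name__ == "__main__":
    # Hypothetical counter readings: baseline, then three turns.
    readings = [(0, 0), (200, 0), (500, 150), (900, 450)]
    for turn, rate in enumerate(per_turn_hit_rates(readings), start=1):
        print(f"turn {turn}: {rate}")
```

Multiplying each per-turn rate by that turn's prompt_tokens gives a tighter estimate for the fourth panel than the server-wide value, provided no other clients are using the server during the session.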
Troubleshooting
docker: Error response from daemon: could not select device driver "nvidia": The NVIDIA Container Toolkit is not installed or not configured. Follow the NVIDIA Container Toolkit installation guide and run sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker.
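CUDA out of memory during server startup: Reduce --gpu-memory-utilization to 0.80 or lower --max-model-len to 4096. On a 24 GB GPU at 0.90 utilization, vLLM may claim about 21.6 GB; Mistral-7B needs roughly 14 GB of that for weights, leaving about 7.6 GB to be shared between activation workspace and the KV cache pool.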
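Cache hit rate stays at 0.0 after several turns: Confirm --enable-prefix-caching is present in the Docker command. Sampling settings such as temperature do not affect prefix caching, so changing them will not help. Check that every turn resends the identical system prompt and earlier messages, since any change to earlier tokens invalidates the cached blocks after that point, and check the Docker logs for Loading model weights to verify the server is not restarting between requests. The simulator also records 0.0 when the scrape fails or the metric is missing, so confirm that /metrics actually contains vllm:gpu_prefix_cache_hit_rate.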
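/metrics returns 404: The OpenAI-compatible server normally serves /metrics on the same port as the API, so first confirm that VLLM_HOST points at that port rather than at a proxy or a different service. If the endpoint is still missing, upgrade to a current vllm/vllm-openai image.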
Latency does not decrease despite rising hit rate: The cache is warming but the bottleneck may be decode (token generation), not prefill. Lower max_tokens in the request payload (for example, pass max_tokens=64 to chat_completion) so decode contributes less to the measurement, or inspect vllm:time_to_first_token_seconds in the Prometheus output directly.
ModuleNotFoundError: No module named 'metrics_scraper': Run monitor_cache.py from the directory where you saved the five files, or add that directory to PYTHONPATH: PYTHONPATH=. python monitor_cache.py.
Next Steps
- Integrate with Grafana: scrape the vLLM /metrics endpoint with a Prometheus server and build a dashboard that alerts when gpu_prefix_cache_hit_rate drops below 0.4 for more than 60 seconds.
- Test with MLA-based models: DeepSeek-V2-Lite and Kimi Moonlight are the models evaluated in [1]. Their Multi-Head Latent Attention architecture makes content-addressed caching especially effective; swap MODEL_ID to one of these and compare hit-rate curves against a GQA model.
- Benchmark cache miss cost: add a --no-enable-prefix-caching run of the same session and compare TTFT distributions. This quantifies the latency penalty your users pay when cache eviction occurs under load.
- Extend to tool-call traces: replace USER_TURNS with real tool-call/result pairs from your agent framework. The system prompt plus tool schemas form a long shared prefix that is the primary beneficiary of prefix caching in production agentic deployments.
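For the cache-miss benchmark, the comparison between the two runs can be as simple as a few order statistics. This is a minimal sketch under the assumption that you have collected one list of TTFT samples (in seconds) per run; summarize and compare are hypothetical helpers, and the sample values in the usage block are made up.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Median, nearest-rank 95th percentile, and maximum of a latency sample."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


def compare(cached: Sequence[float], uncached: Sequence[float]) -> Dict[str, float]:
    """Extra seconds the uncached run costs at each statistic."""
    with_cache, without_cache = summarize(cached), summarize(uncached)
    return {key: without_cache[key] - with_cache[key] for key in with_cache}


if __name__ == "__main__":
    # Made-up TTFT samples for illustration only.
    cached_run = [0.21, 0.19, 0.25, 0.22, 0.30]
    uncached_run = [0.80, 0.95, 1.10, 0.90, 1.40]
    for key, delta in compare(cached_run, uncached_run).items():
        print(f"{key}: +{delta:.2f}s without prefix caching")
```

With only eight turns per session the tail statistics are noisy, so repeating the session several times per configuration before comparing is advisable.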