Grafana Ships AI Observability Preview and o11y-bench

What Grafana Shipped

Grafana shipped two releases within the same product cycle, and they share a common motivation: closing the gap between how engineering teams monitor conventional cloud-native workloads and how they monitor AI-driven systems. AI Observability in Grafana Cloud enters public preview as a managed product, while o11y-bench is published under an open-source license in the grafana/o11y-bench GitHub repository [1]. The two address separate but related problems. The cloud product gives operations teams runtime visibility into agents already running in production. The benchmark gives developers and researchers a structured way to measure how capable an agent is at observability tasks, before or during deployment [2].

AI Observability in Grafana Cloud: Core Capabilities

Traditional observability surfaces signals such as CPU usage, request latency, and error rates. Those signals do not indicate whether an agent is producing helpful outputs, hallucinating, or degrading quietly over time [2]. AI Observability in Grafana Cloud is designed to fill that gap by treating agent chats and sessions as first-class telemetry signals alongside conventional infrastructure metrics.

The product monitors agent sessions, individual tool calls, LLM spans, and quality signals in real time. It ingests OpenTelemetry-based traces from agentic systems, meaning teams that already instrument their services with OpenTelemetry can extend that pipeline to cover AI workloads without adopting a separate collection mechanism [2]. Continuous evaluation runs against agent outputs, with alerting configured for low-quality responses, policy violations, or anomalous behavior patterns. The feature originated as an internal Grafana hackathon project; the company productized it after hearing from customers facing similar monitoring challenges [2].
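To make the session, LLM span, and tool call hierarchy concrete, here is a minimal sketch in plain Python. It does not use the OpenTelemetry SDK or any Grafana API. The Span class, the span names, and the eval.quality_score attribute key are all hypothetical, chosen only to illustrate how a quality signal attached to a span could be scanned for low-scoring responses.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Plain-Python stand-in for a trace span (not the OpenTelemetry SDK type)."""
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def child(self, name, attributes=None):
        """Create a nested span and attach it under this one."""
        span = Span(name, dict(attributes or {}))
        self.children.append(span)
        return span

    def walk(self):
        """Yield this span and every descendant, depth first."""
        yield self
        for c in self.children:
            yield from c.walk()


# Hypothetical attribute key; the product's real quality-signal schema
# is not described in the source material.
QUALITY_KEY = "eval.quality_score"


def low_quality_spans(root, threshold=0.5):
    """Return spans that carry a quality score below the threshold."""
    return [
        s for s in root.walk()
        if QUALITY_KEY in s.attributes and s.attributes[QUALITY_KEY] < threshold
    ]


# One agent session containing two LLM spans and one tool call.
session = Span("agent.session", {"session.id": "demo-1"})
plan = session.child("llm.plan", {QUALITY_KEY: 0.92})
plan.child("tool.query_metrics", {"tool.name": "query_metrics"})
session.child("llm.answer", {QUALITY_KEY: 0.31})

flagged = [s.name for s in low_quality_spans(session)]
print(flagged)  # ['llm.answer']
```

In a real pipeline the spans would be emitted by OpenTelemetry instrumentation and the evaluation and alerting would run in the managed product. The sketch only shows why nesting matters: a flagged LLM span can be traced back to the session and tool calls around it.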

How o11y-bench Works

o11y-bench is built on Harbor, an open-source framework from the creators of Terminal Bench that standardizes environments for benchmarking agents against focused task sets [1]. The benchmark runs agents against a live Grafana stack that has access to the Grafana MCP server, providing the same tool surface an agent would encounter in a real deployment.

Graded task categories cover the workflows that arise most frequently in practice: querying metrics, logs, and traces; investigating incidents; and making targeted dashboard changes [1]. Scoring is not limited to syntactic correctness. A query can be syntactically valid and still select the wrong series; a dashboard can render without errors and still display misleading data. The benchmark is designed to catch those failure modes by evaluating outcomes against the actual state of the Grafana environment after the agent acts [1].
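The difference between checking syntax and checking outcomes can be sketched as follows. This is an illustrative grader, not o11y-bench's actual scoring code: the function names and the idea of comparing returned label sets are assumptions made for the example. It grades the series a query returned, never the query text, so a valid query that selects the wrong series still fails.

```python
def series_key(labels):
    """Make a label set hashable so series can be compared as sets."""
    return frozenset(labels.items())


def _as_sorted_dicts(keys):
    """Turn hashable label sets back into dicts in a stable order."""
    return sorted((dict(k) for k in keys), key=lambda d: sorted(d.items()))


def grade_query_outcome(returned, expected):
    """Compare the series a query returned with the series the task called for.

    Both arguments are lists of label dicts. The query string is never
    inspected, so differently written queries that select the same series
    receive the same grade. (Hypothetical grader, for illustration only.)
    """
    got = {series_key(s) for s in returned}
    want = {series_key(s) for s in expected}
    return {
        "passed": got == want,
        "missing": _as_sorted_dicts(want - got),
        "unexpected": _as_sorted_dicts(got - want),
    }


expected = [{"job": "api", "instance": "a"}, {"job": "api", "instance": "b"}]

# Result of a query that selected the right series, in a different order.
right = [{"instance": "b", "job": "api"}, {"instance": "a", "job": "api"}]

# Result of a syntactically valid query that selected the wrong job.
wrong = [{"job": "web", "instance": "a"}]

print(grade_query_outcome(right, expected)["passed"])
print(grade_query_outcome(wrong, expected))
```

Comparing label sets is only one possible outcome check. A fuller grader would presumably also compare values and time ranges, which this sketch leaves out.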

Why Observability Tasks Require a Dedicated Benchmark

Generic coding benchmarks do not capture the complexity of observability work. Root-cause investigations and dashboard creation depend on the interaction between large volumes of metrics, logs, traces, time ranges, and saved application state [1]. That combination of variables makes it difficult to determine, using standard pass-fail criteria, whether an agent completed a task correctly.

The domain also requires judgment calls that go beyond tool invocation. In a real incident, the difficult step is rarely writing a query. It is deciding which signal matters, determining whether a spike represents noise or a genuine symptom, and correlating data across multiple telemetry types simultaneously [1]. Stateful operations such as dashboard edits add another layer of complexity, because a change that renders correctly may still break a view that another engineer depends on. These properties collectively justify a benchmark scoped specifically to observability workflows rather than adapted from general software-engineering evaluations.
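The risk with stateful dashboard edits can be illustrated by comparing dashboard state before and after an agent acts. The sketch below assumes a simplified dashboard dictionary with a panels list whose entries have an id; the helper names are hypothetical and this is not presented as how o11y-bench grades dashboard tasks.

```python
import copy


def changed_panels(before, after):
    """Return the ids of panels that were added, removed, or modified."""
    b = {p["id"]: p for p in before.get("panels", [])}
    a = {p["id"]: p for p in after.get("panels", [])}
    return sorted(pid for pid in b.keys() | a.keys() if b.get(pid) != a.get(pid))


def edit_is_scoped(before, after, target_ids):
    """True when every changed panel is one the task asked the agent to touch."""
    return set(changed_panels(before, after)) <= set(target_ids)


# Simplified dashboard state (assumed shape, trimmed for the example).
before = {
    "title": "Checkout",
    "panels": [
        {"id": 1, "title": "Error rate", "targets": [{"expr": "rate(http_errors_total[5m])"}]},
        {"id": 2, "title": "Request rate", "targets": [{"expr": "rate(http_requests_total[5m])"}]},
    ],
}

# Edit A: only panel 1 is changed, as the task requested.
scoped = copy.deepcopy(before)
scoped["panels"][0]["targets"][0]["expr"] = "rate(http_errors_total[1m])"

# Edit B: panel 1 is changed, but panel 2 is also dropped along the way.
broad = copy.deepcopy(scoped)
del broad["panels"][1]

print(changed_panels(before, scoped), edit_is_scoped(before, scoped, {1}))
print(changed_panels(before, broad), edit_is_scoped(before, broad, {1}))
```

Both edited dashboards would still render, which is why a render check alone cannot distinguish them. Only the comparison against prior state shows that the second edit removed a panel someone else may rely on.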

Who These Tools Target

AI Observability in Grafana Cloud is aimed primarily at platform engineering and SRE teams that are already running agents in production and need runtime visibility into agent behavior, output quality, and failure modes [2]. The product fits into existing Grafana Cloud deployments and extends current telemetry pipelines through OpenTelemetry ingestion.

o11y-bench addresses a different audience. Its primary users are researchers evaluating model capabilities on domain-specific tasks, developers building observability-focused AI agents, and teams conducting pre-deployment evaluations of agent quality [1]. Because it runs against a real Grafana stack instead of a simulated environment, results are intended to reflect performance on actual production tooling rather than synthetic proxies.

FAQ

Q. Does AI Observability in Grafana Cloud require replacing an existing OpenTelemetry pipeline? No. The product ingests OpenTelemetry-based traces from agentic systems, so teams already using OpenTelemetry instrumentation can extend their existing pipelines to cover AI workloads [2]. A separate collection mechanism is not required.

Q. What Grafana version or plan is needed to access the public preview? Grafana has not published specific plan or version prerequisites in the available release materials. The feature is described as available in Grafana Cloud, and documentation is linked from the official Grafana Cloud machine-learning section [2]. Teams should consult that documentation for current access requirements.

Q. Can o11y-bench be run against a self-hosted Grafana instance, or does it require Grafana Cloud? The benchmark runs agents against a real Grafana stack using the Grafana MCP server [1], which points to a design built around a live stack rather than a cloud-only dependency. The available materials do not spell out the environment setup, so the repository documentation is the place to confirm the specific instructions.

Q. How does o11y-bench handle tasks where multiple correct answers exist, such as equivalent but differently structured queries? The benchmark grades outcomes against the actual state of the Grafana environment after the agent acts, rather than checking for a single expected string [1]. This approach is intended to accommodate equivalent solutions, though the precise scoring rubric for each task category is detailed in the repository.

Q. Is o11y-bench compatible with agents built on any framework, or does it assume a specific tool-calling interface? The benchmark is built on the Harbor framework and uses the Grafana MCP server as the tool interface [1]. Any agent capable of interacting with an MCP server can be evaluated, which covers a broad range of agent frameworks without requiring framework-specific adapters.

Key Takeaways

AI Observability in Grafana Cloud is available now in public preview and monitors agent sessions, tool calls, LLM spans, and quality signals by ingesting OpenTelemetry-based traces alongside conventional telemetry [2].
o11y-bench is open-source, built on the Harbor framework, and runs agents against a live Grafana stack with Grafana MCP server access to grade performance on incident investigation, metric querying, and dashboard editing tasks [1].
Generic coding benchmarks do not capture the judgment-intensive, stateful nature of observability work, which is the technical rationale for a domain-specific evaluation tool [1].
Platform and SRE teams running agents in production are the primary audience for the Grafana Cloud product, while researchers and agent developers are the primary audience for the benchmark.
Teams can access the public preview through Grafana Cloud and find the benchmark at the grafana/o11y-bench GitHub repository [1][2].