Observability for RAG Pipelines: The Metrics That Actually Matter in Production
You have a RAG pipeline running in production. You have Grafana dashboards showing CPU utilization, memory consumption, and request throughput. Your vector store reports average query latency of 45ms. Everything looks green. And yet, users are telling you the answers are getting worse.
This is the observability gap that plagues most production RAG deployments. The metrics you are tracking tell you whether the system is running. They do not tell you whether the system is working. There is a fundamental difference between infrastructure health ("is the vector store responding?") and retrieval quality ("is the vector store returning the right things?"), and most teams only measure the former.
Building effective observability for RAG pipelines requires a different mindset. You need metrics that serve as leading indicators of answer quality, not just trailing indicators of system health. This post covers the specific metrics, dashboard designs, and alerting strategies that we have seen work in production deployments processing millions of queries per month.
Infrastructure metrics vs. retrieval quality metrics
Let us start by clearly separating two categories of metrics, because conflating them is the root cause of most RAG observability failures.
Infrastructure metrics tell you about the operational health of your system components: vector store latency, embedding API throughput, database connection pool utilization, memory consumption, CPU usage, disk I/O. These are necessary but not sufficient. They answer the question "is the system able to process requests?" but not "is the system producing good results?"
Retrieval quality metrics tell you about the semantic quality of what your pipeline is producing: relevance of retrieved chunks, coverage of user queries, consistency of results over time, and the rate at which the system fails to find relevant information. These metrics are harder to compute, harder to interpret, and far more valuable for predicting user satisfaction.
The mistake teams make is assuming that healthy infrastructure implies healthy retrieval. A vector store can respond in 20ms with completely irrelevant results. An embedding API can process 1,000 queries per second and still produce embeddings that fail to capture the semantic nuances of your domain. Your infrastructure dashboard will be green while your users are frustrated.
The core retrieval metrics you should be tracking
Based on operational experience across dozens of production RAG deployments, these are the retrieval quality metrics that most reliably predict user-facing quality issues:
Retrieval latency percentiles (p50 / p95 / p99): Not just average latency - the tail matters enormously. A p50 of 50ms with a p99 of 2,000ms means one in a hundred users is waiting two full seconds just for retrieval, before generation even starts. Track these percentiles over time and alert on p95/p99 regressions, not average latency. A shift in the p95 from 200ms to 500ms often indicates an index that needs optimization, a query pattern that is triggering expensive scans, or a data volume that has crossed a performance threshold. The OpenTelemetry metrics data model provides a solid foundation for capturing histogram-based latency distributions.
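As a concrete starting point, here is a minimal sketch of recording retrieval latency as an OpenTelemetry histogram in Python. The `vector_store.search` call is a hypothetical stand-in for your actual client, and the sketch assumes an SDK meter provider and exporter are configured elsewhere at application startup:

```python
import time

from opentelemetry import metrics

# Assumes an OpenTelemetry SDK MeterProvider + exporter is configured at startup.
meter = metrics.get_meter("rag.pipeline")
retrieval_latency = meter.create_histogram(
    name="rag.retrieval.latency",
    unit="ms",
    description="End-to-end vector store retrieval latency",
)

def timed_retrieve(vector_store, query: str, top_k: int = 5):
    start = time.perf_counter()
    results = vector_store.search(query, top_k=top_k)  # hypothetical client call
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Histograms preserve the full distribution, so p50/p95/p99 are derived
    # in the metrics backend rather than pre-aggregated here.
    retrieval_latency.record(elapsed_ms, attributes={"top_k": top_k})
    return results
```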
Similarity score distributions: Track the distribution of top-k similarity scores across all queries. The median top-1 similarity score tells you how well your embeddings are matching queries to documents in general. More importantly, track the gap between top-1 and top-k scores. A large gap (e.g., top-1 at 0.92, top-5 at 0.61) suggests your retrievals are dominated by a single strong match. A small gap (top-1 at 0.75, top-5 at 0.72) suggests the retriever is not strongly differentiating between relevant and irrelevant results - a sign of embedding quality issues or a noisy corpus.
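A small helper for recording that gap per query might look like the following sketch; it assumes `scores` arrives sorted highest-first, as most vector store clients return it:

```python
from opentelemetry import metrics

meter = metrics.get_meter("rag.pipeline")
top1_score = meter.create_histogram("rag.retrieval.top1_similarity")
score_gap = meter.create_histogram("rag.retrieval.top1_topk_gap")

def record_similarity_stats(scores: list[float]) -> None:
    """Record top-1 similarity and the top-1 vs. top-k gap for one query.

    `scores` is assumed sorted descending (top-1 first).
    """
    if not scores:
        return  # empty retrievals are tracked by the "no results" counter instead
    top1_score.record(scores[0])
    # Large gap: one dominant match. Small gap: the retriever is not
    # differentiating strongly between relevant and irrelevant results.
    score_gap.record(scores[0] - scores[-1])
```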
"No results" rate: The percentage of queries that return zero results above your similarity threshold. This is one of the most important metrics in your entire dashboard. A "no results" rate above 5% means a meaningful fraction of your users are getting responses based on zero retrieved context - which means the agent is either hallucinating or returning a generic "I don't know" response. Track this rate by query category if possible. A sudden spike in "no results" for a specific topic often means a knowledge gap: content that should be in your corpus is missing, has been deleted, or has been re-indexed in a way that changed its embedding.
Query volume over time: Simple, but essential. Track total query volume, broken down by tenant, user segment, query category, or any other dimension that is meaningful for your application. Volume trends help you forecast capacity needs, but more importantly, sudden changes in query volume patterns often signal upstream issues. A 50% drop in queries from a specific tenant might mean their integration is broken. A spike in queries with certain keywords might indicate a content gap that is driving users to ask the same question repeatedly.
Memory operation counts by type: If your agent system includes memory (episodic, semantic, procedural), track the volume of memory reads, writes, and deletes over time. A healthy agent system shows a steady ratio of reads to writes. A sudden spike in writes might indicate a runaway memory creation loop. A drop in reads might mean memory is not being utilized for personalization. Track these by memory category to understand which types of memory are most actively used.
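The instrumentation pattern is the same as above: one counter with the operation type and memory category as attributes, so read/write ratios can be derived per category in the backend. A sketch:

```python
from opentelemetry import metrics

meter = metrics.get_meter("rag.memory")
memory_ops = meter.create_counter("rag.memory.operations")

def record_memory_op(op: str, category: str) -> None:
    # op: "read" | "write" | "delete"
    # category: "episodic" | "semantic" | "procedural"
    memory_ops.add(1, attributes={"op": op, "category": category})

# A read/write ratio that drifts sharply from its baseline - say, a write
# spike from a runaway creation loop - then surfaces as a single ratio
# query on your dashboard.
```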
Entity and edge growth rates: For knowledge graph-backed RAG systems, track the rate at which new entities and relationships are being added. Healthy knowledge graphs grow steadily as new information is ingested. A plateau might indicate that your entity extraction pipeline has stalled. A sudden spike might indicate duplicate entity creation. A decline might mean your ingestion pipeline is failing silently.
Building dashboards that tell a story
Raw metrics on a dashboard are not observability. Observability is the ability to understand system behavior from its outputs. Your dashboards need to tell a story that helps operators quickly assess system health and identify problems.
The most effective RAG observability dashboard layout we have seen uses a three-tier structure:
- Tier 1 - Health Score (top of dashboard): A single composite score (0-100) that summarizes overall RAG pipeline health. This score is computed from weighted contributions of the key metrics: retrieval latency p95 within threshold (20% weight), "no results" rate below threshold (25% weight), similarity score distribution within expected range (25% weight), memory operation ratios normal (15% weight), and error rate below threshold (15% weight). The health score gives operators a single number to glance at; a sketch of the computation follows this list. Green means everything is within normal parameters. Yellow means one or more metrics are trending toward their thresholds. Red means something needs immediate attention.
- Tier 2 - Key Metrics (middle of dashboard): Time-series charts for each of the core metrics described above. Each chart shows the current value, the 7-day trend, and the alerting threshold. Operators who see a yellow or red health score can quickly scan Tier 2 to identify which specific metric is causing the degradation.
- Tier 3 - Drill-Down (bottom of dashboard, linked panels): Detailed views that operators navigate to when investigating a specific metric. For example, clicking on the "no results" rate chart opens a drill-down showing the specific queries that returned no results, grouped by topic or keyword pattern. This is where operators transition from "something is wrong" to "here is what is wrong and here is how to fix it."
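Here is a minimal sketch of the Tier 1 health score. The weights are the ones listed above; the normalization helper and example thresholds are illustrative assumptions, not a fixed formula:

```python
WEIGHTS = {
    "latency_p95": 0.20,
    "no_results_rate": 0.25,
    "similarity_dist": 0.25,
    "memory_ratio": 0.15,
    "error_rate": 0.15,
}

def component_score(value: float, warn: float, crit: float) -> float:
    """Map a metric to 1.0 below its warning threshold, 0.0 at or past its
    critical threshold, and linear in between."""
    if value <= warn:
        return 1.0
    if value >= crit:
        return 0.0
    return (crit - value) / (crit - warn)

def health_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-component scores, scaled to 0-100."""
    return 100 * sum(WEIGHTS[name] * s for name, s in scores.items())

# Example: p95 latency slightly elevated, everything else healthy.
scores = {
    "latency_p95": component_score(620, warn=500, crit=1000),        # ms -> 0.76
    "no_results_rate": component_score(0.03, warn=0.05, crit=0.10),  # -> 1.0
    "similarity_dist": 1.0,  # within expected range
    "memory_ratio": 1.0,     # read/write ratios normal
    "error_rate": component_score(0.004, warn=0.01, crit=0.05),      # -> 1.0
}
print(round(health_score(scores)))  # 95: still green, but latency is eating margin
```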
Alerting thresholds that reduce noise
Alerting is where most observability setups fail. Either the thresholds are too tight and the team is drowning in false positives, or they are too loose and real issues go unnoticed for hours. RAG-specific alerting requires a nuanced approach because retrieval quality metrics are inherently noisier than infrastructure metrics.
Static thresholds work for infrastructure metrics. If your vector store p99 latency exceeds 1 second, that is an alert. But for retrieval quality metrics, static thresholds are brittle. The "normal" similarity score distribution varies by query type, time of day, and corpus composition. A better approach is anomaly-based alerting: establish a rolling baseline for each metric over the past 7-14 days, and alert when the current value deviates by more than N standard deviations from the baseline.
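A minimal sketch of that approach, assuming you can pull a metric's recent samples (hourly or daily aggregates) from your metrics backend:

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], current: float, n_sigma: float = 3.0) -> bool:
    """Alert when `current` deviates more than n_sigma standard deviations
    from the rolling baseline (e.g., the past 7-14 days of samples)."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > n_sigma * spread

# e.g., daily "no results" rates over the past week vs. today's value:
past_week = [0.021, 0.019, 0.024, 0.020, 0.022, 0.018, 0.023]
print(is_anomalous(past_week, current=0.041))  # True: well outside the baseline
```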
Specific thresholds we recommend as starting points (to be calibrated per deployment):
- Retrieval latency p95 exceeding 500ms for more than 5 minutes: warning. Exceeding 1,000ms for more than 2 minutes: critical.
- "No results" rate exceeding 5% over a 15-minute window: warning. Exceeding 10% over 5 minutes: critical.
- Median top-1 similarity score dropping below 0.70 over a 30-minute window: warning.
- Memory write volume increasing by more than 300% compared to the 7-day rolling average: warning (potential runaway loop).
> "We were tracking vector store latency and throughput religiously, and everything looked fine. It was not until we started tracking the 'no results' rate that we discovered 12% of our queries were returning empty results. The agent was handling these gracefully - generating plausible-sounding answers with no retrieval context - so users were getting confident wrong answers instead of 'I don't know.' Tracking that single metric exposed a class of silent failures we had no idea existed."
Correlating metrics with user outcomes
The ultimate validation of your observability setup is whether your metrics correlate with user outcomes. If your retrieval quality metrics say everything is fine but users are reporting poor answers, your metrics are measuring the wrong things. If your metrics degrade and users are happy, your thresholds are miscalibrated.
Building this correlation requires connecting your RAG metrics to user feedback signals: thumbs up/down ratings, support ticket volume, conversation abandonment rates, and follow-up query rates (users who immediately rephrase their question are likely unsatisfied with the first answer). For teams that have invested in systematic retrieval evaluation with recall, precision, and MRR metrics, you can also correlate production metrics against offline evaluation scores to calibrate your thresholds.
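One simple way to run that check, sketched with illustrative numbers: line up a production metric and a feedback signal per day and compute the correlation coefficient (`statistics.correlation` requires Python 3.10+):

```python
from statistics import correlation  # Python 3.10+

# Illustrative daily values: does the "no results" rate track user pain?
no_results_rate = [0.02, 0.03, 0.08, 0.04, 0.11, 0.05, 0.02]
thumbs_down_rate = [0.05, 0.06, 0.14, 0.07, 0.19, 0.09, 0.05]

r = correlation(no_results_rate, thumbs_down_rate)
print(f"Pearson r = {r:.2f}")  # near 1.0 here: the metric predicts dissatisfaction
```

A metric that shows no correlation with any feedback signal is a candidate for removal from your health score; a metric with strong correlation deserves tighter thresholds.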
The goal is a closed loop: production metrics predict user outcomes, user outcomes validate production metrics, and the feedback between them continuously improves both your alerting thresholds and your pipeline quality.
Observability for multi-tenant deployments
If your RAG pipeline serves multiple tenants, you need per-tenant metric breakdowns. Aggregate metrics can hide tenant-specific problems. If one large tenant is performing well and nine small tenants are degraded, the aggregate metrics will look acceptable while 90% of your tenants are having a poor experience.
Track every core metric at both the aggregate and per-tenant level. Alert on per-tenant deviations as well as aggregate deviations. And provide tenant-specific dashboard views so that customer success teams can monitor the health of specific accounts without requiring engineering support.
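Instrumentation-wise this is one attribute, not a parallel metrics stack: tag every measurement with the tenant and let the backend aggregate or slice. A sketch (tenant IDs illustrative):

```python
from opentelemetry import metrics

meter = metrics.get_meter("rag.pipeline")
queries_total = meter.create_counter("rag.queries.total")

def record_query(tenant_id: str, category: str) -> None:
    # One counter, two dimensions: aggregate views sum across tenants,
    # while per-tenant dashboards and alerts filter on tenant_id.
    queries_total.add(1, attributes={"tenant_id": tenant_id, "category": category})

record_query("tenant-acme", category="billing")
```

Be mindful of attribute cardinality: hundreds of tenant label values are comfortable for most metrics backends, but unbounded cardinality gets expensive quickly.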
Per-tenant observability also feeds into capacity planning. Understanding which tenants generate the most retrieval volume, the most memory operations, and the most complex queries helps you make informed decisions about resource allocation and pricing tier design. For a deeper discussion of multi-tenant architecture considerations, see our post on multi-tenant RAG access control and governance.
The operational maturity curve
Not every team needs every metric from day one. RAG observability maturity typically progresses through three stages:
- Stage 1 - Infrastructure basics: Latency percentiles, error rates, throughput, and resource utilization. This is table stakes. If you are not here yet, start here.
- Stage 2 - Retrieval quality: Similarity score distributions, "no results" rates, memory operation patterns, entity growth rates. This is where most teams start catching quality issues before users report them.
- Stage 3 - Outcome correlation: Linking retrieval metrics to user feedback, building composite health scores, implementing anomaly-based alerting, and closing the feedback loop between production metrics and offline evaluation. This is where observability becomes a competitive advantage.
The Google SRE handbook on monitoring distributed systems provides an excellent framework for thinking about monitoring maturity. The principles of focusing on symptoms over causes, and using alerting for conditions that require human action, apply directly to RAG observability.
How TypeGraph approaches RAG observability
TypeGraph provides built-in observability that covers all three maturity stages out of the box. Every retrieval query, memory operation, and agent run is instrumented with both infrastructure metrics (latency, throughput, error rates) and retrieval quality metrics (similarity score distributions, "no results" rates, memory operation patterns). The dashboard surfaces a composite health score with drill-down views for each metric, and per-tenant breakdowns are available for multi-tenant deployments. Alerting thresholds are configurable and support both static and anomaly-based detection.
Running a RAG pipeline in production without retrieval quality observability is like driving at night with your headlights off. You might get away with it for a while, but when something goes wrong - and it will - you will not see it coming until it is too late.