Grafana Dashboards for Web Performance
A performance dashboard exists to answer one question under pressure: did the experience our users actually got just get worse, and where? Grafana is the open layer where self-hosted Real-User Monitoring (RUM) telemetry becomes a shared operational view — but only if the panels compute the right statistic, draw the right thresholds, and slice by the cohorts that move when a regression ships. This page is part of RUM Architecture, Tooling & Self-Hosting, and it assumes you already have field events landing in a store: typically a self-hosted RUM pipeline backed by ClickHouse or a Prometheus histogram fed from an OpenTelemetry collector for web RUM. Here we turn that store into dashboards that an SRE can trust at 3 a.m.
The single hardest thing to get right is the aggregation. RUM distributions are heavily right-skewed: a handful of slow sessions sit far out in the tail while the median looks healthy. The headline number for every Core Web Vital must therefore be the 75th percentile (p75) — the same statistic Google uses for field-data assessment — never the mean. The diagram below shows the path a single metric takes from the source table to a threshold-banded panel, which is the spine of everything that follows.
Threshold configuration
Every panel and alert is anchored to the current Google thresholds. Encode these once as Grafana threshold steps and reuse them everywhere — the band colours on a panel and the comparison value in an alert rule must come from the same constants, or your dashboard and your pager will disagree.
| Metric | Good (p75) | Needs Improvement (p75) | Poor (p75) | Engineering action when in NI/Poor |
|---|---|---|---|---|
| LCP | ≤ 2.5 s | > 2.5 s and ≤ 4.0 s | > 4.0 s | Trace the LCP element; check fetchpriority, preload, render-blocking CSS |
| INP | ≤ 200 ms | > 200 ms and ≤ 500 ms | > 500 ms | Break up long tasks, yield to the scheduler, defer hydration |
| CLS | ≤ 0.1 | > 0.1 and ≤ 0.25 | > 0.25 | Reserve space for media/ads, fix late-loading fonts |
| FCP | ≤ 1.8 s | > 1.8 s and ≤ 3.0 s | > 3.0 s | Reduce render-blocking resources, inline critical CSS |
| TTFB | ≤ 800 ms | > 800 ms and ≤ 1.8 s | > 1.8 s | Edge-cache HTML, cut server work and redirects |
A practical rule: CLS is a unitless score, so it stays raw, but LCP, INP, FCP, and TTFB are durations. Decide on one unit per metric at the storage layer (seconds for LCP/FCP/TTFB and milliseconds for INP is a common choice) and never mix them in a single panel. Grafana’s threshold steps are absolute numbers: a panel that stores LCP in milliseconds but bands it at 2.5 will paint everything red forever, and one that stores seconds but bands it at 2500 will stay green through any regression.
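One way to keep band colours and alert comparisons on the same constants is to generate both from a single table. The sketch below is only an illustration, not part of any Grafana API: the names THRESHOLDS, classify, and grafana_steps are hypothetical, and it assumes seconds for LCP/FCP/TTFB and milliseconds for INP.

```python
# Hypothetical single source of truth for the Good / Needs Improvement bounds.
# Units are an assumption: seconds for LCP/FCP/TTFB, milliseconds for INP,
# and a unitless score for CLS.
THRESHOLDS = {
    "LCP": (2.5, 4.0),
    "INP": (200, 500),
    "CLS": (0.1, 0.25),
    "FCP": (1.8, 3.0),
    "TTFB": (0.8, 1.8),
}


def classify(metric: str, p75: float) -> str:
    """Return the band a p75 value falls in, using inclusive upper bounds."""
    good, needs_improvement = THRESHOLDS[metric]
    if p75 <= good:
        return "good"
    if p75 <= needs_improvement:
        return "needs-improvement"
    return "poor"


def grafana_steps(metric: str) -> list[dict]:
    """Build absolute threshold steps for a panel from the same constants."""
    good, needs_improvement = THRESHOLDS[metric]
    return [
        {"color": "green", "value": None},  # base step, serialised as null
        {"color": "orange", "value": good},
        {"color": "red", "value": needs_improvement},
    ]


print(classify("LCP", 2.3))   # good
print(classify("LCP", 2300))  # poor: milliseconds fed into a seconds band
print(grafana_steps("INP"))
```

The second call shows the unit trap directly: the same 2.3-second page view classified in the wrong unit lands in the Poor band.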
Computing p75 over the source
The aggregation query is where most dashboards quietly go wrong. There are two distinct families depending on your backend, and they compute percentiles in fundamentally different ways.
ClickHouse: exact-ish quantiles over raw rows
If your beacons land as one row per page view (the typical shape of a self-hosted ClickHouse pipeline), you compute the percentile directly from the values with quantile() (a reservoir-sampling estimate) or quantileExact() (precise, more memory). Group by a time bucket so Grafana can plot a series, and expose cohort columns for the dashboard variables to filter on.
-- p75 LCP per 5-minute bucket, filterable by device, country, and route.
-- $__timeFilter and ${var} are Grafana macros expanded at query time.
SELECT
    toStartOfInterval(event_time, INTERVAL 5 MINUTE) AS t,
    quantile(0.75)(lcp_ms) / 1000.0 AS p75_lcp_s,
    quantile(0.75)(inp_ms) AS p75_inp_ms,
    quantile(0.75)(cls_value) AS p75_cls,
    count() AS samples
FROM rum_events
WHERE $__timeFilter(event_time)
  AND device_class IN (${device:sqlstring})
  AND country IN (${country:sqlstring})
  AND route IN (${route:sqlstring})
GROUP BY t
HAVING samples >= 50  -- suppress noisy low-traffic buckets
ORDER BY t
The HAVING samples >= 50 guard matters: a p75 computed from eight sessions is meaningless and will produce alarming spikes during low-traffic windows. Surface the samples count as its own panel so anyone reading a regression can immediately see whether it rests on real volume. This is also where your sampling strategy and p75 aggregation choices become visible — if you head-sample at the beacon, the percentile is computed over the retained subset, and the minimum-sample threshold should be set with that retention rate in mind.
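To see why the mean misleads and why the sample floor matters, here is a minimal sketch on made-up data. It uses a simple nearest-rank percentile, which is an assumption for illustration and not the estimator ClickHouse uses; the function names and the sample values are hypothetical.

```python
import math
import statistics


def p75(values):
    """Nearest-rank 75th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(0.75 * len(ordered))  # 1-based rank of the p75 sample
    return ordered[rank - 1]


def guarded_p75(values, min_samples=50):
    """Return p75, or None when the bucket holds too few samples to trust."""
    if len(values) < min_samples:
        return None
    return p75(values)


# Hypothetical LCP samples in seconds: mostly fast, with a slow tail.
lcp = [1.5] * 70 + [3.0] * 25 + [12.0] * 5

print(statistics.mean(lcp))    # 2.4, inside the Good band
print(statistics.median(lcp))  # 1.5
print(guarded_p75(lcp))        # 3.0, Needs Improvement
print(guarded_p75(lcp[:8]))    # None: eight sessions are below the floor
```

On this data the mean and the median both sit in the Good band while the p75 does not, which is the gap a mean-based panel hides.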
Prometheus: quantiles from histogram buckets
If your OpenTelemetry collector exports metrics as Prometheus histograms (one _bucket series per le boundary), you cannot compute an exact percentile — you interpolate within buckets using histogram_quantile. The accuracy is entirely bounded by your bucket layout, so define explicit boundaries around the threshold values (e.g. an LCP bucket at exactly 2.5 s) rather than relying on defaults.
# p75 LCP in seconds, by route, from a classic histogram
# (one _bucket series per le boundary).
histogram_quantile(
  0.75,
  sum by (le, route) (
    rate(rum_lcp_seconds_bucket{
      device_class=~"$device", country=~"$country"
    }[$__rate_interval])
  )
)
Use $__rate_interval rather than a hard-coded [5m] so the rate window tracks both the scrape interval and the dashboard’s resolution. A fixed window that covers fewer than two scrapes returns no data, and one that is shorter than the query step silently skips samples when someone zooms out to 7 days.
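The dependence on bucket layout is easier to see with the interpolation written out. The following is a simplified sketch of the linear-interpolation idea behind histogram_quantile, not Prometheus's full implementation: it assumes cumulative counts sorted by le, a final +Inf bucket, and a lowest bucket that starts at 0. The bucket layouts and counts are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (le, count) pairs sorted by le."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # The quantile falls in the +Inf bucket, so the best
                # available answer is the highest finite boundary.
                return prev_le
            # Linear interpolation inside the bucket containing the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan


# The same 100 hypothetical page views, recorded with two bucket layouts.
with_threshold_edges = [(1.0, 40), (2.5, 70), (4.0, 90), (math.inf, 100)]
coarse = [(1.0, 40), (5.0, 90), (math.inf, 100)]

print(histogram_quantile(0.75, with_threshold_edges))  # 2.875
print(histogram_quantile(0.75, coarse))                # about 3.8
```

With boundaries at 2.5 and 4.0 the estimate can only wander inside the Needs Improvement band; with one wide bucket from 1.0 to 5.0 the same traffic reads almost a full second higher.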
Panel and query reference
These are the panels that earn their place on a Core Web Vitals dashboard. Build them once, save them in a library, and reuse them per metric.
| Panel | Visualization | Query shape | Thresholds shown | Reads as |
|---|---|---|---|---|
| p75 trend per metric | Time series | quantile(0.75) / histogram_quantile(0.75) grouped by time | Good/NI/Poor bands | Is it getting worse over time? |
| Sample volume | Time series (bars) | count() per bucket | none | Can I trust the p75 above? |
| Current p75 | Stat | last bucket of the p75 query | colour by band | What is the number right now? |
| Cohort breakdown | Table | p75 grouped by route / device / country | per-cell colour | Which cohort is dragging the aggregate? |
| Distribution | Heatmap | bucketed counts over time | none | Where is the tail moving? |
| Good-rate gauge | Gauge | countIf(lcp_ms <= 2500) / count() | 75% / 90% targets | What share of users get a Good experience? |
The cohort breakdown table is the most under-used panel and the most valuable during an incident. An aggregate p75 LCP that drifts from 2.3 s to 2.9 s tells you something broke; the table that shows the regression is isolated to device_class = "low-end-android" on route = /checkout tells you what broke.
Cohort variables
Dashboard variables turn one static dashboard into an investigative tool. The three that pay for themselves immediately are device class, country, and route. Define them as query variables so they populate from the data and stay current as new routes ship.
# Grafana dashboard variables (ClickHouse data source)
$device → SELECT DISTINCT device_class FROM rum_events WHERE $__timeFilter(event_time)
$country → SELECT DISTINCT country FROM rum_events WHERE $__timeFilter(event_time)
$route → SELECT DISTINCT route FROM rum_events WHERE $__timeFilter(event_time)
Mark each variable Multi-value and Include All, and reference them with the ${var:sqlstring} format in ClickHouse (or =~"$var" regex form in PromQL) so the “All” selection expands correctly. Keep route cardinality in check: a $route variable that lists 40,000 distinct URLs because someone stored full query strings will make the variable dropdown unusable and the GROUP BY route query slow. Normalise routes to their pattern (/products/:id, not /products/8472?ref=email) at ingestion, the same way you would template a span name.
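A route normaliser at ingestion can be very small. The sketch below is one possible approach under stated assumptions: the function name normalise_route is hypothetical, and the rule that all-digit segments and UUIDs are identifiers is an illustration that a real site would extend with its own patterns.

```python
import re
from urllib.parse import urlsplit

# Hypothetical rule: treat all-digit segments and UUIDs as identifiers.
_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def normalise_route(url: str) -> str:
    """Collapse a raw URL or path into a low-cardinality route pattern."""
    path = urlsplit(url).path  # drops the query string and fragment
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return "/".join(segments) or "/"


print(normalise_route("/products/8472?ref=email"))       # /products/:id
print(normalise_route("https://example.com/checkout"))   # /checkout
```

Running this before the row is written means the $route variable and every GROUP BY route see patterns rather than individual URLs.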
Debugging workflow
When a p75 line crosses a threshold, work the regression in a fixed order rather than guessing. The goal is to move from “the aggregate is bad” to “this cohort, this release, this asset” before touching code.
1. Identify the breached metric and band on the p75 trend panel, and confirm the sample-volume panel shows real traffic in the same window. Rule out a low-volume artifact first.
2. Segment with the cohort breakdown table: pivot the p75 by route, then device class, then country, until the regression collapses onto one or two cohorts.
3. Correlate the timing of the inflection with deploys and config changes by overlaying a release annotation on the panel; a step change that lines up with a deploy is a code regression, a gradual drift is usually traffic-mix or third-party.
4. Trace an exemplar slow session: in Grafana, click through from the panel to Explore, filter by the offending cohort, and pull individual beacons (or the linked OpenTelemetry spans) to see the resource waterfall and attribution.
5. Validate the suspected fix in the lab against that exact cohort’s conditions (throttle to the device class and network you isolated) before shipping.
6. Monitor the delta: after deploy, watch the same p75 panel filtered to the affected cohort and confirm the band recovers; keep the annotation in place so the recovery is documented for the next incident.
Field-data segmentation patterns
The aggregate hides almost everything interesting. The segmentations worth building as saved views, and the divergence each one is designed to expose:
- Device class (high-end / mid / low-end). INP and LCP diverge sharply by CPU. A healthy global p75 INP can mask a Poor experience for the low-end-Android cohort, which is often the largest segment in emerging markets. Watch for the device gap widening after a JavaScript bundle grows.
- Country / region. TTFB and LCP track geographic distance to your edge. A regional spike in TTFB usually means a CDN edge fell out of cache or a region failed over to origin — visible in the country breakdown long before the global p75 moves.
- Route / page type. Different templates have different performance budgets. Segmenting by route stops a heavy marketing landing page from masking a fast checkout, and vice versa. Route-level CLS spikes frequently trace to a single template’s ad or hero slot.
- Connection type (4G / 3G / wifi). Pairs with device class to separate “slow because the phone is slow” from “slow because the network is slow” — two regressions with completely different fixes.
The pattern to internalise: a regression that is uniform across all cohorts is usually a backend or third-party change; a regression isolated to one cohort is usually a frontend change interacting with that cohort’s constraints.
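The uniform-versus-isolated distinction can be checked mechanically from two cohort breakdowns, one before and one after the inflection. This is a rough sketch: the function name, the 10% relative-increase cutoff, and the cohort values are all hypothetical choices for illustration, not recommended defaults.

```python
def describe_regression(before, after, min_relative_increase=0.10):
    """Compare per-cohort p75 values from two windows.

    Returns "no regression", "uniform", or "isolated: <cohorts>".
    """
    # Only cohorts present in both windows with a positive baseline count.
    shared = [c for c in before if c in after and before[c] > 0]
    hit = sorted(
        c for c in shared
        if (after[c] - before[c]) / before[c] >= min_relative_increase
    )
    if not hit:
        return "no regression"
    if len(hit) == len(shared):
        return "uniform"
    return "isolated: " + ", ".join(hit)


# Hypothetical p75 LCP in seconds per device class.
before = {"high-end": 1.9, "mid": 2.3, "low-end-android": 3.1}
after_frontend = {"high-end": 1.9, "mid": 2.4, "low-end-android": 4.2}
after_backend = {"high-end": 2.4, "mid": 2.9, "low-end-android": 3.9}

print(describe_regression(before, after_frontend))  # isolated: low-end-android
print(describe_regression(before, after_backend))   # uniform
```

The output is a triage hint, not a diagnosis: it tells you whether to start with the backend and third parties or with the cohort's own constraints.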
Alerting on p75 regressions
Alerts should fire on the same statistic the dashboard shows, with a for: duration long enough to ride out single noisy buckets. Define them as Grafana alert rules (or Prometheus rules feeding Grafana) so they live in the same provisioning pipeline as the dashboards.
groups:
  - name: web_performance_alerts
    rules:
      - alert: P75_LCP_Regression
        expr: |
          histogram_quantile(0.75,
            sum by (le) (rate(rum_lcp_seconds_bucket[10m]))
          ) > 2.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "p75 LCP above the 2.5s Good threshold for 15m"
      - alert: P75_INP_Regression
        expr: |
          histogram_quantile(0.75,
            sum by (le) (rate(rum_inp_seconds_bucket[10m]))
          ) > 0.2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "p75 INP above 200ms for 15m"
      - alert: P75_CLS_Regression
        expr: |
          histogram_quantile(0.75,
            sum by (le) (rate(rum_cls_bucket[10m]))
          ) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p75 CLS above 0.1 for 10m"
Two refinements separate a useful alert from a noisy one. First, gate every alert behind a minimum sample count in the same window — an and on() (sum(rate(rum_lcp_seconds_count[10m])) > 0.5) clause stops a single slow session at 4 a.m. from paging anyone. Second, alert on the Needs Improvement boundary, not Poor: by the time p75 LCP crosses 4.0 s, a meaningful slice of users have already had a bad experience. Warning at 2.5 s gives you the lead time to act before it becomes a Poor-rated URL group in Search Console.
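The interaction between the sample gate and the hold duration can be modelled in a few lines. This is a simplified model, assuming one entry per rule evaluation and counting consecutive evaluations instead of wall-clock time; the function name, parameters, and series values are hypothetical.

```python
def should_fire(p75_series, rate_series, threshold, min_rate, for_evaluations):
    """Return True once the gated condition holds for enough consecutive evaluations.

    p75_series and rate_series are equal-length lists with one entry per
    rule evaluation; rate is beacons per second in the same window.
    """
    consecutive = 0
    for p75, rate in zip(p75_series, rate_series):
        # The breach only counts when enough traffic backs the percentile.
        breached = p75 > threshold and rate > min_rate
        consecutive = consecutive + 1 if breached else 0
        if consecutive >= for_evaluations:
            return True
    return False


# Hypothetical evaluations: one quiet-hour spike, then a sustained breach.
p75 = [2.4, 9.0, 2.4, 2.7, 2.8, 2.9]
rate = [3.0, 0.01, 3.0, 3.0, 3.0, 3.0]

print(should_fire(p75[:3], rate[:3], 2.5, 0.5, 3))  # False: the spike is gated out
print(should_fire(p75, rate, 2.5, 0.5, 3))          # True: three breaches in a row
```

The 9.0 s reading never starts the counter because its traffic rate is below the gate, while the smaller but sustained breach at normal volume does fire.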
Dashboards as code
A dashboard that exists only in someone’s browser is an outage waiting to happen. Grafana represents every dashboard as a JSON model, and that model belongs in version control next to the rest of your infrastructure. Provisioning then makes Grafana load it on startup, so the dashboard is reproducible, reviewable, and diffable.
Drop a provider file in /etc/grafana/provisioning/dashboards/:
apiVersion: 1
providers:
  - name: web-performance
    type: file
    disableDeletion: true
    allowUiUpdates: false  # edits must go through the repo, not the UI
    options:
      path: /var/lib/grafana/dashboards/web-performance
      foldersFromFilesStructure: true
Each panel in the committed JSON carries its threshold steps inline, so the Good/NI/Poor bands are part of the reviewed artifact rather than a manual click. Here is the threshold block for a p75 LCP time-series panel — note the values match the table above exactly:
{
  "title": "p75 LCP (s)",
  "type": "timeseries",
  "fieldConfig": {
    "defaults": {
      "unit": "s",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "orange", "value": 2.5 },
          { "color": "red", "value": 4.0 }
        ]
      },
      "custom": { "thresholdsStyle": { "mode": "area" } }
    }
  }
}
Because allowUiUpdates is false, the only way to change a band or a query is a pull request, which gives you review and an audit trail for every threshold change.
CI and provisioning checks
Treat the dashboard JSON as code that can break. A lightweight gate in CI catches the failures that otherwise surface as blank panels in production.
#!/usr/bin/env bash
set -euo pipefail

# 1. Every committed dashboard must be valid JSON.
for f in dashboards/web-performance/*.json; do
  jq empty "$f" || { echo "Invalid JSON: $f"; exit 1; }
done

# 2. Guard against the classic bug: an average sneaking into a CWV panel.
if grep -RnE '\bavg\(|"reducers".*"mean"' dashboards/web-performance/*.json; then
  echo "Mean aggregation found in a Core Web Vitals panel: use p75."
  exit 1
fi

# 3. Validate the dashboard against a live Grafana before merge.
#    overwrite: true replaces any existing copy, so point GRAFANA_URL
#    at a staging instance rather than production.
curl -fsS -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
  -H "Content-Type: application/json" \
  -X POST "${GRAFANA_URL}/api/dashboards/db" \
  --data @<(jq '{dashboard: ., overwrite: true, folderUid: "web-perf"}' \
      dashboards/web-performance/lcp.json) \
  > /dev/null && echo "Dashboard import OK"
The grep step is deliberately blunt: averaging instead of percentiles is the single most common defect in performance dashboards, and a one-line gate is cheaper than a quarter of misleading green panels.
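A structural check can complement the grep by asserting that each panel's unit and threshold steps still match the agreed constants. The sketch below assumes panels sit in the dashboard's top-level panels array (panels nested inside collapsed rows are not visited); the EXPECTED mapping and the function name are hypothetical.

```python
import json

# Hypothetical expectations: panel title -> (unit, threshold step values).
EXPECTED = {"p75 LCP (s)": ("s", [None, 2.5, 4.0])}


def threshold_errors(dashboard, expected):
    """List mismatches between committed panels and the expected bands."""
    errors = []
    for panel in dashboard.get("panels", []):
        title = panel.get("title")
        if title not in expected:
            continue
        unit, steps = expected[title]
        defaults = panel.get("fieldConfig", {}).get("defaults", {})
        actual = [
            step.get("value")
            for step in defaults.get("thresholds", {}).get("steps", [])
        ]
        if defaults.get("unit") != unit:
            errors.append(f"{title}: unit {defaults.get('unit')!r}, expected {unit!r}")
        if actual != steps:
            errors.append(f"{title}: steps {actual}, expected {steps}")
    return errors


# A dashboard where someone banded a seconds panel in milliseconds.
drifted = json.loads("""
{"panels": [{"title": "p75 LCP (s)", "fieldConfig": {"defaults": {
  "unit": "s",
  "thresholds": {"mode": "absolute", "steps": [
    {"color": "green", "value": null},
    {"color": "orange", "value": 2500},
    {"color": "red", "value": 4000}]}}}}]}
""")

for problem in threshold_errors(drifted, EXPECTED):
    print(problem)
```

In CI the same function would run over every committed file and fail the build when the returned list is non-empty.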
Failure modes and gotchas
- Averaging instead of percentiles. An avg() over a right-skewed distribution reports a number no real user experienced and stays green through regressions that are visible at p75. This is the cardinal sin, so gate against it in CI as shown above.
- Timezone and window mismatch. Store and query timestamps in UTC end to end. If beacons carry local time while ClickHouse aggregates in UTC, your daily traffic curve smears and toStartOfInterval buckets land on the wrong hour, making deploy-correlation impossible. Likewise, a fixed Prometheus rate() window that covers fewer than two scrapes returns no data, and one narrower than the query step skips samples at coarse zoom levels.
- High-cardinality variables. A $route or $url variable populated from unnormalised URLs (query strings, IDs) explodes the variable dropdown, slows GROUP BY, and in Prometheus can blow up series cardinality until ingestion stalls. Normalise to route patterns before storage.
- Histogram bucket boundaries. With histogram_quantile, accuracy is capped by bucket edges. If no bucket boundary sits near 2.5 s, your p75 LCP is interpolated across a wide bucket and can read meaningfully high or low. Place explicit boundaries at the threshold values.
- Low-sample noise. Quantiles over a handful of sessions swing wildly. Always pair a p75 panel with a sample-count panel and apply a HAVING / and on() floor before alerting.
- Unit drift. Mixing milliseconds and seconds between storage, query, and threshold steps produces panels that are silently wrong. Pin one unit per metric and assert it in review.
FAQ
Why p75 instead of the average or median for Core Web Vitals?
Field-data distributions are right-skewed: the mean is pulled around by a small number of very slow sessions, and the median can look healthy while a quarter of users suffer. Google assesses Core Web Vitals at the 75th percentile, so dashboards and alerts must use p75 to match what affects ranking and what users in the tail actually experience.
Should I use ClickHouse quantile() or Prometheus histogram_quantile()?
If you store one row per page view, quantile() (or quantileExact() for precision) over the raw values in ClickHouse is the most accurate. If your pipeline emits Prometheus histograms, you must use histogram_quantile(), whose accuracy depends entirely on bucket boundaries — define explicit buckets near each threshold value. Raw rows give exactness; histograms give cheaper storage and cardinality control.
How do I stop low-traffic windows from triggering false alerts?
Guard both panels and alerts with a minimum sample count. In ClickHouse add HAVING count() >= 50; in Prometheus alerts, and on() the rule with a rate(..._count[10m]) > threshold clause. A p75 computed from a handful of sessions is statistically meaningless and will spike during quiet hours.
Where should the Good/Needs Improvement/Poor thresholds live?
Inline in the committed dashboard JSON as absolute thresholds.steps, and in the alert rule expressions — both sourced from the same constants. Keeping them in version control means a band change goes through review, and it prevents the dashboard and the alerts from drifting apart.
How do I keep dashboards reproducible across environments?
Commit the JSON models to a repository and load them through Grafana file provisioning with allowUiUpdates: false, so changes only happen via pull request. Add a CI step that validates the JSON, blocks averaging in CWV panels, and imports each dashboard against a live Grafana before merge.
Related
- Building a Core Web Vitals Grafana Dashboard — the full step-by-step build of the dashboard described here.
- OpenTelemetry for Web RUM — standardize browser telemetry into spans and metrics that feed these panels.
- RUM Data Sampling Strategies — how sampling and p75 aggregation interact, and how to set minimum-sample floors.
- Self-Hosted RUM Pipeline with ClickHouse — the columnar store backing the quantile queries above.
- Self-Hosted Beacon Collection — the ingestion endpoint that lands the events these dashboards read.