RUM Data Sampling Strategies
At a few million sessions a day, ingesting every navigation, resource entry, and interaction beacon stops being a storage problem and becomes a statistics problem: the moment you drop beacons carelessly, your reported percentiles drift away from what users actually experience. This page, part of RUM Architecture, Tooling & Self-Hosting, covers how to cut beacon volume by an order of magnitude while keeping your headline p75 numbers honest. The approach is to sample at the session level, hash deterministically, oversample the cohorts that matter, and reweight at query time, so that the p75 aggregation and field-data sampling you report back to the business survive the cut.
The single most common production failure is silent: a team enables sampling to save money, the dashboards keep rendering plausible-looking numbers, and nobody notices that mobile sessions on flaky networks — the exact cohort that fails LCP — have been quietly thinned out of the dataset. The percentiles look better after sampling, which is precisely the tell that the sampling is biased.
Sample sessions, not events
The first decision dwarfs every tuning knob that follows: sample at the session level, never at the event level. A single session emits one LCP value, a stream of INP candidates that resolve to one final value, one or more CLS shift windows, plus FCP and TTFB navigation timings. If you make an independent random keep/drop decision for each event, you fracture the session: you keep the LCP beacon but drop the INP beacon for the same user, or you keep an early CLS window and drop the late one that pushed the session over the 0.25 Poor threshold.
Event-level sampling corrupts your data in two ways at once. First, you can no longer compute per-session derived metrics — “what fraction of sessions had both poor LCP and poor INP?” becomes unanswerable because the two metrics are no longer co-present for the same session. Second, CLS is itself a sum over the session’s worst shift window; dropping individual shift entries systematically understates CLS, because the layout-shift entries you happened to drop were contributing to the running total. The metric is not a point sample — it is an aggregate of the session, and sampling its constituents biases it downward.
Session-level sampling fixes both problems: you make exactly one keep/drop decision per session, at session start, and then every beacon that session produces inherits that decision. The session is your atomic unit. This is why the keep decision must be deterministic on the session id — the web-vitals API implementation on the page may fire the LCP callback at first paint, the INP callback at visibilitychange, and the CLS callback at pagehide, potentially across multiple beacon flushes. A re-rolled random number on each flush would give inconsistent answers. A hash of the session id gives the same answer every time, for the lifetime of the session, with no state to persist.
| Decision | Event-level sampling | Session-level sampling |
|---|---|---|
| Keep unit | Individual beacon | Whole session |
| Per-session metrics | Broken (metrics decorrelated) | Intact |
| CLS fidelity | Biased downward (shift entries dropped) | Correct |
| Reweighting | Ambiguous (which weight?) | One weight per session |
| Client state required | None, but wrong | None |
| Verdict | Avoid | Use this |
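To make the CLS bias concrete, here is a minimal JavaScript sketch. It assumes, for simplicity, that all of a session's layout shifts fall inside a single session window, so CLS is just their sum. The shift values and the `keepMask` parameter are illustrative and not taken from real traffic.

```javascript
// Illustrative only: treats a session's CLS as the sum of the layout-shift
// scores in one window (a simplification of the real windowing rules).
function clsFromShifts(shifts) {
  return shifts.reduce((sum, s) => sum + s, 0);
}

// Event-level sampling: each shift entry gets its own keep/drop decision.
// keepMask[i] === true means entry i survived sampling.
function eventSampledCls(shifts, keepMask) {
  return clsFromShifts(shifts.filter((_, i) => keepMask[i]));
}

// Session-level sampling: one decision for the whole session.
// A kept session reports its true CLS; a dropped session reports nothing.
function sessionSampledCls(shifts, keepSession) {
  return keepSession ? clsFromShifts(shifts) : null;
}

const shifts = [0.12, 0.05, 0.11]; // hypothetical session, true CLS about 0.28
const trueCls = clsFromShifts(shifts);

// Dropping the last entry pulls the session back under the 0.25 threshold.
const partial = eventSampledCls(shifts, [true, true, false]);
console.log(trueCls > 0.25, partial > 0.25); // true false
```

Because every shift score is non-negative, any event-level subset can only report a value at or below the true CLS, whereas the session-level path either reports the true value or reports nothing.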
Head-based vs tail-based, and why you need both
Sampling has two fundamentally different shapes, and production RUM needs a blend of them. The trade-offs are deep enough that they get their own treatment in Head-Based vs Tail-Based Sampling for RUM; the summary you need to design against is below.
Head-based sampling decides at the start of the session, before any metric value is known, purely from the session id and static context (device class, country, route). It is cheap, stateless, and the only option for a pure client sampler — the browser cannot know the LCP will be terrible before it happens. Its weakness is that it is blind to outcomes: a flat head-based rate keeps slow and fast sessions in identical proportion, which is statistically correct for percentiles but wastes budget on the dense middle of the distribution and thins out the rare tail you most want to debug.
Tail-based sampling decides after the session ends, when the LCP, INP, and CLS values are known, so it can preferentially keep the interesting sessions — the ones above the Poor thresholds, the ones with errors. This is what lets you keep 100% of sessions where INP exceeded 500 ms while keeping only 2% of the fast ones. Its cost is that the decision moves server-side: the client must beacon enough signal for the collector to make the tail decision, or the client buffers until session end and self-classifies.
| Property | Head-based | Tail-based |
|---|---|---|
| Decision time | Session start | Session end |
| Inputs available | Session id, static context | Final LCP / INP / CLS, errors |
| Where it runs | Client (stateless) | Collector or buffered client |
| Cost | Lowest | Higher (must see signal first) |
| Best at | Cheap uniform volume reduction | Preserving the slow tail |
| Bias risk | Low if rate is per-cohort consistent | High if you forget to reweight the kept tail |
The production pattern is a two-stage sampler: a head-based base rate keeps a representative spine of the full distribution (so unbiased percentiles are computable), and a tail-based rule rescues the Poor-threshold and error sessions the base rate would have discarded. Each stage carries its own keep probability, and therefore its own reweighting factor — which is where most implementations go wrong.
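One way to reason about the two stages is through each session's overall probability of being kept, since the weight is the inverse of that probability. The sketch below is an assumption-laden illustration, not an SDK API: `unit`, `headRate`, and `tailEligible` are hypothetical inputs, and it assumes the tail rule keeps every qualifying session.

```javascript
// Sketch of a two-stage keep decision. Assumptions (not from a real SDK):
// - `unit` is the session's deterministic hash mapped to [0, 1)
// - `headRate` is the head-based keep rate for the session's cohort
// - `tailEligible` is true when the session ended Poor or hit an error
function twoStageDecision({ unit, headRate, tailEligible }) {
  // Stage 2 (tail): qualifying sessions are always kept, so their
  // inclusion probability is 1 no matter what the head stage said.
  if (tailEligible) {
    return { keep: true, stage: 'tail', weight: 1 };
  }
  // Stage 1 (head): everything else is kept with probability headRate,
  // so a kept session stands in for 1 / headRate real sessions.
  const keep = unit < headRate;
  return { keep, stage: 'head', weight: keep ? 1 / headRate : 0 };
}

console.log(twoStageDecision({ unit: 0.9, headRate: 0.05, tailEligible: true }).weight);  // 1
console.log(twoStageDecision({ unit: 0.01, headRate: 0.05, tailEligible: false }).weight); // 20
```

Note that a tail-eligible session gets weight 1 even when the head stage would have kept it anyway. Under this model its keep probability is 1, so leaving it at the head weight would overcount the slow tail.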
Stratified oversampling of the cohorts you cannot lose
A single global keep rate is the enemy of mobile and emerging-market visibility. If 8% of your sessions are on slow-2g/2g connections and you keep a flat 5% of everything, only 0.4% of your total traffic survives as slow-network sessions. Slice that by route, country, or hour and each cell holds a handful of sessions: far too few to compute a stable per-cohort p75, and so noisy that a real regression hides inside the sampling variance. The fix is stratified sampling: define cohorts, and give each cohort its own keep rate, oversampling the sparse-but-important ones.
The cohorts that almost always warrant oversampling:
- Mobile / low-end device tier — the slow side of every Core Web Vitals distribution; usually undersampled relative to its UX importance.
- Emerging-market / high-latency geographies — sparse in absolute volume, high in TTFB and LCP, often the cohort a CDN change regresses first.
- Error sessions — JS errors, failed fetches, soft-404s; you want every one, because they are rare and forensically expensive to reproduce.
- Revenue-critical routes — checkout, signup, search; oversample so a route-specific INP regression is detectable within hours, not days.
The decision table below is the contract between the client SDK and the warehouse. The weight column is the load-bearing part: it is 1 / keep_rate, and it is what makes a 5%-kept fast session and a 50%-kept mobile session contribute correctly to the same blended percentile.
| Cohort | Keep rate | Weight (1 / rate) | Rationale |
|---|---|---|---|
| Default (desktop, fast net) | 0.05 | 20 | Dense middle; 5% is ample for a stable p75 |
| Mobile / low-end device | 0.25 | 4 | Sparse + slow; protect the tail of the distribution |
| slow-2g / 2g network | 0.50 | 2 | Very sparse; near-total capture |
| Emerging-market geo bucket | 0.30 | 3.33 | Low volume, high LCP/TTFB variance |
| Checkout / signup routes | 0.50 | 2 | Revenue-critical; fast regression detection |
| Error sessions (tail rule) | 1.00 | 1 | Always keep; never reconstructable later |
A session can match several cohorts. Resolve to a single keep rate by taking the maximum keep rate across all matched cohorts (keep more, never less), and set the weight to the inverse of that resolved rate. Never multiply rates together — that double-discounts and re-biases the percentile.
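The arithmetic behind that rule is short enough to spell out. This sketch reuses the mobile and checkout rates from the table above and assumes one specific way the mistake tends to happen: sessions are kept at the resolved (max) rate, but the weight is derived from the product of the matched rates. The session count is made up.

```javascript
// Hypothetical overlap: a session that is both mobile and on checkout.
const matchedRates = { mobile: 0.25, checkout: 0.5 };

// Correct: one resolved rate (the max); the weight is its inverse.
const resolvedRate = Math.max(...Object.values(matchedRates)); // 0.5
const resolvedWeight = 1 / resolvedRate;                        // 2

// Incorrect: sessions are still kept at resolvedRate, but the weight is
// derived from the product of the rates (0.25 * 0.5 = 0.125).
const productWeight =
  1 / Object.values(matchedRates).reduce((a, b) => a * b, 1);   // 8

// Expected population estimate for 1,000 real sessions in this overlap.
const realSessions = 1000;
const expectedKept = realSessions * resolvedRate;               // 500
console.log(expectedKept * resolvedWeight); // 1000 (matches reality)
console.log(expectedKept * productWeight);  // 4000 (overcounts 4x)
```

The general point is that the weight must be the inverse of the rate actually used for the keep decision. Any mismatch between the two inflates or deflates that cohort's share of the blended percentile.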
The deterministic client sampler
The client side is pure and stateless: hash the session id, compare against the resolved cohort threshold, and stamp the chosen weight onto every beacon so the warehouse never has to recompute it. The hash must be the same on every flush within a session, which is why we hash the session id (stable for the session’s lifetime) rather than re-rolling Math.random().
// Deterministic, session-stable sampler. No persisted state required:
// the same session_id always produces the same keep decision and weight.
// cyrb53: fast, non-cryptographic 53-bit hash. Stable across page loads.
function cyrb53(str, seed = 0) {
let h1 = 0xdeadbeef ^ seed, h2 = 0x41c6ce57 ^ seed;
for (let i = 0; i < str.length; i++) {
const ch = str.charCodeAt(i);
h1 = Math.imul(h1 ^ ch, 2654435761);
h2 = Math.imul(h2 ^ ch, 1597334677);
}
h1 = Math.imul(h1 ^ (h1 >>> 16), 2246822507) ^ Math.imul(h2 ^ (h2 >>> 13), 3266489909);
h2 = Math.imul(h2 ^ (h2 >>> 16), 2246822507) ^ Math.imul(h1 ^ (h1 >>> 13), 3266489909);
return 4294967296 * (2097151 & h2) + (h1 >>> 0);
}
// Map the session hash to a uniform float in [0, 1). Deterministic per session.
function sessionUnitInterval(sessionId) {
return cyrb53(sessionId) / 9007199254740992; // 2^53
}
// Resolve the keep rate by taking the MAX rate across matched cohorts.
// Returns { rate, weight, cohort } — weight is 1 / rate, stamped onto beacons.
function resolveSampling(ctx) {
const rules = [
{ cohort: 'default', rate: 0.05, match: () => true },
{ cohort: 'mobile', rate: 0.25, match: (c) => c.deviceTier === 'low-end' || c.isMobile },
{ cohort: 'slow-net', rate: 0.50, match: (c) => c.effectiveType === 'slow-2g' || c.effectiveType === '2g' },
{ cohort: 'geo-em', rate: 0.30, match: (c) => c.geoBucket === 'emerging' },
{ cohort: 'checkout', rate: 0.50, match: (c) => /^\/(checkout|signup)/.test(c.path) },
];
let best = rules[0];
for (const r of rules) {
if (r.match(ctx) && r.rate > best.rate) best = r;
}
return { rate: best.rate, weight: 1 / best.rate, cohort: best.cohort };
}
function buildSessionContext() {
const conn = navigator.connection || {};
return {
isMobile: matchMedia('(pointer: coarse)').matches,
deviceTier: (navigator.deviceMemory || 8) <= 2 ? 'low-end' : 'normal',
effectiveType: conn.effectiveType || 'unknown',
geoBucket: document.documentElement.dataset.geoBucket || 'core', // set server-side
path: location.pathname,
};
}
// Decide once per session; cache the decision on a module-level singleton.
let DECISION = null;
function samplingDecision(sessionId) {
if (DECISION) return DECISION;
const { rate, weight, cohort } = resolveSampling(buildSessionContext());
const keep = sessionUnitInterval(sessionId) < rate;
DECISION = { keep, weight, cohort, rate };
return DECISION;
}
// Tail rescue: force-keep a session the head rate would have dropped,
// e.g. on the first uncaught error or a known-bad final INP. Weight = 1.
function forceKeepAsTail() {
DECISION = { keep: true, weight: 1, cohort: 'tail-error', rate: 1 };
}
// Usage at beacon-build time. The weight rides along in the payload.
function decorateBeacon(beacon, sessionId) {
const d = samplingDecision(sessionId);
if (!d.keep) return null; // drop entire session's beacon
beacon.sampling_weight = d.weight; // warehouse counts this row as d.weight sessions
beacon.sampling_cohort = d.cohort;
return beacon;
}
Wire decorateBeacon into the same lifecycle that flushes vitals — the visibilitychange/pagehide sendBeacon path described in self-hosted beacon collection. Because the decision is memoised on DECISION, the LCP beacon at first paint and the INP/CLS beacon at page hide carry the identical weight and cohort, and a session that is dropped emits nothing at all — saving the network round-trip, not just the storage.
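A minimal sketch of that wiring follows. `createFlusher` is a hypothetical helper, `'/rum/collect'` is a placeholder endpoint, and the `decorate` and `send` functions are injected so the logic can be exercised outside a browser. In a page they would be decorateBeacon and a thin wrapper around navigator.sendBeacon.

```javascript
// Sketch of a flush hook. `decorate` stands in for decorateBeacon above and
// `send` for navigator.sendBeacon. The endpoint default is a placeholder.
function createFlusher({ decorate, send, sessionId, endpoint = '/rum/collect' }) {
  const queue = [];
  return {
    // Called from each vitals callback with a partial beacon.
    enqueue(beacon) {
      queue.push(beacon);
    },
    // Called on visibilitychange (hidden) and on pagehide.
    flush() {
      const kept = [];
      for (const beacon of queue.splice(0)) {
        const decorated = decorate(beacon, sessionId);
        if (decorated) kept.push(decorated); // dropped sessions yield null
      }
      if (kept.length === 0) return false;   // nothing sent, no round-trip
      return send(endpoint, JSON.stringify(kept));
    },
  };
}

// Browser wiring (assumes decorateBeacon and sessionId are in scope):
// const flusher = createFlusher({
//   decorate: decorateBeacon,
//   send: (url, body) => navigator.sendBeacon(url, body),
//   sessionId,
// });
// document.addEventListener('visibilitychange', () => {
//   if (document.visibilityState === 'hidden') flusher.flush();
// });
// addEventListener('pagehide', () => flusher.flush());
```

Draining the queue before decorating means a second flush on pagehide sends only what arrived since the visibilitychange flush, so no beacon is sent twice.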
Reweighting at query time to recover unbiased percentiles
Here is the rule that the entire strategy rests on: a kept session represents 1 / keep_rate real sessions, and every aggregate must account for that weight. A raw quantile() over the kept rows answers the wrong question — it gives you the p75 of the sample, in which oversampled mobile sessions are wildly over-represented, dragging the blended percentile toward the slow tail. The correct p75 is the weighted quantile, where each row counts for its sampling_weight.
-- BigQuery: weighted percentiles via cohort scaling.
-- Reweighting recovers the population p75 from a stratified sample.
WITH kept AS (
SELECT
session_id,
metric_name,
metric_value,
sampling_weight, -- = 1 / per-session keep rate, set on the client
sampling_cohort
FROM `rum.beacons`
WHERE event_date = CURRENT_DATE()
AND metric_name = 'LCP'
),
-- Expand each kept session into sampling_weight virtual rows, then take the
-- ordinary percentile of the expanded (population-representative) set.
expanded AS (
SELECT metric_value
FROM kept, UNNEST(GENERATE_ARRAY(1, CAST(ROUND(sampling_weight) AS INT64))) AS _
)
SELECT
APPROX_QUANTILES(metric_value, 100)[OFFSET(75)] AS p75_lcp_ms,
(SELECT COUNT(*) FROM kept) AS kept_sessions,
(SELECT SUM(sampling_weight) FROM kept) AS estimated_population
FROM expanded;
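Row expansion is the most portable way to express a weighted quantile and reads clearly in review, but it is O(sum(weight)) and will blow up at high weights. It also rounds each weight to an integer, so a fractional weight such as 3.33 is counted as 3. In ClickHouse, prefer the native weighted quantile, which is exact and avoids the expansion entirely, although the query below still rounds the weight to an integer: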
-- ClickHouse: native weighted quantile, no row expansion.
SELECT
quantileExactWeighted(0.75)(metric_value, toUInt64(round(sampling_weight))) AS p75_lcp_ms,
count() AS kept_sessions,
sum(sampling_weight) AS estimated_population
FROM rum_beacons
WHERE metric_name = 'LCP'
AND beacon_date = today();
The same weighting applies to every aggregate, not just percentiles. Conversion rates, error rates, and “share of sessions that are Poor” must all be computed as weighted ratios (sum(weight) WHERE poor / sum(weight)), never as raw row ratios. The moment one dashboard panel forgets the weight, it silently reports the sample’s composition instead of the population’s. Because oversampling deliberately skews the sample toward slow cohorts, that panel’s blended numbers will read worse than reality.
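The weighted-ratio rule can be sketched in a few lines of JavaScript. The rows below are made-up kept sessions, and `weightedShare` is a hypothetical helper. The 4000 ms boundary is the Core Web Vitals Poor threshold for LCP.

```javascript
// Each row is one kept session: its final LCP in ms and its sampling weight.
// Values are invented for illustration.
const rows = [
  { lcp: 1800, weight: 20 }, // desktop, kept at 5%
  { lcp: 2100, weight: 20 },
  { lcp: 4600, weight: 4 },  // mobile, kept at 25%
  { lcp: 5200, weight: 4 },
];

const POOR_LCP_MS = 4000; // LCP above this is rated Poor

// Weighted share: sum(weight) over Poor rows / sum(weight) over all rows.
function weightedShare(rows, isPoor) {
  let poor = 0;
  let total = 0;
  for (const r of rows) {
    total += r.weight;
    if (isPoor(r)) poor += r.weight;
  }
  return total === 0 ? NaN : poor / total;
}

const isPoorLcp = (r) => r.lcp > POOR_LCP_MS;
const rawShare = rows.filter(isPoorLcp).length / rows.length; // 2 / 4
const trueShare = weightedShare(rows, isPoorLcp);             // 8 / 48
console.log(rawShare, trueShare.toFixed(3)); // 0.5 0.167
```

In this toy sample the raw ratio claims half of all sessions are Poor, while the weighted ratio puts it at about one in six, which is the figure that reflects the population the sample was drawn from.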
Cost vs fidelity
Sampling buys cost reduction at the price of statistical precision, and the exchange rate is non-linear. Standard error scales with the inverse square root of the kept-session count, so halving the keep rate inflates the standard error of a percentile estimate by about 1.4× (√2), and you have to cut the rate to a quarter to double it. Cost falls linearly with the rate while the absolute precision you lose per step grows as rates fall. The practical sweet spot for a high-traffic site is a low single-digit base rate with aggressive cohort oversampling: total kept volume lands far below the unsampled cost, while the cohorts that drive business decisions stay statistically stable.
| Lever | Effect on cost | Effect on fidelity | When to pull it |
|---|---|---|---|
| Lower base keep rate | Large saving | Lower precision on dense cohorts | Desktop p75 is already rock-stable |
| Raise cohort oversampling | Modest cost increase | Stabilises sparse-cohort p75 | Mobile/geo p75 is noisy day-to-day |
| Add tail rescue rule | Small cost increase | Preserves Poor-threshold tail | You debug slow sessions and find none kept |
| Shorten aggregation window | None (query-side) | Lower precision per window | Only if you have the volume to spare |
A useful guardrail: never let any reported cohort fall below a few hundred kept sessions per aggregation window. Below that, sampling variance swamps real signal and your p75 jitters enough to trip false regression alerts. If a cohort dips under the floor, raise its keep rate, not the global one.
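The floor translates directly into a minimum keep rate per cohort. The helpers below are a hypothetical sketch: `minKeepRate` and `stdErrorMultiplier` are illustrative names, the 400-session floor is just one reading of "a few hundred", and the second function assumes the textbook 1/sqrt(n) scaling of standard error.

```javascript
// Smallest keep rate that still clears the kept-session floor for a cohort.
// `cohortSessions` is the cohort's real session count per aggregation window;
// `floor` is the minimum number of kept sessions you are willing to report on.
function minKeepRate(cohortSessions, floor) {
  if (cohortSessions <= 0) return 1; // no traffic: keep everything that shows up
  return Math.min(1, floor / cohortSessions);
}

// Factor by which standard error grows when the keep rate moves from
// `oldRate` to `newRate`, assuming error scales as 1 / sqrt(kept sessions).
function stdErrorMultiplier(oldRate, newRate) {
  return Math.sqrt(oldRate / newRate);
}

console.log(minKeepRate(40000, 400));            // 0.01
console.log(minKeepRate(900, 400).toFixed(3));   // 0.444
console.log(minKeepRate(100, 400));              // 1
console.log(stdErrorMultiplier(0.1, 0.05).toFixed(2)); // 1.41
```

A cohort with 900 sessions per window needs a keep rate of roughly 44% to stay above a 400-session floor, and a cohort with only 100 sessions cannot clear it even when everything is kept, which is a signal to widen the aggregation window for that cohort.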
Debugging workflow
When a sampled dashboard disagrees with reality — or with an unsampled control — work the problem in this order:
- Reproduce against a control cohort. Keep a small slice of traffic (1–5%, selected by a separate hash domain) at 100% with weight 1. Compare its p75 against the sampled-and-reweighted p75 for the same window. They should agree within sampling error; a persistent gap means a reweighting bug.
- Audit the weight, not the value. Query SELECT sampling_cohort, avg(sampling_weight), count() FROM ... GROUP BY 1. If the average weight per cohort is not the inverse of that cohort’s keep rate, the client and warehouse disagree about the rate; fix the contract, not the SQL.
- Check keep-rate stability per cohort. count() / estimated_population per cohort should match the configured keep rate. A cohort keeping far more or less than configured points at a misfiring match() predicate or a context field (e.g. geoBucket) not being populated.
- Verify hash stability across flushes. Log sampling_decision at both the first-paint flush and the page-hide flush for a canary build. If a session ever flips keep/drop mid-session, the hash input is not session-stable, usually because the session id regenerates on SPA route change.
- Diff against an unsampled window. If you can afford one unsampled hour per week, store it and compare full vs reweighted percentiles end to end. This catches whole classes of bias the control cohort can miss.
- Watch the kept-session floor. Alert when any reported cohort drops below its minimum kept-session count; treat it as a data-quality incident, not a performance regression.
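The control cohort from the first step can be selected with the same hash the sampler uses, seeded differently so the two decisions are independent. This is a sketch under assumptions: `CONTROL_SEED`, `CONTROL_FRACTION`, `isControlSession`, and `controlOrSampled` are hypothetical names and values, and cyrb53 is repeated here only to keep the block self-contained.

```javascript
// cyrb53 as used by the sampler above; the seed selects the hash domain.
function cyrb53(str, seed = 0) {
  let h1 = 0xdeadbeef ^ seed, h2 = 0x41c6ce57 ^ seed;
  for (let i = 0; i < str.length; i++) {
    const ch = str.charCodeAt(i);
    h1 = Math.imul(h1 ^ ch, 2654435761);
    h2 = Math.imul(h2 ^ ch, 1597334677);
  }
  h1 = Math.imul(h1 ^ (h1 >>> 16), 2246822507) ^ Math.imul(h2 ^ (h2 >>> 13), 3266489909);
  h2 = Math.imul(h2 ^ (h2 >>> 16), 2246822507) ^ Math.imul(h1 ^ (h1 >>> 13), 3266489909);
  return 4294967296 * (2097151 & h2) + (h1 >>> 0);
}

const CONTROL_SEED = 0x5eed;   // hypothetical; must differ from the sampler's seed (0)
const CONTROL_FRACTION = 0.02; // hypothetical 2% control slice

// Deterministic per session, and independent of the sampler's own hash
// because it is computed in a different seed domain.
function isControlSession(sessionId) {
  return cyrb53(sessionId, CONTROL_SEED) / 2 ** 53 < CONTROL_FRACTION;
}

// Control sessions bypass sampling entirely and carry weight 1.
function controlOrSampled(sessionId, sampledDecision) {
  return isControlSession(sessionId)
    ? { keep: true, weight: 1, cohort: 'control', rate: 1 }
    : sampledDecision;
}
```

Tagging these rows with their own cohort label lets the comparison query read them separately from the sampled rows, so the control p75 and the reweighted p75 are computed from disjoint data.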
Field segmentation: what divergence to watch
Stratified sampling exists to make segmentation trustworthy, so the segments you oversample are exactly the ones to scrutinise:
- Device class. Low-end and mobile should show materially worse LCP and INP than desktop. If your sampled mobile p75 looks suspiciously close to desktop, you are almost certainly still under-keeping mobile or under-weighting it at query time.
- Network type. slow-2g/2g should have a long TTFB and LCP tail. A flat distribution here is the classic sign that flaky-network beacons are being dropped before they send: a beacon-delivery bug masquerading as a sampling result.
- Geography. Emerging-market buckets should diverge from core geos on TTFB. If a CDN or origin change regresses a single region, the oversampled geo cohort is where you will see it first, days ahead of the blended p75.
- Route. Revenue routes should be individually stable thanks to oversampling. A route-level INP regression that does not move the global number is exactly the case oversampling is meant to surface.
Failure modes and gotchas
- Event sampling masquerading as session sampling. The headline failure. Independent per-beacon decisions decorrelate metrics and bias CLS downward. Always make exactly one decision per session id.
- Sampling the tail away. A flat head-based rate keeps so few Poor-threshold sessions that you cannot debug them. Add a tail rescue rule that force-keeps error and over-threshold sessions at weight 1.
- Forgetting to reweight. Raw quantile() over a stratified sample reports the sample’s composition, which oversampling deliberately distorts. Every aggregate must use the weight. This is the bug that makes oversampled mobile p75 look implausibly bad, or, after a half-fix, implausibly good.
- Double-discounting overlapping cohorts. A session matching both “mobile” and “checkout” must resolve to one rate (the max), with weight = 1 / that rate. Multiplying the rates re-biases the percentile.
- Non-stable session ids. If the session id regenerates on SPA navigation, the hash input changes and the keep decision flips mid-session, splitting one session into kept and dropped fragments. Pin the session id for the session lifetime.
- Beacon loss disguised as sampling. Dropped beacons on flaky networks bias the result exactly like under-sampling mobile, but no weight can correct for data that never arrived. Use sendBeacon on pagehide, and treat high beacon-drop rates on slow networks as a delivery bug, not a sampling tuning issue.
- Weight precision drift. Storing sampling_weight as a float and rounding inconsistently between client and warehouse introduces small per-cohort bias. Pick one rounding rule (round to nearest integer at query time) and apply it everywhere.
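For the non-stable session id case, one common way to pin the id is to store it once and reuse it on every later read. The sketch below assumes sessionStorage is an acceptable definition of "session" (it is scoped to a tab), and both `SESSION_KEY` and `getOrCreateSessionId` are hypothetical names. The storage object and id generator are injected so the logic can be checked outside a browser.

```javascript
// Placeholder storage key; pick whatever fits your naming scheme.
const SESSION_KEY = 'rum.session_id';

// Return the pinned session id, creating and storing one only on first use.
// `storage` needs getItem/setItem (e.g. sessionStorage); `generate` returns
// a fresh id string.
function getOrCreateSessionId(storage, generate) {
  const existing = storage.getItem(SESSION_KEY);
  if (existing) return existing;
  const fresh = generate();
  storage.setItem(SESSION_KEY, fresh);
  return fresh;
}

// Browser usage (assumes crypto.randomUUID is available):
// const sessionId = getOrCreateSessionId(sessionStorage, () => crypto.randomUUID());
```

Because SPA route changes never touch the stored value, the hash input stays fixed and the keep decision cannot flip mid-session. Storage access can throw in some privacy modes, so production code would likely wrap these calls and fall back to an in-memory id.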
CI/CD gating
Sampling correctness is testable in CI, and it should be — a reweighting regression is invisible in production until someone manually audits it.
- Reweighting unit test. In CI, generate a synthetic population with a known p75, run it through the sampler and the reweighting query, and assert the recovered p75 is within tolerance of ground truth. This catches double-discounting and missing-weight bugs before they ship.
- Determinism test. Assert samplingDecision(id) returns an identical { keep, weight } across repeated calls and across simulated flushes, which guards hash stability.
- Cohort coverage gate. Fail the build if any configured cohort would keep zero sessions at the configured rate against a representative traffic fixture (a misconfigured predicate that silently matches nothing).
- Percentile-drift gate. In the nightly aggregation pipeline, compare the reweighted p75 against the control-cohort p75 and fail the job if they diverge beyond sampling error — turning silent bias into a loud, blocking signal. Wire this into the same pipeline that feeds your Grafana dashboards for web performance.
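A reweighting unit test along these lines might look like the sketch below. Everything in it is an assumption made for illustration: the synthetic population, the 2% tolerance, and the use of systematic every-k-th sampling as a deterministic stand-in for the hash sampler. `weightedQuantile` is a plain JavaScript helper, not a warehouse function.

```javascript
// Weighted quantile: smallest value whose cumulative weight reaches q * total.
function weightedQuantile(rows, q) {
  const sorted = [...rows].sort((a, b) => a.value - b.value);
  const total = sorted.reduce((s, r) => s + r.weight, 0);
  let cum = 0;
  for (const r of sorted) {
    cum += r.weight;
    if (cum >= q * total) return r.value;
  }
  return NaN; // only reached for empty input
}

// Synthetic population with a known shape (all numbers are made up):
// 8,000 fast desktop sessions and 2,000 slow mobile sessions.
function buildPopulation() {
  const pop = [];
  for (let i = 0; i < 8000; i++) pop.push({ cohort: 'desktop', value: 1000 + i * 0.25, weight: 1 });
  for (let i = 0; i < 2000; i++) pop.push({ cohort: 'mobile', value: 2000 + i * 2, weight: 1 });
  return pop;
}

// Deterministic stand-in for the hash sampler: keep every k-th session per
// cohort, where k = 1 / keep rate, and stamp k as the weight.
function sample(pop, everyK) {
  const seen = {};
  return pop
    .filter((s) => {
      const n = (seen[s.cohort] = (seen[s.cohort] ?? -1) + 1);
      return n % everyK[s.cohort] === 0;
    })
    .map((s) => ({ ...s, weight: everyK[s.cohort] }));
}

const population = buildPopulation();
const kept = sample(population, { desktop: 20, mobile: 4 }); // 5% and 25%

const truth = weightedQuantile(population, 0.75);
const reweighted = weightedQuantile(kept, 0.75);
const raw = weightedQuantile(kept.map((r) => ({ ...r, weight: 1 })), 0.75);
console.log({ truth, reweighted, raw });

// The CI assertion: the reweighted p75 must track ground truth.
const relErr = Math.abs(reweighted - truth) / truth;
if (relErr > 0.02) throw new Error(`reweighted p75 off by ${(relErr * 100).toFixed(1)}%`);
```

The unweighted `raw` value is printed alongside on purpose: in this fixture it lands far above the true p75 because mobile is oversampled, which is the failure the test exists to catch if a query ever drops the weight.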
FAQ
Why sample sessions instead of individual events?
A session’s metrics are correlated: one LCP, one final INP, and a CLS that is itself a sum over the session’s worst shift window. Independent per-event sampling decorrelates those metrics (you cannot ask “share of sessions poor on both LCP and INP”) and biases CLS downward by dropping contributing shift entries. One keep/drop decision per session, deterministic on the session id, keeps the session intact and gives every beacon a single unambiguous weight.
How do I recover the true p75 after oversampling mobile traffic?
Attach a sampling_weight of 1 / keep_rate to every kept session and use a weighted quantile at query time (quantileExactWeighted in ClickHouse, weighted-row expansion or an equivalent UDF in BigQuery). The weight makes a 5%-kept desktop session and a 50%-kept mobile session contribute in correct population proportion, so the blended p75 matches the unsampled population p75 within sampling error.
What is the difference between head-based and tail-based sampling here?
Head-based decides at session start from the session id and static context — cheap, stateless, and the only option for a pure client sampler, but blind to outcomes. Tail-based decides at session end once LCP/INP/CLS and errors are known, so it can preferentially keep the slow tail. Production blends them: a head-based base rate for a representative spine plus a tail rule that rescues Poor-threshold and error sessions. See Head-Based vs Tail-Based Sampling for RUM.
Will sampling make my Core Web Vitals look better than they really are?
Only if you sample badly. Biased sampling — event-level decisions, dropping flaky-network beacons, or forgetting to reweight — systematically removes slow sessions and flatters your numbers. Correctly implemented session-level sampling with query-time reweighting reproduces the unbiased p75; the safeguard is a 100%-kept control cohort whose percentiles you continuously diff against the sampled-and-reweighted result.
How low can I set the base keep rate?
Low enough that every reported cohort still clears a floor of a few hundred kept sessions per aggregation window. Below that, sampling variance swamps real signal and the p75 jitters into false regression alerts. Drop the base rate aggressively for dense desktop traffic, but raise the per-cohort rate (not the global one) whenever a sparse cohort’s p75 gets noisy.
Related
- Head-Based vs Tail-Based Sampling for RUM — when to decide keep/drop at session start vs session end, and how to blend both.
- Self-Hosted Beacon Collection — the ingestion endpoint where sampled beacons land and weights are persisted.
- Designing a BigQuery Schema for RUM Events — model sampling_weight and sampling_cohort so reweighted queries stay cheap.
- Grafana Dashboards for Web Performance — render weighted percentiles and per-cohort drift without re-introducing bias.
- User Impact Mapping — translate the cohorts you oversample into the user and revenue impact that justifies the extra capture.