Synthetic vs Field Data Trade-offs

Every performance team runs into the same disagreement: a Lighthouse score says the page is fast, the field data says a quarter of users are having a bad time, and someone has to decide which number is real. Both are measuring the same page, but they answer different questions — and treating one as a proxy for the other is the most common way performance work gets misdirected. This guide, part of Core Web Vitals & Performance Metrics Fundamentals, sets out exactly when lab/synthetic testing is authoritative, why field-collected p75 is the assessment of record, and how to wire the two together so they reinforce rather than contradict each other.

Synthetic (lab) testing runs the page under controlled, repeatable conditions: a fixed device profile, a throttled network, a cold or warm cache, no real user interacting. Field testing — Real-User Monitoring — captures what actually happened on the devices, networks, and pages of real visitors, then aggregates it. The headline number for a Core Web Vitals assessment is always the 75th percentile of field data, never a lab point. A lab run gives you one sample under one configuration; the field gives you a distribution across millions of configurations you could never enumerate by hand.

A lab run is one point near the fast end; the field is a distribution. The Good/Poor verdict rides on the p75, which lab testing structurally cannot see. See how p75 is computed from sampled field data.

Why p75 Field Data Is the Assessment of Record

Google’s Core Web Vitals program — and the search-ranking signal derived from it — evaluates a page using the 75th percentile of real-user measurements, segmented by mobile and desktop. That choice is deliberate: the p75 protects three of every four sessions while staying resistant to the extreme outliers that dominate a mean. A single broken session on a 2018 budget phone over a saturated cellular link can be 10× your median; an average would let that one session swing the whole report, while p75 absorbs it. The mechanics of computing that percentile from sampled beacons are covered in RUM data sampling strategies and p75 aggregation.
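To make the outlier argument concrete, here is a minimal sketch that compares a nearest-rank p75 with a mean. The sample values are invented for illustration, and nearest-rank is only one of several percentile definitions; a production pipeline may interpolate or use an approximate sketch instead.

```javascript
// Nearest-rank percentile: the smallest value with at least p of the samples at or below it.
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length);   // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Hypothetical LCP samples in ms: seven ordinary sessions plus one broken one.
const lcpSamples = [1800, 1900, 2000, 2100, 2200, 2400, 2600, 21000];

console.log(percentile(lcpSamples, 0.75)); // 2400
console.log(mean(lcpSamples));             // 4500
```

With these numbers the single 21-second session drags the mean to 4500 ms, past the 4.0 s Poor boundary, while the p75 stays at 2400 ms.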

The thresholds the p75 is graded against are fixed by the Web Vitals program rather than tuned per site, and lab tooling reports against the same bands, so the numbers are comparable even though the inputs differ.

Metric	Good (≤)	Needs Improvement (≤)	Poor (>)	Field p75 is the verdict for
Largest Contentful Paint	2.5 s	4.0 s	4.0 s	render of largest above-fold element
Interaction to Next Paint	200 ms	500 ms	500 ms	responsiveness across the whole visit
Cumulative Layout Shift	0.1	0.25	0.25	visual stability over the session
First Contentful Paint	1.8 s	3.0 s	3.0 s	perceived load start (diagnostic)
Time to First Byte	800 ms	1.8 s	1.8 s	server/CDN responsiveness (diagnostic)
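The bands in the table can be applied mechanically to any p75 value. The following is a small sketch of that grading step; the `THRESHOLDS` object and the `rate` function are illustrative names, not part of any library.

```javascript
// Thresholds copied from the table above: [good upper bound, needs-improvement upper bound].
const THRESHOLDS = {
  LCP: [2500, 4000],   // ms
  INP: [200, 500],     // ms
  CLS: [0.1, 0.25],    // unitless
  FCP: [1800, 3000],   // ms
  TTFB: [800, 1800]    // ms
};

// Grade a p75 value against the bands for the named metric.
function rate(name, p75) {
  const bounds = THRESHOLDS[name];
  if (!bounds) throw new Error(`unknown metric: ${name}`);
  const [good, needsImprovement] = bounds;
  if (p75 <= good) return 'good';
  if (p75 <= needsImprovement) return 'needs-improvement';
  return 'poor';
}

console.log(rate('LCP', 3200)); // needs-improvement
console.log(rate('INP', 280));  // needs-improvement
console.log(rate('CLS', 0.1));  // good
```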

The two interaction-and-stability metrics are where lab testing breaks down hardest: INP and CLS both depend on a real person scrolling, typing, and clicking over the life of a visit. A synthetic run with no interaction reports no INP at all, and a run with scripted interactions reports one that is structurally optimistic. This is why the field p75 from real interactions is the only defensible source of truth for those two.

The Variance Lab Testing Cannot Model

Synthetic testing fixes the variables that, in production, are exactly the variables that hurt you. A lab profile picks one device, one network shape, one cache state. The field is the Cartesian product of all of them, and the slow tail of that product is where p75 lives.

Device CPU spread. Lab usually emulates a mid-tier phone with a CPU-throttling multiplier. Real traffic spans flagship silicon to five-year-old budget chips whose main thread is 6–8× slower. INP and hydration-bound LCP scale almost linearly with CPU, so the field tail diverges from the lab point dramatically.
Network reality. Throttling presets (“Slow 4G”) are a single fixed RTT and bandwidth. Real connections have variable RTT, packet loss, captive portals, radio wake-up latency, and connection migration — none of which a token-bucket throttle reproduces.
Third-party contention. A clean lab run may block ads, consent dialogs, tag managers, and A/B-test scripts, or run them un-contended. In the field those scripts compete for the same main thread as your hydration, inflating INP and pushing back LCP. This is one of the largest and least predictable sources of lab/field divergence.
Cache and session state. Lab is typically cold-cache, first-visit. Real traffic is a mix of cold, warm, returning, and bfcache-restored sessions, each with a different curve.
Engagement timing. CLS and INP accumulate over the real dwell time of a visit. A lab run that loads and stops never sees the late-injected ad slot or the slow click that defines the p75.

You can shift a lab profile toward the slow tail (throttle harder, slower device emulation), but you are still producing one point. You cannot emulate a distribution, and the assessment is a property of the distribution.
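Shifting the lab profile toward the slow tail might look like the fragment below. It is a sketch of a Lighthouse CI collect section using Lighthouse's `throttling` settings; the specific values are assumptions chosen for illustration, not recommendations, and the result is still a single point rather than a distribution.

```javascript
// lighthouserc.js fragment (values are illustrative assumptions).
const slowTailProfile = {
  ci: {
    collect: {
      numberOfRuns: 5,
      url: ['https://staging.example.com/'],
      settings: {
        throttlingMethod: 'simulate',
        throttling: {
          cpuSlowdownMultiplier: 6,  // Lighthouse's default mobile profile uses 4
          rttMs: 300,                // default mobile profile uses 150
          throughputKbps: 700        // default mobile profile uses 1638.4
        }
      }
    }
  }
};

// In a real lighthouserc.js this object would be the module's export:
// module.exports = slowTailProfile;
```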

When to Reach for Each: Decision Matrix

Synthetic and field testing are complementary tools with non-overlapping strengths. Use this matrix to decide which one answers the question in front of you.

Dimension	Synthetic / lab (Lighthouse, WebPageTest, DevTools, Lighthouse-CI)	Field / RUM (your beacons, CrUX)	Use when
Reproducibility	High — same input, same output	Low — every session differs	Lab: you need a stable number to diff
Pre-deploy coverage	Yes — runs in CI before merge	No — needs real traffic after release	Lab: gating a pull request
Ground truth on real users	No — emulated conditions	Yes — actual devices and networks	Field: deciding if users are actually OK
INP / CLS accuracy	Poor — no real interaction or dwell	High — captured over real visits	Field: any responsiveness/stability verdict
Attribution detail	High — full trace, call tree, waterfall	Partial — attribution build adds element/event detail	Lab: root-causing a specific regression
Ranking signal source	Not used by search	p75 of field data is the signal	Field: anything tied to search/CWV pass
Variance / outliers	Run-to-run jitter only; one sample per run	Full distribution, slow tail visible	Field: understanding the p75 and beyond
Cost to run repeatedly	Low and on-demand	Continuous; needs ingestion + storage	Lab: ad-hoc checks; Field: ongoing SLO

The short rule: lab tells you whether a change you control made the page faster in a controlled setting; field tells you whether real users are having a good experience. Lab answers “did my fix work in isolation?” Field answers “are we passing?”

A Harness That Captures Both and Compares

To reconcile lab and field you need them in the same units and the same store. The browser snippet below captures the live field metrics with the web-vitals library on top of PerformanceObserver, tags each beacon with the device/network context that explains divergence, and posts it to your collector. A companion Node script reads a Lighthouse-CI synthetic result for the same URL and computes the lab-vs-field delta so the comparison is mechanical, not eyeballed.

// field.js — runs in the browser, reports real-user vitals with context
import { onLCP, onINP, onCLS, onFCP, onTTFB } from 'web-vitals';

const beacon = { url: '/rum/collect', batch: [] };

function context() {
  const nav = navigator;
  const conn = nav.connection || {};
  return {
    route: location.pathname,
    effectiveType: conn.effectiveType || 'unknown',   // '4g', '3g', ...
    downlink: conn.downlink ?? null,
    deviceMemory: nav.deviceMemory ?? null,            // GB, coarse
    cpuCores: nav.hardwareConcurrency ?? null,
    dpr: window.devicePixelRatio,
    saveData: conn.saveData ?? false,
    source: 'field'
  };
}

function record(metric) {
  beacon.batch.push({
    name: metric.name,            // 'LCP' | 'INP' | 'CLS' | 'FCP' | 'TTFB'
    value: metric.value,         // ms, except CLS (unitless)
    rating: metric.rating,       // 'good' | 'needs-improvement' | 'poor'
    id: metric.id,
    ...context()
  });
}

[onLCP, onINP, onCLS, onFCP, onTTFB].forEach((fn) => fn(record));

function flush() {
  if (!beacon.batch.length) return;
  const body = JSON.stringify(beacon.batch);
  navigator.sendBeacon(beacon.url, body);
  beacon.batch = [];
}

// Finalize when the page is hidden; visibilitychange is dispatched on document,
// and pagehide is kept as a fallback for browsers that skip it on unload.
document.addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'hidden') flush();
});
addEventListener('pagehide', flush);

// compare.js — Node side: diff one synthetic run against the field p75
// Reads a Lighthouse-CI JSON result and a p75 row pulled from your RUM store.
import { readFileSync } from 'node:fs';

// Lighthouse stores LCP/CLS as audits; numericValue is in ms (CLS unitless).
function labMetrics(lhrPath) {
  const lhr = JSON.parse(readFileSync(lhrPath, 'utf8'));
  const a = lhr.audits;
  return {
    LCP: a['largest-contentful-paint'].numericValue,
    CLS: a['cumulative-layout-shift'].numericValue,
    FCP: a['first-contentful-paint'].numericValue,
    TTFB: a['server-response-time'].numericValue
    // INP is intentionally absent: a non-interactive lab run cannot measure it.
  };
}

// fieldP75: { LCP, INP, CLS, FCP, TTFB } already aggregated at p75 from RUM.
function compare(lhrPath, fieldP75) {
  const lab = labMetrics(lhrPath);
  const rows = [];
  for (const name of Object.keys(fieldP75)) {
    const labVal = lab[name];
    const fieldVal = fieldP75[name];
    if (labVal == null) {
      rows.push({ name, lab: 'n/a', field: fieldVal, note: 'lab cannot measure' });
      continue;
    }
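    // Guard against a zero lab value (a lab CLS of 0 is common).
    const ratio = labVal > 0 ? fieldVal / labVal : null;
    // CLS is a unitless fraction; rounding it like milliseconds would zero it out.
    const fmt = (v) => (name === 'CLS' ? Number(v.toFixed(3)) : Math.round(v));
    rows.push({
      name,
      lab: fmt(labVal),
      field: fmt(fieldVal),
      ratio: ratio == null ? 'n/a' : Number(ratio.toFixed(2)),
      diverged: ratio != null && ratio > 1.5   // field is >50% worse than lab → investigate
    });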
  }
  return rows;
}

const result = compare('./lhci/lhr.json', {
  LCP: 3200, INP: 280, CLS: 0.12, FCP: 2100, TTFB: 950
});
console.table(result);

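The ratio column is the working signal. A ratio near 1.0 means the lab profile happens to sit near your field p75, which is convenient but coincidental. A ratio well above 1.5 is the normal, expected state for CPU-bound LCP, and it is telling you the lab is testing a faster machine than your 75th-percentile user owns. INP gets no ratio in this harness because the non-interactive lab run produces no value to compare against; if you script interactions to obtain one, expect the same kind of gap.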

Reconciling Lab/Field Divergence: A Triage Workflow

When the lab says one thing and the field says another, the divergence is itself the most useful diagnostic you have — it points directly at a variable the lab is not modelling. Work the gap in this order.

Confirm you are comparing like for like. Lab is desktop-by-default in many configs; field p75 is reported split by mobile/desktop. Pull the matching form-factor field segment before comparing. A “regression” is often a desktop lab number held up against a mobile-heavy field p75.
Reproduce the regression in the lab first. If a field metric got worse, try to make a synthetic run reproduce it — slower CPU multiplier, harder network throttle, real third-party scripts unblocked. A reproducible lab regression is fixable with a fast inner loop; an irreproducible one means the cause lives in field-only variance.
Trace the waterfall on the offending run. Use the DevTools Performance panel or WebPageTest filmstrip to find the render-blocking resource, long task, or layout shift driving the metric. Confirm the attributed element matches what the attribution build reports from the field.
Segment the field data along the suspected axis. If you suspect device, group p75 by deviceMemory/hardwareConcurrency; if network, by effectiveType; if third parties, by whether consent was granted. The axis where p75 splits hardest is your cause.
Validate the candidate fix in the lab. Re-run Lighthouse-CI with the patch and confirm the controlled number moves the right direction. This is the fast, deterministic check before you spend real traffic on it.
Ship behind a flag and watch the field delta. Deploy, then watch the field p75 for the affected segment converge toward the lab improvement over the next traffic window. The field is the only place the fix is confirmed real.

The loop is deliberately asymmetric: you debug and gate in the lab because it is fast and deterministic, and you confirm in the field because that is where the verdict lives.
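The first step of the workflow, matching form factor before comparing, can be made mechanical. The sketch below reads the form factor a Lighthouse result was produced with and picks the field segment to hold it against. `matchingFieldSegment` and the shape of `fieldP75ByFormFactor` are hypothetical; `configSettings.formFactor` is the field recent Lighthouse versions write into the result, and older reports may name it differently.

```javascript
// Pick the field p75 segment that matches the form factor of a Lighthouse result.
// fieldP75ByFormFactor is an assumed shape: { mobile: {...p75s}, desktop: {...p75s} }.
function matchingFieldSegment(lhr, fieldP75ByFormFactor) {
  const formFactor = lhr.configSettings?.formFactor; // 'mobile' | 'desktop'
  const segment = fieldP75ByFormFactor[formFactor];
  if (!segment) {
    throw new Error(`no field segment for form factor: ${formFactor}`);
  }
  return { formFactor, segment };
}

// Invented numbers for illustration.
const fieldByFormFactor = {
  mobile: { LCP: 3200, CLS: 0.12 },
  desktop: { LCP: 1900, CLS: 0.05 }
};
const exampleLhr = { configSettings: { formFactor: 'mobile' } };

console.log(matchingFieldSegment(exampleLhr, fieldByFormFactor));
```

Failing loudly when no segment matches is a deliberate choice here: silently falling back to a global p75 would reintroduce the cross-form-factor comparison the step exists to prevent.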

Field-Data Segmentation Patterns

A single global p75 hides the very segment that explains a lab/field divergence. The same context fields the browser harness attaches are the dimensions you slice on after ingestion. The most productive cuts:

Device class. Bucket by deviceMemory and hardwareConcurrency into low/mid/high tiers. CPU-bound metrics — INP, hydration-driven LCP — separate sharply here, and the low tier is usually what pulls the global p75 above threshold.
Network type. Group by effectiveType and saveData. TTFB and LCP load phases dominate on 3G and constrained links; this is where lab throttle presets are least representative.
Geography / CDN PoP. Routing and edge-cache hit rate move TTFB by hundreds of milliseconds. A regional p75 split exposes a misrouted PoP that a single lab location never sees.
Route and template. p75 per route prevents one heavy template (a search results page, a product detail page) from being masked by light ones. Always aggregate per-route, then roll up.
Third-party state. Split by consent granted/denied or ad-slot present/absent to quantify exactly how much of your INP and CLS the third parties own.

The reason these segments matter is the same reason p75 matters: you are looking for the population whose experience drags the percentile, and that population is defined by one of these axes. Tying those segments back to conversion and revenue impact is what turns a slow segment into a funded fix.
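As a rough sketch of these cuts, the code below groups beacons in the shape the browser harness emits and reports a nearest-rank p75 per group. The tier cut-offs in `deviceTier` are assumptions to tune against your own traffic, and the sample beacons are invented.

```javascript
// Nearest-rank p75 over an array of numbers.
function p75(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.ceil(0.75 * sorted.length) - 1];
}

// Hypothetical tiering rule based on the coarse hints the harness collects.
function deviceTier({ deviceMemory, cpuCores }) {
  if (deviceMemory == null || cpuCores == null) return 'unknown';
  if (deviceMemory <= 2 || cpuCores <= 4) return 'low';
  if (deviceMemory >= 8 && cpuCores >= 8) return 'high';
  return 'mid';
}

// Group beacons for one metric by a key function and report p75 per group.
function p75By(beacons, metricName, keyFn) {
  const groups = new Map();
  for (const b of beacons) {
    if (b.name !== metricName) continue;
    const key = keyFn(b);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(b.value);
  }
  const out = {};
  for (const [key, values] of groups) {
    out[key] = { p75: p75(values), count: values.length };
  }
  return out;
}

// Invented sample beacons.
const beacons = [
  { name: 'INP', value: 120, deviceMemory: 8, cpuCores: 8, route: '/' },
  { name: 'INP', value: 150, deviceMemory: 8, cpuCores: 8, route: '/' },
  { name: 'INP', value: 380, deviceMemory: 2, cpuCores: 4, route: '/' },
  { name: 'INP', value: 520, deviceMemory: 2, cpuCores: 8, route: '/search' },
  { name: 'LCP', value: 2100, deviceMemory: 4, cpuCores: 8, route: '/' }
];

console.log(p75By(beacons, 'INP', deviceTier));
console.log(p75By(beacons, 'INP', (b) => b.route));
```

Keeping the count next to each p75 matters in practice: a segment with a handful of samples can show a dramatic split that disappears once more traffic arrives.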

CrUX: The Public Field Dataset

You don’t always have your own RUM, and even when you do, CrUX is the field dataset Google grades you on. The Chrome User Experience Report aggregates field measurements from opted-in Chrome users into a public p75 per origin and (for popular URLs) per page, updated on a 28-day trailing window. It is the source behind the Search Console Core Web Vitals report and PageSpeed Insights’ “Field Data” panel.

Treat CrUX and your own beacons as complementary, not redundant:

CrUX is the assessment of record for the ranking signal — it is literally the dataset the pass/fail verdict is computed from. Your own RUM can disagree slightly because it samples differently and includes non-Chrome traffic.
CrUX is coarse and lagged. The 28-day window means a fix takes weeks to fully reflect, and low-traffic URLs fall back to origin-level data. Your own RUM gives you same-day, per-route resolution.
CrUX has no custom dimensions. You cannot segment CrUX by your own user cohorts, logged-in state, or experiment arm. Your beacons can — which is the whole reason to self-host an ingestion pipeline alongside CrUX.

Use CrUX to know whether you pass; use your own field data to know why and to move fast. Use the lab to gate the change that closes the gap.
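If you want the CrUX p75 in the same store as your own beacons, the CrUX API's `queryRecord` method returns it as JSON. The following is a minimal sketch assuming Node 18+ (for the global `fetch`) and an API key you supply; `queryCrux` and `p75FromRecord` are illustrative names, and error handling is kept to the bare minimum.

```javascript
const CRUX_ENDPOINT =
  'https://chromeuserexperience.googleapis.com/v1/records:queryRecord';

// Pull the p75 values out of a queryRecord response body.
function p75FromRecord(body) {
  const out = {};
  for (const [metric, data] of Object.entries(body.record?.metrics ?? {})) {
    const p75 = data.percentiles?.p75;
    // CLS p75 arrives as a string such as "0.08"; Number() normalizes both cases.
    // Metrics that only carry fractions (no percentiles) are skipped.
    if (p75 != null) out[metric] = Number(p75);
  }
  return out;
}

// apiKey is your own key; formFactor is 'PHONE', 'DESKTOP' or 'TABLET'.
async function queryCrux(apiKey, origin, formFactor = 'PHONE') {
  const res = await fetch(`${CRUX_ENDPOINT}?key=${encodeURIComponent(apiKey)}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ origin, formFactor })
  });
  if (!res.ok) throw new Error(`CrUX query failed: HTTP ${res.status}`);
  return p75FromRecord(await res.json());
}

// Usage (not executed here):
// queryCrux(process.env.CRUX_API_KEY, 'https://www.example.com').then(console.log);
```

Requesting a specific form factor keeps the result comparable with a lab run of the same form factor, which is the like-for-like rule from the triage workflow.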

Failure Modes and Gotchas

Reading a Lighthouse score as a field verdict. The 0–100 Performance score is a weighted lab composite under one profile. It is not the Core Web Vitals assessment and does not feed ranking. Never report “we pass CWV” from a Lighthouse number.
Missing INP in the lab and assuming it’s fine. A non-interactive synthetic run reports no INP. Absence of an INP value is not a passing INP — it is no measurement at all.
Comparing cold lab to warm field. Lab is first-visit cold-cache by default; field mixes warm and bfcache restores. A lab LCP worse than field p75 is often just this cache mismatch, not a real problem.
Cross-form-factor comparison. Holding a desktop lab run against a mobile-dominant field p75 manufactures a divergence that isn’t there. Always match form factor.
Single lab run treated as stable. Lab numbers have run-to-run variance from CPU scheduling and network jitter. Take a median of several runs (Lighthouse-CI does this) before diffing.
Beacon loss skewing the field. If your harness flushes only on unload, mobile and Safari sessions drop their beacons and your p75 is biased toward sessions that stayed. Finalize on visibilitychange/pagehide, as in the harness above.
Sampling bias. If your RUM samples by request rather than by session, heavy sessions are over-represented. Confirm your sampling strategy preserves the p75 before trusting it against CrUX.
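For the sampling-bias item, one way to sample by session rather than by request is to derive the decision from a stable session identifier, so every beacon from the same session gets the same answer. The sketch below uses a small FNV-1a hash; `inSample` and the session ID format are assumptions, and any stable, well-mixed hash would do.

```javascript
// FNV-1a 32-bit hash over UTF-16 code units: small, dependency-free, stable across runs.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Decide once per session: the same sessionId always maps to the same decision,
// so a session with many requests is counted in or out as a whole.
function inSample(sessionId, rate) {
  return fnv1a(sessionId) / 0x100000000 < rate;
}

const sessionId = 'session-7f3a';        // hypothetical identifier
console.log(inSample(sessionId, 0.1) === inSample(sessionId, 0.1)); // true
```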

CI/CD Gating: Use the Lab Where It’s Strong

The lab’s reproducibility is exactly what a regression gate needs, so put Lighthouse-CI in the pipeline and assert on the metrics it measures honestly — LCP, FCP, TTFB, CLS, total blocking time — while leaving INP to the field. Gate on a budget, run several iterations to suppress jitter, and fail the build on regression.

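# .github/workflows/perf.yml step
npm i -g @lhci/cli
lhci autorun --config=./lighthouserc.js

// lighthouserc.js: numeric budgets take the [level, { maxNumericValue }] form
module.exports = {
  ci: {
    collect: { numberOfRuns: 5, url: ['https://staging.example.com/'] },
    assert: {
      assertions: {
        'largest-contentful-paint': ['error', { maxNumericValue: 2500 }],
        'cumulative-layout-shift': ['error', { maxNumericValue: 0.1 }],
        'total-blocking-time': ['error', { maxNumericValue: 300 }]
      }
    }
  }
};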

Total Blocking Time is the right lab metric to gate as a stand-in for INP risk: it correlates with the main-thread congestion that hurts interactivity, and unlike INP it is measurable without a user. Gate TBT in CI to catch the regression early, then confirm the real INP p75 in the field after release. That split (lab budgets to stop regressions before merge, field p75 to confirm the user-facing verdict) is the entire reconciliation strategy in one sentence.
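For intuition about what the TBT budget is counting, here is a simplified sketch of the arithmetic: each main-thread task contributes whatever portion of its duration exceeds 50 ms. Lighthouse only counts tasks inside a window after First Contentful Paint, so this sketch assumes the durations passed in have already been filtered to that window.

```javascript
// Blocking time of one task is the part of its duration beyond the 50 ms long-task threshold.
function totalBlockingTime(taskDurationsMs) {
  return taskDurationsMs.reduce((sum, d) => sum + Math.max(0, d - 50), 0);
}

// Four hypothetical tasks: only the 120 ms and 250 ms tasks block.
console.log(totalBlockingTime([30, 120, 250, 49])); // 270
```

Two tasks of 120 ms and 250 ms therefore cost 70 ms and 200 ms respectively, which is why splitting one long task into several short ones lowers TBT even when the total work is unchanged.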

FAQ

Is a 100 Lighthouse score the same as passing Core Web Vitals?

No. The Lighthouse Performance score is a weighted composite of lab metrics measured under one emulated profile. The Core Web Vitals pass/fail verdict is computed from the 75th percentile of real-user field data (CrUX). A page can score 100 in the lab and still fail in the field, most often on INP, which a non-interactive lab run cannot measure.

Why is my field p75 worse than my Lighthouse result?

Because the lab tests one device and network profile, usually faster than your slowest 25% of users. Real traffic includes old CPUs, constrained networks, and un-blocked third-party scripts. The p75 captures that slow tail; the single lab point does not. A field/lab ratio above roughly 1.5 is normal for CPU-bound LCP, and for INP whenever a scripted lab interaction gives you a value to compare against.

Can I gate INP in CI from a synthetic run?

Not directly — a lab run with no real interaction produces no INP. Gate Total Blocking Time as a proxy for main-thread congestion in CI, then confirm the actual INP p75 from field data after the change ships.

What is the difference between CrUX and my own RUM data?

CrUX is Google’s public field dataset of opted-in Chrome users, aggregated to p75 over a trailing 28-day window — it is the dataset the ranking signal uses. Your own RUM is same-day, per-route, includes non-Chrome traffic, and supports custom segmentation. Use CrUX to know if you pass; use your own data to know why and to act fast.

When should I trust the lab over the field?

When you need a reproducible number to compare against — debugging a regression, gating a pull request, or A/B-testing an optimization in isolation. For any verdict about whether real users are actually having a good experience, the field p75 wins.

Core Web Vitals & Performance Metrics Fundamentals — the parent overview tying every metric and measurement method together.
RUM Data Sampling Strategies — how sampled beacons become the p75 that is the assessment of record.
Mapping Core Web Vitals to Conversion Rates — turning a slow field segment into a funded business case.
Web Vitals API Implementation — the PerformanceObserver and web-vitals library setup behind the field harness.
RUM Architecture, Tooling & Self-Hosting — the ingestion and storage side that complements CrUX with your own field data.