OpenTelemetry for Web RUM

OpenTelemetry gives browser Real-User Monitoring a vendor-neutral spine: instead of bolting a proprietary RUM agent onto every page, you emit traces and spans from the same SDK family your backend already speaks, then route everything through a single Collector. This page covers the OpenTelemetry JavaScript web SDK as a field-data pipeline — generating spans for navigation, resource, and user-interaction timing, mapping Core Web Vitals onto spans and span events, exporting over OTLP HTTP/protobuf, and keeping volume sane through sampling and batching. It sits inside the broader RUM Architecture, Tooling & Self-Hosting domain, and pairs with the self-hosted beacon collection and RUM data sampling practices that govern how that telemetry is ingested and aggregated at p75.

The promise of OpenTelemetry for the browser is correlation: a slow Largest Contentful Paint span can carry the same trace_id as the backend request that fed it, so a single trace spans the wire from the user’s main thread down to your database. The cost is operational discipline — the web SDK is not “drop in a snippet.” You own the exporter transport, the resource attributes, the sampling decision, and the cardinality budget, or your Collector drowns.

[Figure: OpenTelemetry web RUM pipeline. Browser SDK (WebTracerProvider: document-load spans, interaction spans, LCP / INP / CLS events) → batch → OTLP / HTTP protobuf → Collector (OTLP receiver, sampler + transform) → trace store (Tempo / ClickHouse) and p75 aggregate (dashboards / alerts). SDK-side sampling decides what leaves the browser; Collector-side sampling decides what reaches storage.]
The web SDK batches spans and exports OTLP/protobuf to a Collector, which samples and writes to a trace store plus a p75 aggregate.

Mapping Core Web Vitals onto the OTel data model

OpenTelemetry’s data model was built for server traces — a span is a unit of work with a start, an end, and a parent. Browser RUM bends slightly differently, and the bend is where teams get it wrong. There is no official @opentelemetry/instrumentation-web-vitals package, and there should not be: a Core Web Vital is not a unit of work, it is a measurement that finalizes late in the page lifecycle. You model it as either a zero-or-short-duration span or, more correctly, a span event attached to the page’s navigation span.

The two viable patterns:

  • One span per vital. Start a span named web_vital, set web.vital.name, web.vital.value, and web.vital.rating attributes, end it immediately. Simple, queryable, but it inflates span count by three-plus per page view and detaches the vital from its navigation context.
  • Span events on the navigation span. Keep a reference to the document-load span and call span.addEvent('lcp', { value, rating }) as each vital reports. Lower cardinality, preserves correlation, but requires the navigation span to stay open until Interaction to Next Paint finalizes — which can be the entire session.
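
The two patterns differ only in where the measurement is attached. A minimal sketch, assuming `tracer` and `navigationSpan` come from the OpenTelemetry API (`startSpan`, `addEvent`, and `end` are its standard calls) and that `metric` carries the `name`, `value`, and `rating` fields that web-vitals reports. The function names are illustrative, not part of any package:

```javascript
// Shared attribute payload so both patterns stay queryable under the same keys.
function vitalAttributes(metric) {
  return {
    'web.vital.name': metric.name,
    'web.vital.value': metric.value,
    'web.vital.rating': metric.rating,
  };
}

// Pattern 1: one short-lived span per vital, ended immediately.
function recordVitalAsSpan(tracer, metric) {
  const span = tracer.startSpan('web_vital', { attributes: vitalAttributes(metric) });
  span.end();
  return span;
}

// Pattern 2: an event on the navigation span, which must still be open.
function recordVitalAsEvent(navigationSpan, metric) {
  navigationSpan.addEvent(metric.name.toLowerCase(), vitalAttributes(metric));
}
```

Keeping the attribute keys identical across both patterns means a later switch from one to the other changes where a query looks, not what it looks for.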

The rating values must match the current Google spec exactly, because downstream dashboards key off them and a drifted threshold silently miscategorises the field. The table below is the contract every exporter and Collector transform should honour.

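| Metric | Good | Needs Improvement | Poor | Span attribute on capture |
|--------|------|-------------------|------|---------------------------|
| LCP | ≤ 2.5 s | > 2.5 s and ≤ 4.0 s | > 4.0 s | web.vital.name=LCP, value in ms |
| INP | ≤ 200 ms | > 200 ms and ≤ 500 ms | > 500 ms | web.vital.name=INP, value in ms |
| CLS | ≤ 0.1 | > 0.1 and ≤ 0.25 | > 0.25 | web.vital.name=CLS, unitless score |
| FCP | ≤ 1.8 s | > 1.8 s and ≤ 3.0 s | > 3.0 s | web.vital.name=FCP, value in ms |
| TTFB | ≤ 800 ms | > 800 ms and ≤ 1800 ms | > 1800 ms | web.vital.name=TTFB, value in ms |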
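
The thresholds above can be written down once and reused wherever a rating needs to be checked. The web-vitals library already supplies a `rating` on each metric, so a helper like this hypothetical `rateVital` is mainly useful for verifying that a Collector transform or dashboard query agrees with the same boundaries:

```javascript
// [goodMax, needsImprovementMax] per metric; times in ms, CLS unitless.
const VITAL_THRESHOLDS = {
  LCP: [2500, 4000],
  INP: [200, 500],
  CLS: [0.1, 0.25],
  FCP: [1800, 3000],
  TTFB: [800, 1800],
};

// Returns the same rating strings web-vitals uses.
function rateVital(name, value) {
  const bounds = VITAL_THRESHOLDS[name];
  if (!bounds) throw new Error(`Unknown vital: ${name}`);
  const [goodMax, needsImprovementMax] = bounds;
  if (value <= goodMax) return 'good';
  if (value <= needsImprovementMax) return 'needs-improvement';
  return 'poor';
}
```

Both boundaries are inclusive on the lower rating, so an LCP of exactly 2500 ms is still rated good.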

Engineering action follows the rating: a web.vital.rating=poor LCP span should carry enough resource-timing context to find the culprit, an INP poor rating should reference the interaction’s event.target and interactionId, and a CLS poor rating wants the shifting node’s bounding-box attributes. The vital’s value is the alert trigger; the span’s attribute payload is the debugging surface.

Resource attributes and the configuration surface

Every span the web SDK emits is stamped with a Resource — a set of attributes describing what produced the telemetry, as opposed to the span attributes describing what happened. Get this layer right once and your queries become trivial; get it wrong and you cannot segment a single dashboard. Resource attributes are low-cardinality by design (one set per page-view session), so they are the safe place for service.name, deployment.environment, and a build SHA. They are the wrong place for anything per-interaction.

The table below is the minimum production configuration surface for a browser web SDK, with the trade-off each knob controls.

| Setting | Where | Typical value | What it controls / failure if wrong |
|---------|-------|---------------|-------------------------------------|
| service.name | Resource | frontend-web-app | Dashboard scoping; missing it makes spans unattributable |
| deployment.environment | Resource | production | Splits prod from staging noise in one query |
| service.version | Resource | git SHA / release tag | Regression attribution to a specific deploy |
| OTLPTraceExporter.url | Exporter | https://collect.example.com/v1/traces | Where beacons land; CORS-gated cross-origin |
| exporter encoding | Exporter | HTTP/protobuf | Payload size; protobuf is ~30% smaller than JSON |
| BatchSpanProcessor.maxQueueSize | Processor | 2048 | Memory ceiling; overflow silently drops spans |
| BatchSpanProcessor.scheduledDelayMillis | Processor | 5000 | Flush cadence vs. battery/network cost |
| BatchSpanProcessor.maxExportBatchSize | Processor | 512 | Request size; must stay under Collector body limit |
| sampler | Provider | ParentBasedSampler(TraceIdRatioBasedSampler(0.1)) | Volume at source; head-based sampling decision |
| interaction eventNames | Instrumentation | click,keydown,pointerdown | Span cardinality; broad lists blow up volume |

Encode resource attributes as a Resource passed to the WebTracerProvider, not as per-span attributes. The OTLP HTTP/protobuf exporter is the default production choice over HTTP/JSON: the wire payload is smaller, which matters on mobile, and the Collector parses it faster under load. JSON is useful only when debugging by eye against a Collector’s debug exporter.

Production instrumentation snippet

The following is a complete, runnable browser-side setup against the SDK 2.x API: a WebTracerProvider with resource attributes, a head-based sampler, a BatchSpanProcessor feeding the OTLP/protobuf exporter, document-load and user-interaction instrumentation, fetch context propagation, and Core Web Vitals captured via the web-vitals library onto span events. The deeper configuration walkthrough lives in Configuring OpenTelemetry for Frontend Performance, and the vitals-as-spans modelling in Exporting Web Vitals as OpenTelemetry Spans.

import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { resourceFromAttributes } from '@opentelemetry/resources';
import {
  BatchSpanProcessor,
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto';
import { DocumentLoadInstrumentation } from '@opentelemetry/instrumentation-document-load';
import { UserInteractionInstrumentation } from '@opentelemetry/instrumentation-user-interaction';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { trace } from '@opentelemetry/api';
import { onLCP, onINP, onCLS, onFCP, onTTFB } from 'web-vitals';

const exporter = new OTLPTraceExporter({
  url: 'https://collect.example.com/v1/traces', // protobuf endpoint
  headers: {}, // CORS preflight must allow Content-Type on the collector
});

const provider = new WebTracerProvider({
  resource: resourceFromAttributes({
    'service.name': 'frontend-web-app',
    'service.version': window.__BUILD_SHA__ ?? 'dev',
    'deployment.environment': 'production',
  }),
  // Head-based sampling: keep 10% of traces, honour parent decision for joins.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
  spanProcessors: [
    new BatchSpanProcessor(exporter, {
      maxQueueSize: 2048,
      maxExportBatchSize: 512,
      scheduledDelayMillis: 5000,
    }),
  ],
});

provider.register();

registerInstrumentations({
  instrumentations: [
    new DocumentLoadInstrumentation(),
    new FetchInstrumentation({
      // Inject traceparent so frontend spans join backend traces.
      propagateTraceHeaderCorsUrls: [/https:\/\/api\.example\.com/],
    }),
    new UserInteractionInstrumentation({
      eventNames: ['click', 'keydown', 'pointerdown'],
      shouldPreventSpanCreation: (eventName, element) =>
        element.tagName === 'SCRIPT' || element.tagName === 'LINK',
    }),
  ],
});

// Attach Core Web Vitals as events on a long-lived page-view span.
const tracer = trace.getTracer('cwv', '1.0.0');
const pageView = tracer.startSpan('page_view', {
  attributes: { 'page.url': location.pathname },
});

function recordVital({ name, value, rating, navigationType }) {
  pageView.addEvent(name.toLowerCase(), {
    'web.vital.name': name,
    'web.vital.value': value,
    'web.vital.rating': rating,
    'web.vital.navigation_type': navigationType,
  });
}

onLCP(recordVital);
onINP(recordVital);
onCLS(recordVital);
onFCP(recordVital);
onTTFB(recordVital);

// Finalize on the terminal lifecycle event so late metrics (INP, CLS) land.
addEventListener('pagehide', () => {
  pageView.end();
  // shutdown() force-flushes the BatchSpanProcessor before teardown.
  provider.shutdown().catch(() => {});
}, { once: true });

Two lifecycle details carry the design. First, the page-view span is ended on pagehide, not beforeunload: beforeunload is unreliable on mobile Safari and breaks the back/forward cache. Second, provider.shutdown() force-flushes the batch queue; without it, the last batch of spans dies with the tab. For payloads exceeding the 64 KB sendBeacon ceiling, the OTLP exporter falls back to fetch with keepalive, which has its own browser-specific size caps, so keep batches small enough that a single flush stays under them.

Sampling: SDK versus Collector

Sampling is the single most consequential decision in a browser RUM pipeline, and OpenTelemetry forces you to choose where it happens. The two locations are not interchangeable, and the full reasoning lives in the RUM data sampling strategies reference.

SDK-side (head-based) sampling decides in the browser, before export, using TraceIdRatioBasedSampler. Its advantage is bandwidth: dropped traces never cross the network, never hit the Collector, never cost ingest. Its limitation is that the decision is made before you know whether the trace is interesting — you cannot keep “only the slow ones,” because the LCP value is not known when the trace starts. Wrap it in ParentBasedSampler so a backend-initiated trace’s sampling decision propagates, keeping frontend and backend halves of a joined trace consistent.
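
To make the head-based decision concrete, here is a simplified sketch of a deterministic ratio decision wrapped in a parent-based rule. It is not the SDK's actual sampler implementation, which may derive its value from the trace ID differently; `ratioDecision` and `shouldSample` are hypothetical names used only for illustration:

```javascript
// Simplified stand-in for a ratio sampler: map the first 32 bits of the
// trace ID onto [0, 2^32) and keep the trace if it falls below ratio * 2^32.
// The same trace ID always yields the same decision.
function ratioDecision(traceId, ratio) {
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < ratio * 2 ** 32;
}

// Parent-based wrapper: an existing parent decision wins; only root
// traces (no parent decision) fall through to the ratio decision.
function shouldSample(traceId, ratio, parentSampled) {
  if (typeof parentSampled === 'boolean') return parentSampled;
  return ratioDecision(traceId, ratio);
}
```

Note what the inputs are: a trace ID, a ratio, and an optional parent flag. Nothing about LCP or any other measurement is available at this point, which is why head-based sampling cannot favour slow page views.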

Collector-side sampling sees the whole picture. A probabilistic_sampler processor thins uniformly, while a tail_sampling processor can retain every trace whose LCP span exceeds 2.5 s and downsample the fast ones. The cost is that every span reaches the Collector first — you pay the network and ingest before you discard.
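
The tail-based decision can be illustrated in a few lines. The Collector's tail_sampling processor is configured declaratively rather than coded, so the sketch below only shows the logic, assuming vitals are stored as span events with the `web.vital.*` attributes used elsewhere on this page. `keepTrace`, `fastKeepRatio`, and the span shape are hypothetical:

```javascript
// spans: all spans of one completed trace, each shaped like
//   { events: [{ name, attributes }] }
// random is injectable so the decision can be tested deterministically.
function keepTrace(spans, options = {}) {
  const { lcpThresholdMs = 2500, fastKeepRatio = 0.1, random = Math.random } = options;

  // Keep every trace that contains an LCP measurement above the threshold.
  const slow = spans.some((span) =>
    (span.events ?? []).some(
      (event) =>
        event.attributes?.['web.vital.name'] === 'LCP' &&
        event.attributes['web.vital.value'] > lcpThresholdMs,
    ),
  );
  if (slow) return true;

  // Downsample the fast traces.
  return random() < fastKeepRatio;
}
```

The function needs the whole trace as input, which is exactly the cost described above: every span has to arrive before the decision can be made.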

The production pattern for a p75-driven RUM dashboard is a low head-based rate to protect the wire, plus tail-based retention at the Collector to guarantee the slow tail survives — because p75 and worse are exactly the buckets you cannot afford to lose. The Collector configuration below pairs a probabilistic floor with PII stripping before storage.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins:
            - https://www.example.com

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  probabilistic_sampler:
    sampling_percentage: 50
  transform:
    trace_statements:
      - context: span
        statements:
          - delete_key(attributes, "user.email")
          - set(attributes["page.url"], SHA256(attributes["page.url"]))

exporters:
  clickhouse:
    endpoint: tcp://analytics-db:9000
    database: rum_metrics
    ttl: 720h

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, transform, batch]
      exporters: [clickhouse]
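
When head-based and Collector-side sampling are stacked, the stored volume is the product of the two rates. A small sketch of that arithmetic, assuming both stages sample uniformly and independently; the helper names are illustrative:

```javascript
// headRatio: SDK sampler ratio (0..1).
// collectorPercent: probabilistic_sampler sampling_percentage (0..100).
function effectiveSampleRate(headRatio, collectorPercent) {
  return headRatio * (collectorPercent / 100);
}

// Scale a stored row count back up to an estimate of real traffic.
function estimateTotal(storedCount, headRatio, collectorPercent) {
  return Math.round(storedCount / effectiveSampleRate(headRatio, collectorPercent));
}
```

With the 0.1 head ratio from the SDK snippet and the 50% Collector rate in this config, roughly 5% of page views reach storage, so a stored count of 1,200 corresponds to about 24,000 real page views. Under uniform sampling the percentile estimates themselves need no correction, but any raw count() in a dashboard does.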

Debugging workflow

When a Core Web Vital regresses in the field, the OTel pipeline turns a vague “the site feels slow” into a trace you can step through. The workflow:

  1. Identify the outlier. Query the trace store for spans where web.vital.name = 'LCP' AND web.vital.value > 2500, grouped by service.version to pin it to a release.
  2. Trace the waterfall. Open one outlier trace and read the document-load and resource spans in order. The OTel resource-timing attributes (http.url, duration) expose whether the delay is TTFB on the wire or a render-blocking asset.
  3. Correlate overlaps. Because FetchInstrumentation injected traceparent, join the frontend span to the backend span by trace_id and see whether the slow segment is client-side or server-side.
  4. Validate in lab. Reproduce with the same URL under a throttled profile to confirm the field signal is real and not a one-off network event.
  5. Deploy the fix. Ship the change behind the same service.version stamping so the next query can measure the delta.
  6. Monitor the delta. Watch the p75 LCP for the new version against the old; a clean fix moves the p75, not just the mean.
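
Steps 1 and 6 both reduce to a grouped percentile. A minimal sketch of that computation over rows already pulled from the trace store, assuming a hypothetical row shape of `{ version, lcpMs }` and the nearest-rank percentile definition (your store's quantile function may interpolate differently):

```javascript
// Nearest-rank percentile; returns NaN for an empty sample.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// rows: [{ version, lcpMs }]; returns Map(version -> p75 LCP in ms).
function p75ByVersion(rows) {
  const byVersion = new Map();
  for (const { version, lcpMs } of rows) {
    if (!byVersion.has(version)) byVersion.set(version, []);
    byVersion.get(version).push(lcpMs);
  }
  return new Map(
    [...byVersion].map(([version, values]) => [version, percentile(values, 0.75)]),
  );
}
```

Comparing the map entry for the new service.version against the previous one gives the delta that step 6 watches.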

The discipline that makes this work is consistent attribute naming. If half your spans say web.vital.value and half say webVitalValue, every query becomes a coalesce, and the debugging loop stalls on data hygiene instead of root cause.

Field-data analysis patterns

Aggregate RUM telemetry only becomes actionable when segmented. A single global p75 LCP hides the fact that your high-end-desktop p75 is 1.9 s and your low-end-Android-on-3G p75 is 5.4 s — averaging them produces a number that describes nobody. Stamp every page-view span with the dimensions you intend to segment on, then slice.

The three segmentation axes that earn their cardinality budget:

  • Device class: derive a coarse tier (high/mid/low) from navigator.hardwareConcurrency and deviceMemory, not the raw user-agent string. Low-tier devices dominate the poor tail of INP and LCP.
  • Network type: navigator.connection.effectiveType (4g/3g/slow-2g) explains most TTFB and LCP divergence. Sample it once per page view.
  • Geography: derive region at the Collector from the request IP, never ship raw IP in a span attribute (a privacy violation and a cardinality bomb).

-- p75 LCP by device tier and region, last 7 days
SELECT
  span_attributes['device.tier']  AS device_tier,
  resource_attributes['geo.region'] AS geo_region,
  quantile(0.75)(toFloat64OrZero(span_attributes['web.vital.value'])) AS p75_lcp_ms,
  count() AS page_views
FROM otel_traces
WHERE timestamp >= now() - INTERVAL 7 DAY
  AND span_attributes['web.vital.name'] = 'LCP'
GROUP BY device_tier, geo_region
ORDER BY p75_lcp_ms DESC;
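
The device.tier attribute the query groups on has to be computed in the browser. One possible derivation is sketched below; the tier cut-offs, the fallback values, and the `network.effective_type` attribute name are all assumptions to tune against your own traffic, and neither `deviceMemory` nor `navigator.connection` is exposed by every browser:

```javascript
// nav: a Navigator-like object; pass `navigator` in the browser.
function deviceTier(nav) {
  const cores = nav.hardwareConcurrency ?? 4; // assumed default when unavailable
  const memoryGb = nav.deviceMemory ?? 4;     // assumed default when unavailable
  if (cores <= 2 || memoryGb <= 2) return 'low';
  if (cores >= 8 && memoryGb >= 8) return 'high';
  return 'mid';
}

// Low-cardinality segmentation attributes for the page-view span.
function segmentAttributes(nav) {
  return {
    'device.tier': deviceTier(nav),
    'network.effective_type': nav.connection?.effectiveType ?? 'unknown',
  };
}
```

In the browser this would be called once per page view, for example `pageView.setAttributes(segmentAttributes(navigator))`, so that every vital recorded on that span inherits the segment.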

The divergences worth alerting on are not absolute values but gaps: when low-tier-device p75 LCP pulls away from high-tier by more than a fixed margin release-over-release, a change has disproportionately hurt the constrained segment, even if the global p75 looks flat. That gap is invisible without per-segment attributes on every span.
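
A gap-based alert is a comparison of two differences. A minimal sketch, assuming per-tier p75 values have already been computed for each release; the 300 ms margin is a placeholder, not a recommendation:

```javascript
// p75s: { low: number, high: number } p75 LCP in ms for one release.
function tierGap(p75s) {
  return p75s.low - p75s.high;
}

// True when the low-vs-high gap widened by more than marginMs
// between the previous release and the current one.
function gapRegressed(previous, current, marginMs = 300) {
  return tierGap(current) - tierGap(previous) > marginMs;
}
```

For example, if the previous release had a low-tier p75 of 4000 ms against 1900 ms for high-tier, and the current one has 4600 ms against 1900 ms, the gap widened by 600 ms and the check fires even though the high-tier number did not move.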

Failure modes and gotchas

The web SDK has a set of recurring production failures that have nothing to do with your application code.

  • CORS to the Collector. The browser sends an OPTIONS preflight to the OTLP endpoint, and if the Collector’s cors.allowed_origins does not list your site origin, every export fails silently — spans batch, flush, and 403 into the void. Confirm the preflight passes before trusting any dashboard. This is the single most common “no data” cause.
  • Clock skew. Span timestamps come from the client’s wall clock, which can be minutes off. A device with a wrong clock produces spans that land in the wrong time bucket or appear to start before their parent. Where ordering matters, prefer performance.now()-derived durations over absolute timestamps, and let the Collector treat client time as advisory.
  • Span cardinality blowup. A UserInteractionInstrumentation with a broad eventNames list (adding mousemove, scroll, pointermove) generates thousands of spans per session. Combined with a high-cardinality attribute like a raw URL or an interaction target selector, this explodes both browser memory and Collector ingest cost. Keep eventNames tight and never put unbounded values in attributes.
  • SPA route transitions. Document-load instrumentation fires once on hard navigation; client-side route changes produce no document-load span. For an SPA you must manually open a span on route change and re-arm the web-vitals listeners, or every metric after the first navigation goes uncaptured.
  • Safari PerformanceObserver gaps. Safari’s PerformanceObserver support for some entry types lags Chromium, so certain document-load attributes are absent on WebKit. Treat missing attributes as expected, not as data loss, and segment Safari separately when comparing.
  • Background-tab suspension. A backgrounded tab can be frozen before the BatchSpanProcessor flushes, losing the queued batch. The pagehide flush in the snippet above is the mitigation; relying on scheduledDelayMillis alone loses the tail.
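
For the SPA case, the manual span handling can be as small as the sketch below. It assumes a `tracer` obtained from the OpenTelemetry API; the `route_change` span name, the `page.route` attribute, and the tracker itself are hypothetical, and the router hook that calls `onRouteChange` is framework-specific and not shown. Re-arming vitals collection for soft navigations is also left out:

```javascript
// Keeps exactly one open span per client-side route.
function createRouteSpanTracker(tracer) {
  let current = null;
  return {
    // Call from the router's navigation hook with a normalized route.
    onRouteChange(route) {
      if (current) current.end();
      current = tracer.startSpan('route_change', {
        attributes: { 'page.route': route },
      });
      return current;
    },
    // Call on pagehide so the last route span is closed before flushing.
    finish() {
      if (current) {
        current.end();
        current = null;
      }
    },
  };
}
```

Passing a route pattern rather than the raw URL keeps the attribute bounded, in line with the cardinality warning above.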

CI/CD integration

Field telemetry is a lagging indicator — by the time a regression shows in p75 it already shipped. Gate the metric earlier by running the same web SDK in your synthetic CI step against a Collector with a debug exporter, then asserting on the captured spans. A lab run that produces an LCP span above 2.5 s, or that produces zero web_vital events (meaning the instrumentation silently broke), should fail the build.

#!/usr/bin/env bash
# ci-otel-gate.sh — fail the build on a regressed or missing CWV span.
set -euo pipefail

# Run the synthetic page through a headless browser that exports to a local
# collector writing JSON lines to /tmp/spans.jsonl, then assert.
node ./scripts/run-synthetic-pageview.mjs --out /tmp/spans.jsonl

LCP_MS=$(jq -r 'select(.name=="lcp") | .attributes["web.vital.value"]' /tmp/spans.jsonl | sort -n | tail -1)

if [ -z "${LCP_MS:-}" ]; then
  echo "FAIL: no LCP span captured — OTel instrumentation is broken"
  exit 1
fi

# Threshold gate: 2500 ms is the Good/NI boundary for LCP.
if [ "$(printf '%.0f' "$LCP_MS")" -gt 2500 ]; then
  echo "FAIL: LCP span ${LCP_MS}ms exceeds 2500ms budget"
  exit 1
fi

echo "PASS: LCP span ${LCP_MS}ms within budget"

The “missing span” assertion matters as much as the threshold: an instrumentation that exports nothing passes a naive threshold check trivially. Gate on presence first, value second.

FAQ

Should Core Web Vitals be spans or span events in OpenTelemetry?

Prefer span events attached to a long-lived page-view span. A vital is a measurement that finalizes late in the lifecycle, not a unit of work, so modelling it as a standalone span inflates count and detaches it from navigation context. A span event keeps the vital correlated with its trace at lower cardinality. Use standalone web_vital spans only when your query layer cannot read events efficiently.

Is there an official OpenTelemetry web-vitals instrumentation package?

No. There is no @opentelemetry/instrumentation-web-vitals. Capture vitals with the web-vitals library and record them as span events or attributes yourself. This is intentional — the official browser instrumentations cover document-load, fetch/XHR, and user interaction, leaving CWV modelling to you.

Should I sample in the SDK or in the Collector?

Both. Use a low head-based TraceIdRatioBasedSampler in the SDK to protect bandwidth, wrapped in ParentBasedSampler so joined backend traces stay consistent. Then use tail-based sampling at the Collector to guarantee the slow p75-and-worse tail survives, since head-based sampling cannot keep “only the slow ones” — the LCP value is unknown when the trace starts.

Why do my browser spans never reach the Collector?

The most common cause is CORS: the browser’s OPTIONS preflight to the OTLP endpoint fails because the Collector’s cors.allowed_origins does not list your site origin, and spans 403 silently. The second most common is a missing pagehide flush, which loses the final batch when the tab closes. Verify the preflight succeeds and that provider.shutdown() runs on the terminal lifecycle event.

How do I keep span cardinality from exploding?

Keep UserInteractionInstrumentation.eventNames tight — click, keydown, pointerdown only, never mousemove or scroll. Never place unbounded values (raw URLs, full CSS selectors, IP addresses) in span attributes; hash or bucket them. Put low-cardinality context in resource attributes and reserve span attributes for genuinely per-event data.
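
As one way to bucket raw URLs before they become attribute values, the sketch below replaces identifier-like path segments with placeholders. The patterns and placeholder names are assumptions; adapt them to the identifiers that actually appear in your routes:

```javascript
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC = /^\d+$/;
const LONG_HEX = /^[0-9a-f]{16,}$/i;

// Collapse identifier-like path segments so the attribute has a bounded
// set of values, e.g. one value per route pattern instead of one per user.
function normalizeRoute(pathname) {
  return pathname
    .split('/')
    .map((segment) => {
      if (UUID.test(segment)) return ':uuid';
      if (NUMERIC.test(segment)) return ':id';
      if (LONG_HEX.test(segment)) return ':hash';
      return segment;
    })
    .join('/');
}
```

Applied to `location.pathname` before it is set on a span, this turns `/users/12345/orders` into `/users/:id/orders`.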