SpeedCurve vs Custom RUM
Choosing between a managed Real-User Monitoring vendor and a self-hosted stack is one of the higher-leverage architecture decisions a performance team makes, because it locks in your cost curve, your data residency posture, and how fast you can ship custom metrics for years. This page treats the question as a build-vs-buy decision within RUM Architecture, Tooling & Self-Hosting, using SpeedCurve as the managed exemplar against a self-hosted pipeline, and gives you a dimension-by-dimension decision matrix, a total-cost-of-ownership worked example, and a concrete migration path for the day the managed bill outgrows the value.
The honest framing: managed RUM buys you time-to-value and zero operational burden; self-hosting buys you data ownership, unbounded cardinality, and a fixed-ish cost at high volume. Most teams should start managed and only move when a specific constraint — cost at scale, a residency mandate, or a custom-metric ceiling — forces the issue. The matrix below makes those constraints explicit so the decision is evidence-driven rather than ideological.
The decision matrix
The core of any build-vs-buy evaluation is a dimension-by-dimension matrix. Each row names a force that pushes a team toward managed or self-hosted, with a concrete “use when” trigger so the matrix doubles as a checklist. SpeedCurve stands in here for the managed category broadly; the same axes apply to any vendor you assess in the wider RUM Vendor Comparison.
| Dimension | SpeedCurve / managed | Self-hosted stack | Use managed when… |
|---|---|---|---|
| Cost at scale | Billed per session/page view; the bill scales roughly linearly with traffic, so every volume spike flows straight through to cost | Mostly fixed infra (compute + columnar storage); marginal cost per beacon approaches zero | Traffic is modest or bursty and you cannot staff ops |
| Data ownership / residency | Vendor controls routing and region; residency is a plan/contract feature | You pin the region, encryption keys, and PII lifecycle end-to-end | No regulatory residency mandate exists |
| Time-to-value | Hours: paste a snippet, dashboards populate same day | Weeks: ingestion endpoint, schema, aggregation, dashboards all built | You need signal this quarter, not this half |
| Custom metrics | Bounded by the vendor metric dictionary and custom-data limits | Arbitrary key-value attributes, trace IDs, business dimensions | Standard Core Web Vitals plus a handful of marks suffice |
| Retention | Plan-tiered, often 6–13 months; longer costs more | You set hot/warm/cold tiers; multi-year archive is cheap on object storage | You do not need year-over-year field-data comparison |
| Alerting | Built-in budget alerts and regression detection out of the box | You wire Grafana/Alertmanager rules against your own percentiles | You lack capacity to own alerting infrastructure |
| Engineering burden | Near-zero; vendor owns uptime, scaling, upgrades | Continuous: you own ingestion uptime, schema migrations, query tuning | Your team is small and performance is one of many duties |
Read the matrix as a scorecard, not a verdict. If most rows resolve to “use managed,” stay managed. If three or more rows hit a hard constraint — a residency mandate, a cost cliff, a custom-metric ceiling — the engineering burden of self-hosting becomes the cheaper path. The two rows that flip the decision most often in practice are cost at scale and data residency, and those are exactly the two the TCO example below quantifies.
Threshold configuration both sides must honor
Whichever side you land on, the platform has to score Core Web Vitals, plus the FCP and TTFB diagnostics that sit alongside them, against the current Google thresholds at the same percentile, or the comparison is meaningless. Configure budgets and alert rules to these bands. SpeedCurve ships them as defaults; a self-hosted stack encodes them in your alerting queries.
| Metric | Good (p75) | Needs Improvement (p75) | Poor (p75) | Engineering action on regression |
|---|---|---|---|---|
| LCP | ≤ 2.5 s | ≤ 4.0 s | > 4.0 s | Audit hero resource priority, preload, server TTFB contribution |
| INP | ≤ 200 ms | ≤ 500 ms | > 500 ms | Break up long tasks, yield to the scheduler, defer third parties |
| CLS | ≤ 0.1 | ≤ 0.25 | > 0.25 | Reserve space with aspect-ratio, audit dynamic injections |
| FCP | ≤ 1.8 s | ≤ 3.0 s | > 3.0 s | Cut render-blocking CSS/JS, inline critical styles |
| TTFB | ≤ 800 ms | ≤ 1.8 s | > 1.8 s | Add edge caching, shrink origin work, fix cold starts |
The non-negotiable detail: aggregate at p75, never at the mean. A managed dashboard that headlines an average is hiding the tail-end users who actually fail the threshold. On the self-hosted side you make this explicit in the aggregation query; on the managed side you confirm the dashboard’s primary stat is the 75th percentile before you trust any green badge.
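To make the gap between the mean and p75 concrete, here is a minimal sketch on a made-up sample of LCP values. The helper names (`percentile`, `mean`) and the sample data are illustrative assumptions; a production store would normally use an approximate quantile estimator rather than a full sort.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of values at or below it.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Hypothetical LCP samples in ms: six fast loads and four slow ones.
const lcpSamples = [900, 950, 1000, 1000, 1050, 1100, 4400, 4500, 4500, 4600];

const lcpMean = mean(lcpSamples);          // 2400 ms, which would read as Good
const lcpP75 = percentile(lcpSamples, 75); // 4500 ms, which is Poor
```

In this sample the mean sits under the 2.5 s Good band while p75 lands in Poor, because 40% of loads are slow and the average blends them away.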
Measurement implementation on each side
The managed path is a snippet drop. SpeedCurve’s RUM snippet captures the metrics and ships beacons for you; your only job is to confirm it reports the lifecycle-final value. The self-hosted path means you own that capture. Either way, the canonical capture uses the web-vitals library (built on PerformanceObserver) with a lifecycle-safe flush so the final INP and CLS values survive tab close.
import { onLCP, onINP, onCLS, onFCP, onTTFB } from 'web-vitals';

// One queue, flushed whenever the page is hidden. Works for both a managed
// vendor endpoint and your own self-hosted ingestion endpoint.
// Keyed by metric.id so a later report for the same metric replaces the
// earlier entry instead of queuing a duplicate.
const queue = new Map();
const ENDPOINT = '/rum/beacon'; // swap for vendor URL when managed

function report(metric) {
  queue.set(metric.id, {
    name: metric.name,
    value: metric.value, // p75-eligible final value
    id: metric.id,
    rating: metric.rating, // 'good' | 'needs-improvement' | 'poor'
    nav: metric.navigationType,
    path: location.pathname,
    ua: navigator.userAgent,
    ts: Date.now(),
  });
}

[onLCP, onINP, onCLS, onFCP, onTTFB].forEach((fn) =>
  fn(report, { reportAllChanges: false })
);

function flush() {
  if (!queue.size) return;
  const body = JSON.stringify([...queue.values()]);
  queue.clear();
  // sendBeacon survives unload; falls back to keepalive fetch
  if (!navigator.sendBeacon(ENDPOINT, body)) {
    fetch(ENDPOINT, { body, method: 'POST', keepalive: true });
  }
}

// INP/CLS keep changing until the page is hidden, so finalize then.
// visibilitychange is dispatched on document; listen there directly.
document.addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'hidden') flush();
});
addEventListener('pagehide', flush);
The only line that differs between managed and self-hosted is ENDPOINT. That symmetry is what makes the migration path below cheap: you can dual-write to both targets from the same capture code. On the self-hosted side the endpoint feeds a self-hosted beacon collection pipeline that validates, strips PII, and writes to columnar storage.
-- Self-hosted aggregation: p75 per day/country/device, computed
-- incrementally so it stays correct as new beacons arrive.
CREATE MATERIALIZED VIEW cwv_daily_p75
ENGINE = AggregatingMergeTree()
ORDER BY (event_date, country_code, device_tier)
AS SELECT
    toDate(ts) AS event_date,
    geo_country AS country_code,
    device_class AS device_tier,
    quantileState(0.75)(lcp_ms) AS lcp_p75_state,
    quantileState(0.75)(inp_ms) AS inp_p75_state,
    quantileState(0.75)(cls_score) AS cls_p75_state,
    -- Kept as an aggregate state: a plain count() column is not summed
    -- when AggregatingMergeTree merges parts, so it would undercount.
    countState() AS sessions_state
FROM rum_beacon_raw
WHERE lcp_ms > 0
GROUP BY event_date, country_code, device_tier;

-- Read back: merge the partial states into final p75 numbers.
SELECT
    event_date,
    country_code,
    device_tier,
    round(quantileMerge(0.75)(lcp_p75_state)) AS lcp_p75,
    round(quantileMerge(0.75)(inp_p75_state)) AS inp_p75,
    round(quantileMerge(0.75)(cls_p75_state), 3) AS cls_p75,
    countMerge(sessions_state) AS sessions
FROM cwv_daily_p75
WHERE event_date >= today() - 7
GROUP BY event_date, country_code, device_tier
ORDER BY event_date DESC, lcp_p75 DESC;
A TCO worked example
Cost at scale is the row that flips most decisions, so quantify it. Assume a site at 50 million page views/month with a 25% RUM sampling rate, and treat each sampled page view as one billed session (12.5M billed sessions/month). Managed vendors bill on billed sessions or beacons; self-hosted bills on compute and storage. The numbers below are illustrative round figures for modeling, not a quote, so plug your own rates in.
| Cost component | Managed (SpeedCurve-class) | Self-hosted (ClickHouse on cloud) |
|---|---|---|
| Per-volume fee | ~$0.30 per 1k billed sessions × 12.5M = ~$3,750/mo | $0 marginal per beacon |
| Compute | included | 3× mid-tier nodes ≈ $900/mo |
| Storage (raw + aggregates, 13-mo retention) | included | object + block storage ≈ $350/mo |
| Ingestion endpoint / edge | included | workers/CDN ≈ $120/mo |
| Engineering carry | ~0.1 FTE oversight ≈ $1,500/mo loaded | ~0.4 FTE ≈ $6,000/mo loaded |
| Effective monthly total | ~$5,250 | ~$7,370 (mostly labor) |
At 50M page views the managed bill is lower once you load in the engineering carry — self-hosting is dominated by the 0.4 FTE, not the infra. The crossover happens when volume climbs: the managed per-session fee scales linearly with traffic while the self-hosted infra grows sublinearly and the FTE stays roughly fixed. Re-run the same model at 500M page views/month and the managed per-volume fee balloons to ~$37,500/mo while self-hosted infra rises to perhaps ~$3,000/mo with the same 0.4 FTE — self-hosting wins by a wide margin. The lesson: TCO is a function of volume, and the right answer at 50M is the wrong answer at 500M.
The break-even is roughly where managed_per_session_fee × billed_sessions exceeds fixed_infra + loaded_FTE_cost. Compute your own crossover before committing — most teams discover they are years away from it, which is itself a finding.
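The break-even rule can be turned into a small calculator. This is a sketch under the illustrative rates from the table, with self-hosted infra held flat at its 50M-page-view figure (a simplification, since infra grows somewhat with volume); the function and parameter names are assumptions made for this example.

```javascript
// Monthly managed cost: per-volume fee plus any fixed oversight cost you choose to load in.
function managedMonthly(billedSessions, { feePer1k = 0.30, managedFixed = 0 } = {}) {
  return (billedSessions / 1000) * feePer1k + managedFixed;
}

// Monthly self-hosted cost: infra (compute + storage + edge) plus loaded labor.
function selfHostedMonthly({ infra = 900 + 350 + 120, labor = 6000 } = {}) {
  return infra + labor;
}

// Billed sessions per month at which the two monthly totals are equal.
function breakEvenSessions({ feePer1k = 0.30, managedFixed = 0, infra = 1370, labor = 6000 } = {}) {
  return ((infra + labor - managedFixed) / feePer1k) * 1000;
}

// Convert billed sessions back to page views for a given sampling rate.
function pageViewsFor(billedSessions, samplingRate = 0.25) {
  return billedSessions / samplingRate;
}
```

With the table's figures and no managed oversight cost, `breakEvenSessions()` comes out near 24.6 million billed sessions a month, or roughly 98 million page views at 25% sampling. Passing a non-zero `managedFixed` lowers the break-even, because managed then carries a fixed cost of its own.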
Migration path: managed to self-hosted
When the TCO crossover arrives, you migrate without a flag day by running both pipelines in parallel and cutting over only once the numbers reconcile.
- Stand up the self-hosted endpoint behind a second beacon target. Deploy the ingestion endpoint, schema, and the cwv_daily_p75 view, but send it no production traffic yet.
- Dual-write from the capture layer. Change flush() to sendBeacon to both the vendor URL and your endpoint. This is a one-line fan-out; both receive identical lifecycle-final values, so the datasets are directly comparable.
- Reconcile p75 across both systems for two to four weeks. Compare your ClickHouse p75 against the vendor dashboard per segment. A persistent gap means a sampling or attribution mismatch, not random noise, and must be root-caused before cutover.
- Rebuild alerting on the self-hosted side. Port every budget alert and regression rule to your own queries; verify they fire on a synthetic regression.
- Cut dashboards over, keep the vendor warm. Point teams at the self-hosted dashboards while the vendor still receives beacons as a safety net.
- Decommission the vendor beacon. Once a full release cycle passes with self-hosted parity, drop the vendor URL from the fan-out and downgrade or cancel the plan.
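The dual-write step might look like the sketch below. Both target URLs are placeholders, the `fanOut` name is hypothetical, and the sketch assumes the vendor accepts the same JSON payload as your own endpoint, which you would need to confirm against the vendor's ingestion format.

```javascript
// Placeholder targets: neither URL is a real endpoint.
const TARGETS = ['https://rum.vendor.example/beacon', '/rum/beacon'];

// Sends one payload to every target and returns the targets that accepted it.
// `send` defaults to navigator.sendBeacon in the browser and can be injected for tests.
function fanOut(body, targets = TARGETS, send = (url, data) => navigator.sendBeacon(url, data)) {
  const accepted = [];
  for (const url of targets) {
    let ok = false;
    try {
      ok = send(url, body);
    } catch {
      ok = false; // a failure on one target must not block the other
    }
    if (ok) accepted.push(url);
  }
  return accepted;
}
```

Inside `flush()`, the single `sendBeacon` call would become `fanOut(body)`. Decommissioning the vendor later is then a one-element change to `TARGETS`.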
The dual-write window is the whole trick: it turns a risky rip-and-replace into a measured reconciliation, and it gives you a defensible artifact (two systems agreeing at p75) to justify the cutover to stakeholders.
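Reconciliation itself can be a simple per-segment comparison. The sketch below assumes you have exported p75 per segment from both systems into plain objects; the `reconcile` name, the segment-key format, and the 5% relative tolerance are all assumptions to adjust.

```javascript
// Flags segments whose p75 differs between the two systems by more than a relative tolerance.
// Inputs are plain objects keyed by segment, e.g. { 'US|mobile': 2400 }.
function reconcile(vendorP75, selfHostedP75, tolerance = 0.05) {
  const report = [];
  const segments = new Set([...Object.keys(vendorP75), ...Object.keys(selfHostedP75)]);
  for (const segment of segments) {
    const vendor = vendorP75[segment];
    const selfHosted = selfHostedP75[segment];
    if (vendor === undefined || selfHosted === undefined) {
      // A segment present on only one side is a capture gap, not agreement.
      report.push({ segment, status: 'missing', vendor, selfHosted });
      continue;
    }
    const denominator = Math.max(vendor, selfHosted) || 1; // avoid 0/0 when both are zero
    const gap = Math.abs(vendor - selfHosted) / denominator;
    report.push({ segment, status: gap <= tolerance ? 'ok' : 'mismatch', gap });
  }
  return report;
}
```

A segment that stays in `mismatch` across the whole window is the persistent gap to root-cause before cutover.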
Debugging and evaluation workflow
Whether you are evaluating a vendor trial or debugging your own stack, the workflow is the same: prove the platform reports what the browser actually experienced.
- Identify the regressed segment. Filter p75 by metric and find the segment that moved — e.g. INP p75 up 80 ms on Android in one region.
- Trace the waterfall. On the self-hosted side, join longtask and event timing entries; in a managed tool, open its session waterfall for the affected segment.
- Correlate overlaps. Map the regression onto a deploy, a third-party script change, or a CDN cache-hit drop (X-Cache-Status).
- Validate in the lab. Reproduce with a throttled profile so you have a controllable signal, not just field aggregates.
- Deploy the fix behind your normal release process.
- Monitor the delta. Watch p75 in the affected segment over the next 24–72 hours; field data lags, so do not declare victory on hour one.
When evaluating SpeedCurve or any managed tool, run step 2 as an acceptance test: feed it a known synthetic regression and confirm its waterfall and attribution surface the cause. A tool that cannot trace a regression you induced will not trace the ones you did not.
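The correlation step in the workflow above can be partly automated if you keep a log of change events. This is a minimal sketch: the event shape (`type`, `label`, `at` as epoch milliseconds), the `candidateCauses` name, and the six-hour lookback window are assumptions, not a format any particular tool emits.

```javascript
// Returns change events that landed within `windowMs` before the regression began,
// most recent first. Events after the regression started are excluded.
function candidateCauses(regressionStart, events, windowMs = 6 * 60 * 60 * 1000) {
  return events
    .filter((e) => e.at <= regressionStart && regressionStart - e.at <= windowMs)
    .sort((x, y) => y.at - x.at);
}

// Hypothetical change log for one day.
const changeLog = [
  { type: 'cdn', label: 'cache rule change', at: Date.parse('2024-04-30T09:00:00Z') },
  { type: 'deploy', label: 'release 412', at: Date.parse('2024-05-01T10:00:00Z') },
  { type: 'third-party', label: 'tag manager update', at: Date.parse('2024-05-01T13:30:00Z') },
  { type: 'deploy', label: 'release 413', at: Date.parse('2024-05-01T15:00:00Z') },
];

const suspects = candidateCauses(Date.parse('2024-05-01T14:00:00Z'), changeLog);
```

Here `suspects` holds the tag manager update followed by release 412. The list is a starting point for the lab reproduction, not proof of cause.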
Segmentation that exposes the real distribution
Headline p75 hides as much as it reveals. Segment on both sides identically so a vendor and your own stack can be compared.
| Segment axis | Why it matters | Divergence to watch |
|---|---|---|
| Device class (navigator.hardwareConcurrency, deviceMemory) | Low-end CPUs dominate INP regressions | Mobile p75 far above desktop signals main-thread cost |
| Network type (Effective Connection Type) | TTFB and LCP track connection quality | 4G/3G p75 spread reveals payload-weight problems |
| Geography (region / ASN) | Edge coverage and origin distance drive TTFB | One region’s TTFB p75 spiking points at a cache miss |
| Navigation type | Hard nav vs SPA route change behave differently | SPA INP divergence isolates client-side routing cost |
The single most useful divergence is mobile-vs-desktop INP p75. A wide gap is almost always main-thread work, and it points straight at long-task and scheduler fixes covered on the INP tracking and debugging side.
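Deriving the device and network axes on the client could look like the sketch below. The `segmentOf` name and the cut-offs (4 cores or 4 GB and under counts as low-end) are assumptions to tune against your own distribution. `deviceMemory` and `connection.effectiveType` are only exposed in Chromium-based browsers, so the sketch falls back to `'unknown'` rather than guessing.

```javascript
// Buckets a client into coarse segments from navigator-like hints.
// Takes the navigator object as a parameter so it can be exercised outside a browser.
function segmentOf(nav) {
  const cores = nav.hardwareConcurrency;
  const memory = nav.deviceMemory; // undefined where the API is not exposed
  let deviceTier = 'unknown';
  if (cores !== undefined || memory !== undefined) {
    const low =
      (cores !== undefined && cores <= 4) || (memory !== undefined && memory <= 4);
    deviceTier = low ? 'low' : 'high';
  }
  const network = (nav.connection && nav.connection.effectiveType) || 'unknown';
  return { deviceTier, network };
}
```

In the browser you would call `segmentOf(navigator)` once and merge the result into each beacon, so both the vendor and the self-hosted store receive the same segment keys.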
Failure modes and gotchas
- Vendor lock-in. Once dashboards, alerts, and team muscle-memory live in a vendor UI, leaving is expensive even when TCO says you should. Mitigate by exporting raw beacons to your own store from day one, even on a managed plan, so you always retain the data.
- Sampling opacity. If a vendor does not document its sampling rate and methodology, its p75 is not comparable to your full-fidelity self-hosted p75. Pin down the sampling strategy on both sides before trusting any cross-system delta.
- PII residency drift. A managed vendor may route beacons through regions your compliance posture forbids; full URLs with query strings can leak user IDs. Strip or hash PII at the edge before it leaves your boundary, and confirm the vendor’s data-flow region in writing.
- Attribution gaps in Safari. PerformanceObserver coverage differs across browsers; a metric missing from Safari field data is a capture gap, not a healthy population. Verify both platforms report the same metric set.
- Background-tab suspension. Beacons that wait for unload get dropped when a tab is frozen. The visibilitychange/pagehide flush above is mandatory on both sides; a vendor snippet that flushes on unload only will under-count.
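For the PII point in the list above, one way to scrub a URL before it leaves your boundary is sketched here. The `scrubUrl` name and the ID-matching pattern are assumptions; the pattern is a heuristic that will miss some identifiers and over-match others, so treat it as a starting point rather than a compliance control.

```javascript
// Path segments that look like identifiers: all digits, long hex strings, or UUIDs.
const ID_LIKE =
  /^(\d+|[0-9a-f]{16,}|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$/i;

// Returns only the path, with ID-like segments masked.
// The origin, query string, and fragment are intentionally discarded.
function scrubUrl(rawUrl, base = 'https://example.invalid') {
  const url = new URL(rawUrl, base); // base is only used to resolve relative paths
  return url.pathname
    .split('/')
    .map((part) => (ID_LIKE.test(part) ? ':id' : part))
    .join('/');
}
```

Applied in the capture code, `path: scrubUrl(location.href)` would replace the raw pathname, which also collapses per-user URLs into one route and keeps segment cardinality manageable.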
CI/CD gating
Gate either platform the same way: query p75 for the freshest day and fail the build when it breaches a budget. Self-hosted hits your warehouse directly; managed hits the vendor’s API.
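#!/usr/bin/env bash
set -euo pipefail
# Self-hosted: query ClickHouse for today's LCP p75; gate at 2.5s Good band.
# --get puts the urlencoded query in the URL, where ClickHouse's HTTP
# interface reads it; -f turns an HTTP error into a failed script.
LCP=$(curl -fsS --get 'http://clickhouse.internal:8123/' \
  --data-urlencode "query=SELECT round(quantileMerge(0.75)(lcp_p75_state)/1000, 3)
    FROM cwv_daily_p75 WHERE event_date = today()")
# With no beacons for today the result is not a number; fail closed
# instead of letting an empty day pass the gate.
if ! [[ "$LCP" =~ ^[0-9]+(\.[0-9]+)?$ ]]; then
  echo "::error::No usable LCP p75 for today (got '${LCP}'); blocking deploy."
  exit 1
fi
if awk -v lcp="$LCP" 'BEGIN { exit !(lcp > 2.5) }'; then
  echo "::error::LCP p75 = ${LCP}s exceeds 2.5s Good budget; blocking deploy."
  exit 1
fi
echo "LCP p75 = ${LCP}s within budget."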
For a managed tool, swap the curl target for its budgets API; SpeedCurve exposes performance budgets as a first-class CI signal, which is one of the conveniences you pay for. The gating logic — compare p75 to the Good band, fail on breach — is identical regardless of where the number comes from.
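The source-independent part of the gate can be sketched as a small function that checks several metrics at once. The `gate` name and the `GOOD_BUDGETS` object are assumptions for this example, with the values taken from the Good column of the threshold table; treating a missing metric as a failure is a design choice of this sketch.

```javascript
// Good-band budgets at p75: milliseconds, except CLS which is unitless.
const GOOD_BUDGETS = { LCP: 2500, INP: 200, CLS: 0.1, FCP: 1800, TTFB: 800 };

// Returns one breach per metric that is over budget or missing.
// An empty array means the build may proceed.
function gate(p75ByMetric, budgets = GOOD_BUDGETS) {
  const breaches = [];
  for (const [metric, budget] of Object.entries(budgets)) {
    const value = p75ByMetric[metric];
    if (typeof value !== 'number' || Number.isNaN(value)) {
      breaches.push({ metric, reason: 'missing' }); // fail closed on absent data
    } else if (value > budget) {
      breaches.push({ metric, reason: 'over-budget', value, budget });
    }
  }
  return breaches;
}
```

A CI step would fetch the p75 values from whichever platform it uses, call `gate(values)`, and exit non-zero when the returned array is non-empty.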
FAQ
Is SpeedCurve cheaper than self-hosted RUM?
At low to moderate traffic, yes — once you load in the engineering carry, a managed plan often costs less than a self-hosted stack dominated by a fractional FTE. The crossover arrives as volume grows, because the managed per-session fee scales with traffic while self-hosted infra grows sublinearly. Run the TCO model at your real volume before deciding.
When should we move from managed RUM to self-hosting?
Move when three or more matrix rows hit a hard constraint: a data-residency mandate, a cost cliff at your volume, a custom-metric ceiling, or a retention horizon longer than any vendor plan. Any one constraint alone rarely justifies the operational burden; an accumulation does.
Can I compare SpeedCurve’s p75 to my own ClickHouse p75 directly?
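Only if both sample the same population in the same way and record the same lifecycle-final metric values. A uniform random sample at a different rate still estimates the same p75; a sample biased toward certain users, pages, or browsers does not. Differences in sampling methodology or attribution will produce a persistent gap that is not noise. Reconcile during a dual-write window and root-cause any divergence before trusting either number.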
Does self-hosting RUM mean giving up alerting and budgets?
No, but you build them. You wire Grafana or Alertmanager rules against your own percentiles and port every vendor budget into a CI query. Managed tools ship these out of the box; self-hosting trades that convenience for unbounded flexibility.
How do I avoid vendor lock-in while on a managed plan?
Export raw beacons to your own object store from day one, even while the vendor remains your primary dashboard. Retaining the underlying data means a future migration is a reconciliation exercise rather than a cold start.
Related
- RUM Vendor Comparison — the broader managed-vendor evaluation framework these axes feed into.
- Datadog vs New Relic vs Self-Hosted RUM — head-to-head on two more managed exemplars against self-hosting.
- Self-Hosted Beacon Collection — the ingestion pipeline that backs the self-hosted side of this decision.
- RUM Data Sampling Strategies — how to keep p75 comparable across managed and self-hosted systems.
- Grafana Dashboards for Web Performance — building the dashboards that replace a vendor UI after migration.