Setting up a self-hosted RUM pipeline with ClickHouse
Moving from vendor-managed telemetry to an in-house observability stack requires precise control over ingestion endpoints, payload routing, and columnar storage topology, along with deterministic schema evolution and low-latency edge routing. This guide details the architecture, schema design, ingestion optimization, and statistical query patterns needed to capture, store, and analyze Real-User Monitoring (RUM) and Core Web Vitals data at scale.
Architectural Foundations & Data Flow
The pipeline begins at the browser edge, using `PerformanceObserver` (paint, layout-shift, and event timing entries) together with the W3C Navigation Timing API to capture First Contentful Paint (FCP), Largest Contentful Paint (LCP), Cumulative Layout Shift (CLS), and Interaction to Next Paint (INP). Data is serialized into compact JSON payloads and dispatched via `navigator.sendBeacon()`, which queues delivery asynchronously and survives page unload far more reliably than a plain `fetch()` (delivery is best-effort, not guaranteed).
Critical Constraints & Routing Logic:
- Payload Size: Keep each beacon under 64 KB; user agents enforce a per-origin quota and reject `sendBeacon()` calls that would exceed it.
- Retry Fallback: If `sendBeacon()` returns `false` (e.g. quota exhaustion), fall back to `fetch()` with `keepalive: true`, retrying with exponential backoff.
- Edge Routing: Deploy a reverse proxy at the network edge to terminate TLS, strip identifying headers, and forward payloads to the ingestion cluster over HTTP/2.
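The dispatch path above can be sketched in a few lines of browser JavaScript. The endpoint path and function names are illustrative, and the exponential-backoff loop is omitted for brevity:

```javascript
// Beacon dispatcher sketch: enforce the ~64 KB sendBeacon quota, then
// fall back to fetch(..., { keepalive: true }) if sendBeacon refuses.
const BEACON_LIMIT_BYTES = 64 * 1024;

function withinBeaconQuota(json) {
  // The serialized body counts against a per-origin beacon quota
  return new TextEncoder().encode(json).length < BEACON_LIMIT_BYTES;
}

function dispatch(endpoint, metrics) {
  const body = JSON.stringify(metrics);
  if (!withinBeaconQuota(body)) return false; // drop oversized payloads
  if (typeof navigator !== "undefined" && navigator.sendBeacon) {
    if (navigator.sendBeacon(endpoint, body)) return true;
  }
  // Fallback: keepalive lets the request outlive page unload
  if (typeof fetch !== "undefined") {
    fetch(endpoint, { method: "POST", body, keepalive: true }).catch(() => {});
    return true;
  }
  return false;
}
```

Note that the `keepalive` fetch fallback is subject to the same in-flight byte budget as `sendBeacon`, so the size guard applies to both paths.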
```mermaid
flowchart LR
    A[Browser PerformanceObserver] -->|JSON Payload| B(navigator.sendBeacon)
    B --> C[Edge Reverse Proxy]
    C -->|TLS Termination & PII Stripping| D[Ingestion Batcher]
    D -->|Batched Inserts| E[(ClickHouse MergeTree)]
    E --> F[Grafana / BI Layer]
```
When evaluating foundational infrastructure decisions for RUM architecture, tooling, and self-hosting, engineers should verify that beacon dispatch stays off the critical path: `sendBeacon()` hands the payload to the browser asynchronously, but payload serialization still runs on the main thread.
ClickHouse Table Schema & Storage Engine Configuration
High-throughput web metrics require columnar storage optimized for time-series aggregation. A robust schema uses the MergeTree engine with explicit partitioning, deterministic sorting keys, and aggressive compression.
```sql
CREATE TABLE rum_events
(
    `event_id` UUID,
    `timestamp` DateTime64(3),
    `project_id` LowCardinality(String),
    `session_id` String,  -- high cardinality: LowCardinality would bloat the dictionary
    `page_url` String,
    `device_tier` LowCardinality(String),
    `browser` LowCardinality(String),
    `os` LowCardinality(String),
    `connection_type` LowCardinality(String),
    `lcp_ms` UInt32 CODEC(ZSTD(1)),
    `cls_score` Float32 CODEC(ZSTD(1)),
    `inp_ms` UInt32 CODEC(ZSTD(1)),
    `fcp_ms` UInt32 CODEC(ZSTD(1)),
    `raw_payload` String DEFAULT ''
)
ENGINE = MergeTree()
PARTITION BY toDate(timestamp)
ORDER BY (project_id, device_tier, timestamp)
TTL timestamp + INTERVAL 90 DAY DELETE
SETTINGS index_granularity = 8192;
```
Storage Optimizations:
- `LowCardinality(String)`: Applied to categorical dimensions (browser, OS, connection type); typically shrinks the on-disk footprint by 60-80% while keeping dimensional slicing fast.
- `CODEC(ZSTD(1))`: Balances compression ratio and CPU overhead for numeric metrics.
- `TTL`: Automatically purges rows older than 90 days to control storage costs.
- Materialized Columns: Derive CWV pass/fail flags at insert time with `MATERIALIZED` columns to avoid recomputing them during aggregation.
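A sketch of that materialized-flag pattern, where the `lcp_good` column name is illustrative and 2500 ms is Google's published "good" LCP threshold:

```sql
ALTER TABLE rum_events
    ADD COLUMN lcp_good UInt8 MATERIALIZED lcp_ms <= 2500;
```

Because the flag is evaluated once when the row is written, dashboards can compute the passing share directly with `avg(lcp_good)` instead of re-evaluating the comparison per query.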
Ingestion Optimization & Write Amplification Control
Raw beacon streams frequently exceed baseline ingestion capacity during traffic spikes. Direct, high-frequency INSERT statements cause excessive part fragmentation and degrade MergeTree background compaction. Implement server-side batching and deterministic sampling at the collector layer.
Nginx Ingress Configuration:
```nginx
http {
    client_max_body_size 64k;
    client_body_buffer_size 16k;

    upstream clickhouse_ingest {
        # In production, point this at the batching collector rather than
        # ClickHouse's HTTP port directly, so inserts arrive pre-batched.
        server 127.0.0.1:8123;
        keepalive 32;
    }

    server {
        listen 443 ssl;

        location /rum/collect {
            proxy_pass http://clickhouse_ingest;
            proxy_http_version 1.1;
            proxy_set_header Connection "";  # required for upstream keepalive
            proxy_buffering on;
            proxy_buffer_size 4k;
            proxy_buffers 8 4k;
            proxy_connect_timeout 5s;
            proxy_read_timeout 10s;
        }
    }
}
```
Batching & Sampling Strategy:
- Batch Window: Accumulate beacons for `500ms` or `100` rows, whichever triggers first, then flush via `INSERT INTO rum_events FORMAT JSONEachRow`.
- Deterministic Sampling: Apply hash-based sampling so a given session is always kept or always dropped, preserving p75/p90 distributions without bias:

```sql
-- Collector-side filter (20% sample)
WHERE cityHash64(session_id) % 100 < 20
```
Aligning ingestion throughput with established Self-Hosted Beacon Collection strategies prevents write amplification while preserving metric fidelity.
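A minimal collector-side batcher implementing the 500 ms / 100-row window might look like the sketch below. `BeaconBatcher` and `flushFn` are illustrative names; in production `flushFn` would POST the NDJSON body to ClickHouse's HTTP interface with `INSERT INTO rum_events FORMAT JSONEachRow`:

```javascript
// Collector-side batcher sketch: flush when either maxRows is reached
// or maxMs elapses, whichever comes first.
class BeaconBatcher {
  constructor(flushFn, { maxRows = 100, maxMs = 500 } = {}) {
    this.flushFn = flushFn;
    this.maxRows = maxRows;
    this.maxMs = maxMs;
    this.rows = [];
    this.timer = null;
  }

  push(row) {
    this.rows.push(row);
    if (this.rows.length >= this.maxRows) {
      this.flush(); // size trigger
    } else if (!this.timer) {
      this.timer = setTimeout(() => this.flush(), this.maxMs); // time trigger
      if (this.timer.unref) this.timer.unref(); // don't hold the process open
    }
  }

  flush() {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    if (this.rows.length === 0) return;
    const batch = this.rows;
    this.rows = [];
    // One newline-delimited JSON document per row, matching JSONEachRow
    this.flushFn(batch.map((r) => JSON.stringify(r)).join("\n"));
  }
}
```

Flushing a few large inserts instead of many small ones keeps MergeTree part counts low and lets background merges keep pace.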
Core Web Vitals Aggregation & Percentile Queries
Accurate CWV reporting requires distribution-aware metrics: averages obscure tail latency and layout instability. ClickHouse’s quantileTiming and quantileExact functions compute percentiles aligned with Google’s field-data thresholds (quantileTiming is tuned for millisecond timing data; quantileExact trades memory for exactness).
Percentile SQL Templates:
```sql
SELECT
    toDate(timestamp) AS day,
    quantileTiming(0.75)(lcp_ms) AS p75_lcp,
    quantileExact(0.90)(cls_score) AS p90_cls,
    quantileTiming(0.75)(inp_ms) AS p75_inp,
    uniq(session_id) AS sessions  -- count() would report events, not sessions
FROM rum_events
WHERE timestamp >= now() - INTERVAL 7 DAY
  AND project_id = 'prod-frontend'
GROUP BY day
ORDER BY day DESC;
```
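Beyond percentiles, a pass-rate view is often more actionable for SLOs. The following sketch uses `countIf` against Google's published "good" thresholds (LCP ≤ 2.5 s, INP ≤ 200 ms, CLS ≤ 0.1) over the same table:

```sql
-- Share of events meeting the "good" field-data thresholds
SELECT
    countIf(lcp_ms <= 2500) / count()   AS lcp_good_rate,
    countIf(inp_ms <= 200) / count()    AS inp_good_rate,
    countIf(cls_score <= 0.1) / count() AS cls_good_rate
FROM rum_events
WHERE timestamp >= now() - INTERVAL 28 DAY
  AND project_id = 'prod-frontend';
```

The 28-day window mirrors the collection period used by Google's field datasets, keeping self-hosted numbers comparable.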
Indexing & Execution Optimization:
- Primary Sort Key: `(project_id, device_tier, timestamp)` enables rapid prefix filtering for dashboard queries.
- Skipping Indexes: Accelerate filtering on values extracted from the raw JSON payload:

```sql
ALTER TABLE rum_events
    ADD INDEX idx_payload_type JSONExtractString(raw_payload, 'type')
    TYPE set(100) GRANULARITY 4;
```

- Execution Verification: Run `EXPLAIN indexes = 1, actions = 1` before deploying queries to production, and confirm the `ReadFromMergeTree` step reports the primary key and skipping indexes in use.
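For instance, a dashboard percentile query can be checked before rollout (the exact plan output varies by ClickHouse version):

```sql
EXPLAIN indexes = 1
SELECT quantileTiming(0.75)(lcp_ms)
FROM rum_events
WHERE project_id = 'prod-frontend'
  AND timestamp >= now() - INTERVAL 7 DAY;
```

A plan that lists the primary key with a narrow granule range confirms the `project_id` prefix filter is being applied before any full-column scan.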
Privacy Enforcement & Dimensional Segmentation
Self-hosted telemetry must enforce strict data minimization and regional compliance boundaries. PII stripping and IP anonymization should occur at the reverse proxy before persistence.
Implementation Controls:
- IP Truncation: Anonymize client IPs at ingestion by keeping only the network prefix, e.g. the lower bound of the address's /64 range:

```sql
-- Zero the host bits: keep only the /64 network prefix
SELECT IPv6CIDRToRange(toIPv6(client_ip), 64).1 AS anon_ip
```

- Session Stitching: Replace third-party cookies with ephemeral, first-party tokens rotated every 24 hours. Bind sessions to coarse device signals (`navigator.hardwareConcurrency`, `screen.width`) for device classification without cross-site tracking.
- GeoIP & Device Tier Mapping: Load external dictionaries for fast dimensional joins:
```sql
CREATE DICTIONARY geo_mapping
(
    `ip_prefix` UInt32,
    `country_code` String,
    `region` String
)
PRIMARY KEY ip_prefix
SOURCE(CLICKHOUSE(TABLE 'ip_geo_blocks' DB 'default'))
LIFETIME(MIN 300 MAX 3600)
LAYOUT(HASHED_ARRAY());
```
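Once the dictionary is loaded, lookups happen inline at query time. In this sketch, `rum_events_geo` and its `ip_prefix` column are hypothetical, standing in for wherever the anonymized prefix is persisted:

```sql
SELECT
    dictGet('geo_mapping', 'country_code', toUInt64(ip_prefix)) AS country,
    quantileTiming(0.75)(lcp_ms) AS p75_lcp
FROM rum_events_geo  -- hypothetical: rum_events extended with an ip_prefix column
GROUP BY country
ORDER BY p75_lcp DESC;
```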
Sanitized datasets enable granular Device Tier Analysis and Geographic Performance Breakdowns without violating GDPR/CCPA data residency requirements.
Pipeline Validation & Debugging Checklist
Operational stability requires continuous validation of ingestion health, query accuracy, and storage compaction. Use the following triage workflow to diagnose pipeline degradation.
| Symptom | Diagnostic Query / Action | Resolution |
|---|---|---|
| Slow Aggregations | `SELECT query, elapsed FROM system.query_log WHERE type = 'QueryFinish' AND elapsed > 5 ORDER BY elapsed DESC LIMIT 10;` | Add missing `WHERE` predicates, verify filters align with the sort key, or increase `max_threads`. |
| Merge Backlog | `SELECT count() AS active_parts FROM system.parts WHERE table = 'rum_events' AND active;` (cross-check `system.merges`) | Increase `background_pool_size`, verify disk IOPS, or batch inserts more aggressively. |
| Dropped Beacons | Monitor Nginx `413 Request Entity Too Large` and `499 Client Closed Request` log entries. | Increase `client_max_body_size`, compress payloads client-side, or verify CORS preflight headers. |
| Compression Drift | `SELECT table, formatReadableSize(sum(data_uncompressed_bytes)) AS raw, formatReadableSize(sum(data_compressed_bytes)) AS stored FROM system.parts WHERE table = 'rum_events' AND active GROUP BY table;` | Verify `CODEC` settings; run `OPTIMIZE TABLE rum_events FINAL` during a maintenance window. |
Validation Workflow:
- Synthetic Injection: Simulate beacon payloads via:

```shell
curl -X POST -H "Content-Type: application/json" \
  -d '{"timestamp": "...", "lcp_ms": 1200}' \
  https://edge.example.com/rum/collect
```

- Parity Check: Compare field telemetry distributions against synthetic lab runs (Lighthouse/WebPageTest) to confirm pipeline fidelity.
- Index Verification: Run `EXPLAIN` on production dashboard queries to confirm `ReadFromMergeTree` utilizes primary key filtering and skipping indexes.