Setting up a self-hosted RUM pipeline with ClickHouse

Setting up a self-hosted RUM pipeline with ClickHouse requires precise control over ingestion endpoints, payload routing, and columnar storage topology. Transitioning from vendor-managed telemetry to an in-house observability stack demands deterministic schema evolution and low-latency edge routing. This guide details the architecture, schema design, ingestion optimization, and statistical query patterns required to capture, store, and analyze Real-User Monitoring (RUM) & Core Web Vitals Tracking data at scale.

Architectural Foundations & Data Flow

The pipeline begins at the browser edge, leveraging the W3C Navigation Timing API and PerformanceObserver to capture first-contentful-paint, largest-contentful-paint, cumulative-layout-shift, and interaction-to-next-paint metrics. Data is serialized into compact JSON payloads and dispatched via navigator.sendBeacon(), which guarantees delivery even during page unload events.

Critical Constraints & Routing Logic:

Payload Size: Strictly enforce < 64KB per beacon to bypass browser queue limits and avoid truncation.
Retry Fallback: Implement exponential backoff with fetch() as a fallback if sendBeacon() fails due to quota exhaustion.
Edge Routing: Deploy a reverse proxy at the network edge to terminate TLS, strip headers, and forward payloads to the ingestion cluster over HTTP/2.

flowchart LR
 A[Browser PerformanceObserver] -->|JSON Payload| B(navigator.sendBeacon)
 B --> C[Edge Reverse Proxy]
 C -->|TLS Termination & PII Stripping| D[Ingestion Batcher]
 D -->|Batched Inserts| E[(ClickHouse MergeTree)]
 E --> F[Grafana / BI Layer]

When evaluating foundational infrastructure decisions for RUM Architecture, Tooling & Self-Hosting, engineers must map edge-to-origin latency to ensure beacon dispatch does not block main-thread execution.

ClickHouse Table Schema & Storage Engine Configuration

High-throughput web metrics require columnar storage optimized for time-series aggregation. A robust schema uses the MergeTree engine with explicit partitioning, deterministic sorting keys, and aggressive compression.

CREATE TABLE rum_events
(
 `event_id` UUID,
 `timestamp` DateTime64(3),
 `project_id` LowCardinality(String),
 `session_id` LowCardinality(String),
 `page_url` String,
 `device_tier` LowCardinality(String),
 `browser` LowCardinality(String),
 `os` LowCardinality(String),
 `connection_type` LowCardinality(String),
 `lcp_ms` UInt32 CODEC(ZSTD(1)),
 `cls_score` Float32 CODEC(ZSTD(1)),
 `inp_ms` UInt32 CODEC(ZSTD(1)),
 `fcp_ms` UInt32 CODEC(ZSTD(1)),
 `raw_payload` String DEFAULT ''
)
ENGINE = MergeTree()
PARTITION BY toDate(timestamp)
ORDER BY (project_id, device_tier, timestamp)
TTL timestamp + INTERVAL 90 DAY DELETE
SETTINGS index_granularity = 8192;

Storage Optimizations:

LowCardinality(String): Applied to categorical dimensions (browser, OS, connection type) to reduce disk footprint by 60-80% while maintaining fast dimensional slicing.
CODEC(ZSTD(1)): Balances compression ratio and CPU overhead for numeric metrics.
TTL: Automatically purges raw payloads older than 90 days to control storage costs.
Materialized Columns: Derive CWV pass/fail flags at insert time using MATERIALIZED columns to avoid runtime computation overhead during aggregation.

Ingestion Optimization & Write Amplification Control

Raw beacon streams frequently exceed baseline ingestion capacity during traffic spikes. Direct, high-frequency INSERT statements cause excessive part fragmentation and degrade MergeTree background compaction. Implement server-side batching and deterministic sampling at the collector layer.

Nginx Ingress Configuration:

http {
 client_max_body_size 64k;
 client_body_buffer_size 16k;
 
 upstream clickhouse_ingest {
 server 127.0.0.1:8123;
 keepalive 32;
 }

 server {
 listen 443 ssl;
 location /rum/collect {
 proxy_pass http://clickhouse_ingest;
 proxy_http_version 1.1;
 proxy_buffering on;
 proxy_buffer_size 4k;
 proxy_buffers 8 4k;
 proxy_connect_timeout 5s;
 proxy_read_timeout 10s;
 }
 }
}

Batching & Sampling Strategy:

Batch Window: Accumulate beacons for 500ms or 100 rows, whichever triggers first. Flush via INSERT INTO rum_events FORMAT JSONEachRow.
Deterministic Sampling: Apply hash-based sampling to maintain statistical parity without biasing p75/p90 distributions.

-- Collector-side filter (20% sample)
WHERE cityHash64(session_id) % 100 < 20

Aligning ingestion throughput with established Self-Hosted Beacon Collection strategies prevents write amplification while preserving metric fidelity.

Core Web Vitals Aggregation & Percentile Queries

Accurate CWV reporting requires distribution-aware metrics. Averages obscure tail latency and layout instability. ClickHouse’s quantileTiming and quantileExact functions compute precise percentiles aligned with Google’s field data thresholds.

Percentile SQL Templates:

SELECT
 toDate(timestamp) AS day,
 quantileTiming(0.75)(lcp_ms) AS p75_lcp,
 quantileExact(0.90)(cls_score) AS p90_cls,
 quantileTiming(0.75)(inp_ms) AS p75_inp,
 count() AS sessions
FROM rum_events
WHERE timestamp >= now() - INTERVAL 7 DAY
 AND project_id = 'prod-frontend'
GROUP BY day
ORDER BY day DESC;

Indexing & Execution Optimization:

Primary Sort Key: (project_id, device_tier, timestamp) enables rapid prefix filtering for dashboard queries.
Skipping Indexes: Accelerate JSON path extraction if storing nested payloads:

ALTER TABLE rum_events ADD INDEX idx_payload_type payload.type TYPE minmax GRANULARITY 4;

Execution Verification: Run EXPLAIN indexes = 1, actions = 1 before deploying queries to production. Ensure ReadFromStorage steps show Filter and Sort operations utilizing the primary key.

Privacy Enforcement & Dimensional Segmentation

Self-hosted telemetry must enforce strict data minimization and regional compliance boundaries. PII stripping and IP anonymization should occur at the reverse proxy before persistence.

Implementation Controls:

IP Truncation: Hash and truncate IPs at ingestion using IPv6StringToNum() and bitwise masking:

IPv6StringToNum(client_ip) & toIPv6('ffff:ffff:ffff:ffff:0000:0000:0000:0000')

Session Stitching: Replace third-party cookies with ephemeral, first-party tokens rotated every 24 hours. Bind sessions to navigator.hardwareConcurrency and screen.width for device fingerprinting without cross-site tracking.
GeoIP & Device Tier Mapping: Load external dictionaries for fast dimensional joins:

CREATE DICTIONARY geo_mapping
(
`ip_prefix` UInt32,
`country_code` String,
`region` String
)
PRIMARY KEY ip_prefix
SOURCE(CLICKHOUSE(TABLE 'ip_geo_blocks' DB 'default'))
LIFETIME(MIN 300 MAX 3600)
LAYOUT(HASHED_ARRAY());

Sanitized datasets enable granular Device Tier Analysis and Geographic Performance Breakdowns without violating GDPR/CCPA data residency requirements.

Pipeline Validation & Debugging Checklist

Operational stability requires continuous validation of ingestion health, query accuracy, and storage compaction. Use the following triage workflow to diagnose pipeline degradation.

Symptom	Diagnostic Query / Action	Resolution
Slow Aggregations	`SELECT query, elapsed FROM system.query_log WHERE type='QueryFinish' AND elapsed > 5 ORDER BY elapsed DESC LIMIT 10;`	Add missing `WHERE` predicates, verify `ORDER BY` aligns with sort key, or increase `max_threads`.
Merge Backlog	`SELECT database, table, parts, merges FROM system.tables WHERE database = 'default' AND table = 'rum_events';`	Increase `background_pool_size`, verify disk IOPS, or reduce `insert_quorum`.
Dropped Beacons	Monitor Nginx `413 Request Entity Too Large` and `499 Client Closed Request` logs.	Increase `client_max_body_size`, optimize payload compression, or verify CORS preflight headers.
Compression Drift	`SELECT table, formatReadableSize(data_uncompressed_bytes), formatReadableSize(data_compressed_bytes) FROM system.parts WHERE table = 'rum_events';`	Verify `CODEC` settings, force `OPTIMIZE TABLE FINAL` during maintenance windows.

Validation Workflow:

Synthetic Injection: Simulate beacon payloads via curl -X POST -H "Content-Type: application/json" -d '{"timestamp": "...", "lcp_ms": 1200}' https://edge.example.com/rum/collect
Parity Check: Compare field telemetry distributions against synthetic lab runs (Lighthouse/WebPageTest) to confirm pipeline fidelity.
Index Verification: Run EXPLAIN on production dashboard queries to confirm ReadFromMergeTree utilizes primary key filtering and skipping indexes.