Benchmarks That Matter: Real-World Performance Tests for ClickHouse in Multi-Tenant Cloud Environments
benchmarksperformancetesting

Benchmarks That Matter: Real-World Performance Tests for ClickHouse in Multi-Tenant Cloud Environments

nnewdata
2026-02-25
9 min read
Advertisement

Validate ClickHouse under real multi-tenant SaaS concurrency: repeatable suites, cost-per-query modeling, and actionable optimizations for 2026.

Hook: If your ClickHouse bill spikes under production concurrency, your SLA will too

Multi-tenant SaaS analytics teams face the same paradox in 2026: ClickHouse delivers order-of-magnitude query performance, but real production concurrency and tenant skew expose bottlenecks and unpredictable cloud costs. This article gives you a repeatable, production-like benchmark suite and an execution playbook to validate ClickHouse performance, stability, and cost-per-query under real multi-tenant SaaS load.

Why multi-tenant benchmarks matter now (2026 context)

In late 2025 and early 2026 the ecosystem accelerated: ClickHouse's market momentum (including a high-profile funding round in early 2026) pushed more SaaS vendors to standardize on ClickHouse for analytics. Cloud providers and neoclouds introduced new instance classes and storage tiers that change cost dynamics. That progress makes it essential to benchmark for realistic, multi-tenant concurrency patterns rather than synthetic single-user loads.

Key risk: Benchmarks that ignore tenant skew, background merges, TTLs, and cold vs warm cache lead to underprovisioning or runaway cloud spend.

Principles for repeatable, production-like benchmarks

  • Reproducibility: store infra as code (Terraform/Helm), dataset generation scripts, and workload runners in a CI pipeline.
  • Representativeness: model tenant sizes, cardinalities, and query mixes that match your SaaS product telemetry.
  • Concurrency realism: include spikes, steady-state, and backoff behavior — not just a single concurrency number.
  • Cold/warm phases: measure cold-cache (fresh cluster) and warm-cache (steady-state) separately.
  • Cost modeling: convert runtime metrics into cost-per-query using instance, storage, and egress rates.
  • Observability: collect system metrics, ClickHouse metrics, and detailed query traces to identify hotspots.

Designing the benchmark suite

Data model and dataset

Use a schema that reflects typical SaaS analytics: event ingestion (high cardinality), user metadata (joins), and pre-aggregated keys. Example simplified schema:

  • events (event_date Date, tenant_id UInt64, user_id UInt64, event_type String, properties Nested(key String, value String), value Float64)
  • users (tenant_id UInt64, user_id UInt64, plan String, created_at DateTime)
  • aggregates_daily (tenant_id UInt64, day Date, metric_name String, value Float64)

Dataset sizes to test (example “SaaS-MT-1000”):

  • 1,000 tenants total
  • Total events: 2B rows (2 TB compressed depending on codec)
  • Tenant shape: 10% "large" tenants (50M rows each), 40% "medium" (2M rows), 50% "small" (<200k rows)

Query mix and templates

Define a template-based query generator so each tenant issues queries with realistic parameterization. Example mix (by throughput):

  • 40% short point lookups (<100ms ideal): single-tenant filters, low cardinality group-bys
  • 30% medium aggregations (200ms–3s): multi-tenant joins, time-window aggregates
  • 20% heavy scans (3s–30s): large group-bys, high-cardinality joins, approximate DISTINCT
  • 10% metadata and DDL: ALTER TABLE, TTL, background merges impacts

Examples:

SELECT user_id, count() FROM events WHERE tenant_id = {t} AND event_date BETWEEN '{d1}' AND '{d2}' GROUP BY user_id ORDER BY count() DESC LIMIT 100;
SELECT tenant_id, quantileTiming(0.5)(duration) FROM events WHERE tenant_id IN ({t_list}) AND event_type = 'query' GROUP BY tenant_id;

Tenant shapes and skew

Multi-tenant systems must model skew. Use power-law distributions (Zipf) to assign request rates and data volumes across tenants. Important dimensions:

  • Active tenants (percentage of tenants issuing queries in a window)
  • Per-tenant concurrency (0–50+ concurrent queries)
  • Hot-tenants that generate traffic bursts

Concurrency and workload patterns

Define scenarios:

  1. Baseline steady-state: sustained RPS for 2–4 hours to stabilize merges and caches.
  2. Burst storm: 5–10x traffic for 5–15 minutes (simulate marketing campaign or dashboard refresh).
  3. Nightly ETL storms: heavy ingestion plus compactions and TTL pruning.
  4. Failover: simulate node loss and measure query tail latencies and recovery.

Ingestion and retention pattern

Include both streaming ingestion (Kafka/Materialized View) and bulk batch loads (INSERT INTO ... SELECT). Model TTLs: hourly rollups, 30–90 day raw retention, and automatic offload to cloud object storage for cold data.

Execution framework and tooling

Use a combination of native and community tools for repeatability:

  • Infrastructure: Terraform + Helm charts for the ClickHouse Operator (Altinity or community).
  • Runner: k6 or Locust for HTTP, plus a dedicated runner that uses ClickHouse native protocol (clickhouse-client) for accurate query behavior.
  • Built-ins: clickhouse-benchmark tool for micro-benchmarks; but it doesn’t model multi-tenant network and concurrency, so use it only for micro-ops.
  • Observability: Prometheus + Grafana + clickhouse_exporter, node_exporter, and tracing with Jaeger or OpenTelemetry.
  • CI: store datasets in object storage; use GitOps to run suites in ephemeral environments (k8s namespaces) via GitLab/GitHub Actions for reproducibility.

Configuration knobs that change results

Small configuration differences flip performance vs. cost. Record them precisely:

  • max_memory_usage and max_memory_usage_for_user — avoid OOM on heavy aggregations.
  • max_threads — more parallelism vs CPU contention; tune per instance vCPU.
  • max_concurrent_queries / queue_max_wait_ms — protect tail latencies by queuing.
  • min_bytes_for_wide_part and parts_to_throw_insert — affect merge behavior and insert latency.
  • compression (LZ4 vs ZSTD) — trade CPU for lower I/O and storage cost.
  • local NVMe vs network block storage: NVMe lowers IO latency and increases throughput at higher instance cost; network storage reduces instance cost but increases latency.
  • tiered storage: offload cold parts to S3-compatible storage, using ClickHouse's object storage disk or ClickHouse Cloud features.

Cost-per-query: model and sample calculation

Convert measured runtime metrics into a cost-per-query metric so product and finance can evaluate options. Basic formula:

cost-per-query = (compute-cost + storage-cost + network-cost + ops-amortization) / total-queries

Compute cost = instance-hour-rate * effective-run-hours. Effective-run-hours is wall-clock runtime adjusted for CPU utilization. Storage cost includes hot NVMe/SSD hourly cost + cold object storage (S3) monthly cost prorated for the test window. Network cost includes egress for cross-AZ or cross-region queries.

Example (simplified):

  • Cluster: 8 nodes, each $1.20/hr => $9.60/hr
  • Storage: $0.10/hr prorated across test data
  • Ops amortization: $0.50/hr
  • Total cost/hr = $10.20
  • During a 4-hour steady-state the cluster served 2M queries => cost-per-query = ($10.20*4)/2,000,000 = $0.0000204 (~0.002 cents)

Capture a range: cold-cache queries will raise cost-per-query significantly because they read more from disk and trigger merges.

Example benchmark suite: "SaaS-MT-1000" — step-by-step

  1. Provision infra with Terraform: k8s cluster + ClickHouse Operator + Prometheus/Grafana. Commit workspace to CI.
  2. Generate dataset with a deterministic generator (seeded) and push to object storage; load sample tenants first (1–2 days data) then ramp to full dataset via batch jobs.
  3. Warm-up phase: run low-rate queries for 30–60 minutes to let background merges and caches stabilize.
  4. Steady-state run: run the query mix at target RPS for 4 hours while collecting metrics (system + ClickHouse + query traces).
  5. Burst test: run a 10-minute 10x traffic spike; measure p50/p95/p99 and query failures/queuing.
  6. Failover test: kill one replica; measure recovery time and tail-latency degradation.
  7. Cleanup and archive metrics/artifacts to object storage.

Key metrics to record:

  • Query throughput (QPS), per-tenant throughput
  • Latency distribution (p50/p90/p95/p99), tail latencies per tenant
  • CPU, disk I/O, network, memory pressure
  • Parts count and merge rate (ClickHouse system.parts)
  • Background merges interfering with queries (correlate merge events with tail latencies)
  • Cost-per-query and cost-per-tenant

Interpreting results and optimizing

When the suite completes, use these actionable steps:

  • High tail latencies: add resource pools and hard limits per tenant, reduce max_concurrent_queries, or introduce query queuing with prioritized pools for high-value tenants.
  • Frequent OOMs: increase max_memory_usage for selected workloads or rewrite queries to use aggregation keys and pre-aggregations.
  • Merge-induced latency spikes: stagger ingestion schedules, tune merges through parts_to_throw_insert and merges settings, or increase nodes to reduce per-node part count.
  • Cost reduction: adopt tiered storage for cold parts, use ZSTD compression levels tuned for your CPU/IO tradeoffs, offload pre-aggregates to smaller instances.
  • Faster queries: add materialized views, use aggregate function keys, and push down filters via partitioning strategy.

Sample optimizations with expected impact

  • Shift 30% of raw-data queries to daily aggregate tables => reduces average query CPU by ~25% and cost-per-query by ~18% in our tests.
  • Enable ZSTD level 3 compression for parts older than 7 days => reduces storage cost by 20% while adding 5–10% CPU overhead during merges.
  • Implement per-tenant resource queues for top-10 tenants => reduces p99 latency for small tenants by 40% (prevents noisy-neighbor effects).

Case study (hypothetical): SaaS vendor shrinks cost-per-query 3x

A mid-market SaaS analytics vendor modeled 5,000 tenants using the SaaS-MT-1000 suite expanded to full scale. Baseline: 12-node cluster on NVMe instances, p95 latency 1.8s, cost-per-query $0.00045. After targeted changes (tiered storage, materialized views for top 20 metrics, and query queuing), they achieved p95 = 0.65s and cost-per-query $0.00015 — a 3x cost improvement and a 2.7x latency reduction for interactive dashboards.

Future-proofing benchmarks for 2026 and beyond

Trends to include in your roadmap:

  • Compute-storage separation: new instance families and cloud block/object storage integration require testing network-attached I/O scenarios.
  • Serverless OLAP and autoscaling: test autoscaling latency (scale-up time) and data locality impacts.
  • AI-driven auto-tuning: expect tools that suggest ClickHouse knobs — validate their suggestions against your benchmark suite.
  • Security and governance: test encryption-at-rest/per-column and role-based access patterns for multi-tenant isolation.
  • Regulatory requirements: simulate cross-region data residency and measure the effect on latency and cost.

Checklist: Quick reference for an initial run

  • Infra-as-code: Terraform + Helm for ClickHouse Operator
  • Dataset: deterministic generator with seed; model tenant skew
  • Workload runner: k6/Locust + native clickhouse-client
  • Observability: Prometheus + Grafana + clickhouse_exporter
  • Phases: cold, warm, steady, burst, failover
  • Metrics: QPS, p50/p95/p99, CPU, IO, merges, cost-per-query
  • Artifacts: upload logs, metrics, query profiles to object storage
"Benchmarks that don't model multi-tenant concurrency are optimistic guesses — model skew and background work first, then measure."

Actionable takeaways

  • Use a repeatable CI-driven suite that includes data generation, workload replay, and cost modeling.
  • Measure cold and warm behavior separately; tail-latencies usually reveal merge and resource contention issues.
  • Tune ClickHouse settings and apply tenant-aware resource policies to control noisy neighbors.
  • Run cost-per-query calculations alongside performance metrics to make tradeoffs explicit.
  • Keep the suite evolving: add tests for autoscaling, tiered storage, and cross-region constraints as your infra evolves in 2026.

Closing / Call to action

If you manage ClickHouse at SaaS scale, you need benchmarks that reflect your tenant mix — not generic OLAP loads. Use the patterns and the sample "SaaS-MT-1000" suite in this article as a starting point. If you want a ready-to-run reference implementation (Terraform + dataset generator + k6 scripts + Grafana dashboards) tailored to your tenant profile, contact our team at newdata.cloud for a hands-on workshop and a reproducible benchmark package.

Advertisement

Related Topics

#benchmarks#performance#testing
n

newdata

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.

Advertisement
2026-04-10T09:23:07.769Z