34 · Horizontal scaling: sharding, consistent hashing, read replicas

There is a ceiling on how many writes a single writer can handle. "Can't we just add more writers?" — fair question. But Quipu-Log's tamper-evidence is built on the assumption of a single writer. This chapter covers how to multiply write throughput by N without breaking that assumption: per-tenant sharding.

In one sentence

The invariant to preserve isn't "one global chain" — it's "one writer per chain." Put N independent chains in place and tamper-evidence holds exactly as before while write throughput scales by N.

Why the single writer is the bottleneck

Every event passes through a single writer thread. Internally, a pipeline queue (Ch. 29 async pipeline) provides buffering, but the thread consuming that queue is still just one. One CPU and one disk I/O, shared. To break through that ceiling, you need more writers.

The problem is that if multiple writers write to the same chain simultaneously, the chain head forks. You need consensus to decide which side is authoritative, and the moment a bug enters that consensus logic, the foundation of tamper detection collapses.

The solution: split the chain into multiple chains. Each chain still has exactly one writer. Events from different tenants simply go to different chains.

ShardRouter: write routing + read fan-out

ShardRouter is a thin coordinator that wraps N independent AuditPipelines. The core storage engine (quipu-core) is untouched — each shard is an AuditStore that operates exactly like today's single store, and ShardRouter is their "traffic cop."

ShardRouter routes writes to one shard based on the tenant, then fans reads out across all relevant shards and merges by timestamp. A frozen shard accepts reads only — it will never receive a write again.

Consistent hashing: scale up with minimal redistribution

The ShardMap decides which shard a tenant goes to. A naive tenant_id % N would move every tenant to a different shard whenever the shard count changes — a serious problem, since all previous history now lives on the wrong shard.

The answer is consistent hashing. Virtual nodes for each shard are placed on a hash ring, and the shard for a tenant is found by locating which virtual node comes after the tenant's hash on the ring. When a shard is added, only a fraction of tenants need to move.

crates/quipu-middleware/src/sharding/shard_map.rsconst VNODES_PER_SHARD: u32 = 64; // virtual nodes per shard

pub fn route_write(&self, tenant: &str) -> ShardId {
    let h = self.hash(&tenant);
    // First virtual node on the ring with key ≥ h → the shard that node belongs to.
    match self.ring.binary_search_by_key(&h, |(rh, _)| *rh) {
        Ok(i) => self.ring[i].1,
        Err(i) if i < self.ring.len() => self.ring[i].1,
        Err(_) => self.ring[0].1, // wrap around past the end of the ring
    }
}

Resharding is add-only: records never move

Append-only chains cannot be relocated. Moving a written record to a different shard breaks the chain and destroys tamper-evidence. So resharding works like this.

Freeze the existing shard. From this point on, that shard accepts no writes — ever.
Add a new empty shard and make it active.
All subsequent writes are directed to the new shard by consistent hashing.

The historical records in the frozen shard stay exactly where they are. Reads for a tenant query the current owning shard and all frozen shards (read_targets). Records never move, and the single-writer invariant per chain is never broken.

crates/quipu-middleware/src/sharding/shard_map.rspub fn read_targets(&self, tenant: Option<&str>) -> Vec<ShardId> {
    match tenant {
        Some(t) => {
            // Current active owning shard + all frozen shards
            // (pre-reshard history may live in frozen shards)
            let mut targets = vec![self.route_write(t)];
            targets.extend(self.frozen.iter());
            targets
        }
        None => self.all_shards().collect(), // no tenant filter → fan out to all
    }
}

Caution

Resharding only goes in one direction — you can add shards but never remove them. Plan your shard count carefully upfront. Also, the last remaining active shard cannot be frozen (writes would have nowhere to go).

Cross-shard queries: fan-out + timestamp merge

A query spanning multiple shards issues the same sub-query to each shard and merges the results by (timestamp_micros, log_id) in a k-way merge. Because there is no global total order, the timestamp is the ordering key.

Pagination is the tricky part. Without a global index, each page turn must carry along how far into each shard we've read. That's MultiCursor — a map of independent per-shard cursors.

crates/quipu-middleware/src/sharding/fanout.rs/// A resumable cross-shard cursor. Remembers the position after the last emitted row, per shard.
pub struct MultiCursor {
    per_shard: BTreeMap<u32, String>,  // opaque cursor per shard
    done: BTreeSet<u32>,               // exhausted shards — don't restart them
}

Analogy

Imagine reading several books simultaneously and editing their contents into a single chronological stream. You keep a bookmark in each book, and each time you advance, you compare the next date across all books and read whichever comes earliest. MultiCursor is that collection of bookmarks.

Global order is surrendered — integrity is not

Sharding means there is no single global timeline. Events across different shards are ordered only by clock-based timestamps. This is an intentional trade-off (ADR-2).

Integrity, however, is fully preserved. Each shard independently maintains its chain, checkpoints, and anchoring. On top of that, cross-shard anchoring (ADR-5, GlobalCheckpoint) detects "entire shard deleted or rolled back" — even if every individual shard passes its own verification, any change to the set of shards is caught by the global checkpoint.

Property	Single store	After sharding
Write throughput	1 writer's worth	~N× (number of shards)
Single writer per chain	✅ 1	✅ 1 per shard (invariant)
Global total order	✅ yes	❌ no (clock-based)
Per-shard tamper-evidence	✅	✅ identical
Global integrity	✅	✅ GlobalCheckpoint
Read replica	not possible (active shared)	✅ sealed segments via rsync/snapshot

Read replica: sealed segments are safe to copy as-is

Sealed segments, as described in Ch. 8 segments and rollover, are immutable — not a single byte ever changes. That means you can rsync or snapshot them to another host to create a query-only node. Integrity is unaffected — a copy is verified the same way with CRC and Merkle tree.

The one exception: the active (currently-being-written) segment tail may be incomplete at copy time, so exclude it. Run POST /v1/admin/flush before copying to get a consistent snapshot that includes the tail.

Check yourself

① Explain the difference between "one writer per chain" and "one global writer." How does sharding preserve the former while increasing throughput?
② Tenant A's history is split between shard-0000 (frozen) and shard-0002 (active). Which shards need to be queried? How does read_targets determine this?
③ Why is giving up global total order still sufficient for the purpose of an audit log? (Hint: think about per-shard order and per-shard checkpoints.)