The Quipu-Log Book
Part 8 · Distribution, operations, scaling

33 · Single point of failure and availability: the client spool

The Quipu-Log server is a single process. That simplicity is what lets a single file lock and an unbroken hash chain per table make tamper-evidence straightforward. But there is a price — what happens to audit records while the daemon is down? This chapter covers the design that solves that "single point of failure" problem, not on the server side, but on the client side.

In one sentence

To avoid losing audit records when the daemon goes down, rather than complicating the server, the client holds them on disk and retransmits later. "Lost during downtime" becomes "delayed during downtime."

The fate of a single writer: SPOF

As covered in Ch. 13 (single-writer), the Quipu-Log store places a file lock on the store root directory and allows exactly one process to write. That simplicity is what makes tamper detection possible: if multiple nodes wrote to the same chain simultaneously, you would need consensus to decide who is authoritative, and the moment a bug slips into that consensus logic, "not tampered" weakens to "probably not tampered."

But a single process = a single point of failure (SPOF). Daemon restarts, deployments, crashes — during any of these brief windows, what are the services sending audit events supposed to do?

Why this design?

The server-side solution would be to elect a leader across multiple nodes using a consensus protocol like Raft. But that requires distributed consensus over the chain head, making the audit integrity guarantee far more complex. Quipu-Log chose the opposite direction — keep the server simple and shift the availability burden to the client. The client-side solution only needs a single local file.

Three weapons: idempotent key · backoff · spool

The quipu-client crate is the reference implementation for this client-side durability. It has three layers.

[Figure: a service calls emit(event), which passes through three layers. ① Generate idempotent key: new_idempotency_key(), UUIDv4, one per event. ② Exponential backoff retry: Backoff{base:100ms, max_retries:6, full jitter}, sending POST /v1/logs to quipu-server. ③ Disk spool: Spool::append() + fsync, reached when retries are exhausted and replayed by drain_spool() after the server recovers. Note in the figure: occurred_at is preserved, so even if an event arrives late, its action timestamp is the value the client stamped at emit time.]
Three layers of client-side durability. If a retry succeeds, the spool is never opened. If retries are exhausted and the event lands in the spool, drain_spool() replays it after the server recovers.
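
The flow in the figure can be sketched as one function. This is a minimal illustration, not the quipu-client API: `emit`, `Outcome`, and the three closures are hypothetical names, and the sketch assumes one initial send followed by retries for as long as the delay function returns a value.

```rust
use std::time::Duration;

/// Where an event ended up (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The server acknowledged the event after this many retries.
    Delivered { retries: u32 },
    /// Retries were exhausted and the event was handed to the spool.
    Spooled,
}

/// Sends one event through the three layers.
///
/// * `key`   - the idempotency key, generated once and reused for every send
/// * `send`  - returns true when the server acknowledged the event
/// * `delay` - returns the wait before retry `n` (1-based), or None once
///             the retry budget is used up
/// * `spool` - persists the event locally for later replay
pub fn emit(
    key: &str,
    mut send: impl FnMut(&str) -> bool,
    delay: impl Fn(u32) -> Option<Duration>,
    mut spool: impl FnMut(&str),
) -> Outcome {
    // Initial attempt, no waiting.
    if send(key) {
        return Outcome::Delivered { retries: 0 };
    }
    // Retries: the same key goes out every time.
    let mut attempt = 1;
    while let Some(wait) = delay(attempt) {
        std::thread::sleep(wait);
        if send(key) {
            return Outcome::Delivered { retries: attempt };
        }
        attempt += 1;
    }
    // Budget exhausted: the spool is touched only on this path.
    spool(key);
    Outcome::Spooled
}
```

The spool closure is called only on the last path, which matches the caption: if any send succeeds, the spool is never opened.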

① Idempotent retransmission: retrying never double-records

If the server received a request but the connection dropped before the response arrived, the client has no way to know whether it succeeded or failed. Retrying naively can record the same event twice. Idempotency keys prevent this.

crates/quipu-client/src/retry.rs
/// One UUIDv4 per event. The same key is reused for every retransmission.
pub fn new_idempotency_key() -> String {
    let mut bytes = [0u8; 16];
    rand::RngCore::fill_bytes(&mut rand::thread_rng(), &mut bytes);
    // RFC 4122 version 4, variant 1 — the same key is reused across retries.
    bytes[6] = (bytes[6] & 0x0f) | 0x40;
    bytes[8] = (bytes[8] & 0x3f) | 0x80;
    // ... hex encoding ...
}
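
The excerpt elides the hex-encoding step. As an illustration only, here is one way to turn 16 bytes into a key string. The helper name `format_uuid_v4` and the canonical hyphenated 8-4-4-4-12 layout are assumptions of this sketch; the crate may encode the key differently. The function takes the random bytes as input so that its behavior is deterministic.

```rust
/// Hypothetical helper: stamps the RFC 4122 version and variant bits onto
/// 16 caller-supplied bytes and renders them as a hyphenated hex string.
pub fn format_uuid_v4(mut bytes: [u8; 16]) -> String {
    // High nibble of byte 6 carries the version (4).
    bytes[6] = (bytes[6] & 0x0f) | 0x40;
    // Top two bits of byte 8 carry the variant (binary 10).
    bytes[8] = (bytes[8] & 0x3f) | 0x80;

    let mut out = String::with_capacity(36);
    for (i, b) in bytes.iter().enumerate() {
        // Hyphens precede bytes 4, 6, 8 and 10, giving 8-4-4-4-12 hex digits.
        if matches!(i, 4 | 6 | 8 | 10) {
            out.push('-');
        }
        out.push_str(&format!("{:02x}", b));
    }
    out
}
```

With all-zero input, only the version and variant bits are set, so the result is `00000000-0000-4000-8000-000000000000`.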

The server keeps a sliding window of recently accepted keys in memory (65,536 by default). When the same key arrives again, it doesn't write anything — it responds with "status":"duplicate". This way the client can retry freely, and the server filters out duplicates.
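
A bounded window of this kind can be sketched with a set for lookups and a queue for eviction order. This is an assumption about the shape of the structure, not the server's actual code; `IdempotencyWindow` and `check_and_insert` are hypothetical names.

```rust
use std::collections::{HashSet, VecDeque};

/// Remembers the most recent `capacity` keys (hypothetical sketch).
pub struct IdempotencyWindow {
    capacity: usize,
    order: VecDeque<String>, // oldest key at the front
    seen: HashSet<String>,   // same keys, for O(1) membership checks
}

impl IdempotencyWindow {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "window must hold at least one key");
        Self {
            capacity,
            order: VecDeque::with_capacity(capacity),
            seen: HashSet::with_capacity(capacity),
        }
    }

    /// Returns true if `key` is new (the caller should write the record)
    /// and false if it is a duplicate still inside the window.
    pub fn check_and_insert(&mut self, key: &str) -> bool {
        if self.seen.contains(key) {
            return false;
        }
        // Window full: forget the oldest key to make room.
        if self.order.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.order.push_back(key.to_owned());
        self.seen.insert(key.to_owned());
        true
    }
}
```

Once a key has been evicted, a late retransmission of it is treated as new. A window sized like the default of 65,536 keys only has to outlast the retry schedule, not the whole history.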

Caution

The idempotency window lives in memory and disappears when the server restarts. A retransmission that straddles a restart can be recorded twice. However, both records will carry the same occurred_at, so the duplicate can be detected after the fact — this is a designed boundary, not a gap.

② Exponential backoff + full jitter: taming the thundering herd when the server comes back

When a server recovers from a failure, hundreds of waiting clients all try to reconnect at once. If they all retry on a fixed schedule, the next interval triggers another storm. Backoff solves this with full jitter.

crates/quipu-client/src/retry.rs
impl Backoff {
    pub fn delay(&self, attempt: u32) -> Option<Duration> {
        if attempt == 0 || attempt > self.max_retries { return None; }
        let exp = self.base.as_secs_f64() * self.multiplier.powi((attempt - 1) as i32);
        let capped = exp.min(self.max_delay.as_secs_f64());
        // Uniform random in [0, capped] — clients don't all pile in on the same tick.
        let jittered = rand::Rng::gen_range(&mut rand::thread_rng(), 0.0..=capped);
        Some(Duration::from_secs_f64(jittered))
    }
}

The defaults are: base 100 ms, multiplier 2.0, max delay 30 s, max retries 6. Once all six retries have failed, the event passes to the next layer.
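
To see what these defaults mean in practice, the sketch below computes the upper bound of the jitter range for each retry using the same formula as delay(), with the random draw left out. `BackoffParams` is a hypothetical stand-in for the crate's Backoff type, used here so the numbers are reproducible.

```rust
use std::time::Duration;

/// Hypothetical mirror of the parameters quoted in the text.
pub struct BackoffParams {
    pub base: Duration,
    pub multiplier: f64,
    pub max_delay: Duration,
    pub max_retries: u32,
}

impl BackoffParams {
    pub fn defaults() -> Self {
        Self {
            base: Duration::from_millis(100),
            multiplier: 2.0,
            max_delay: Duration::from_secs(30),
            max_retries: 6,
        }
    }

    /// Upper bound of the jitter range for retry `attempt` (1-based),
    /// or None when the attempt is outside the retry budget.
    pub fn ceiling(&self, attempt: u32) -> Option<Duration> {
        if attempt == 0 || attempt > self.max_retries {
            return None;
        }
        let exp = self.base.as_secs_f64() * self.multiplier.powi((attempt - 1) as i32);
        Some(Duration::from_secs_f64(exp.min(self.max_delay.as_secs_f64())))
    }

    /// Worst case: every jittered delay lands exactly on its ceiling.
    pub fn worst_case_total(&self) -> Duration {
        (1..=self.max_retries).filter_map(|a| self.ceiling(a)).sum()
    }
}
```

Under these assumptions the ceilings are 100, 200, 400, 800, 1600, and 3200 ms, so the 30 s cap is never reached within six retries. The worst-case total wait before spooling is 6.3 s, and because each delay is uniform between zero and its ceiling, the expected total is about half of that.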

③ Disk spool: turning "lost" into "delayed"

If the server still isn't responding after all retries are exhausted, the event is written to a local disk file. That file is the Spool.

crates/quipu-client/src/spool.rs
impl Spool {
    pub fn append(&mut self, record: &SpoolRecord) -> io::Result<()> {
        let payload = serde_json::to_vec(record)?;
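        // TryFromIntError has no conversion into io::Error, so map it explicitly.
        let len = u32::try_from(payload.len())
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidInput, e))?;
        let crc = crc32fast::hash(&payload);
        // [len][crc32][payload] frame, identical to quipu-core segment framing.
        // Assembly of `frame` from `len`, `crc`, and `payload` is elided in this excerpt.
        self.file.write_all(&frame)?;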
        self.file.sync_data()?;  // fsync. This is the whole point of the spool.
        Ok(())
    }
}

The spool frame uses exactly the same [len][crc32][payload] structure as the record framing in Ch. 9. A torn tail left by a crash is detected by the CRC check and trimmed on open. When the server recovers, drain_spool() retransmits records oldest-first and removes only the ones that succeeded, using an atomic rename.
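
The framing and torn-tail check can be sketched over an in-memory buffer. Several details here are assumptions: the little-endian byte order of the two header fields, the helper names `encode_frame` and `scan`, and the inline bitwise CRC-32, which stands in for crc32fast so the sketch has no dependencies.

```rust
/// Standard CRC-32 (IEEE, reflected polynomial 0xEDB88320), bit by bit.
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, else zero.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Builds one [len][crc32][payload] frame (byte order is an assumption).
pub fn encode_frame(payload: &[u8]) -> Vec<u8> {
    assert!(payload.len() <= u32::MAX as usize, "payload too large for a u32 length");
    let mut frame = Vec::with_capacity(8 + payload.len());
    frame.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    frame.extend_from_slice(&crc32(payload).to_le_bytes());
    frame.extend_from_slice(payload);
    frame
}

/// Walks the buffer frame by frame. Returns the intact payloads and the
/// length of the valid prefix; everything past that offset is a torn or
/// corrupt tail that the caller can truncate away.
pub fn scan(buf: &[u8]) -> (Vec<&[u8]>, usize) {
    let mut records = Vec::new();
    let mut pos = 0;
    while buf.len() - pos >= 8 {
        let len = u32::from_le_bytes([buf[pos], buf[pos + 1], buf[pos + 2], buf[pos + 3]]) as usize;
        let crc = u32::from_le_bytes([buf[pos + 4], buf[pos + 5], buf[pos + 6], buf[pos + 7]]);
        let start = pos + 8;
        let end = match start.checked_add(len) {
            Some(e) if e <= buf.len() => e,
            _ => break, // payload runs past the end: torn write
        };
        if crc32(&buf[start..end]) != crc {
            break; // checksum mismatch: stop at the last good frame
        }
        records.push(&buf[start..end]);
        pos = end;
    }
    (records, pos)
}
```

On open, a spool built this way would truncate the file to the returned prefix length, so the next append starts on a clean frame boundary.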

DB ↔ Filesystem

In a DB, the client library manages a connection pool and typically throws an exception when the connection drops. In Quipu-Log, the client directly implements the contract "events are never lost even if the server is gone" — by saving them to a local file and replaying later.

occurred_at: the record carries the time of the action, however late it arrives

Events replayed from the spool might reach the server minutes or hours late. But someone reading the server log cares about when the action actually happened, not when it arrived.

That's what the occurred_at field is for. The client stamps "now" when it creates the event and sends it along; the server writes that value as-is into the log. However late the delivery, the timestamp in the log is the time of the original action. Latency is introduced, but the accuracy of the record is preserved.
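
The key point is that the timestamp is taken once, when the event is created, and every later send reuses the stored value. A minimal sketch, assuming a millisecond Unix timestamp and hypothetical names (`Event`, `replay`) and field layout:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical event shape for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub idempotency_key: String,
    /// Milliseconds since the Unix epoch, stamped at creation.
    pub occurred_at_ms: u64,
    pub action: String,
}

impl Event {
    /// Stamps "now" exactly once; nothing later updates the field.
    pub fn new(idempotency_key: &str, action: &str) -> Self {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock is set before the Unix epoch");
        Self {
            idempotency_key: idempotency_key.to_owned(),
            occurred_at_ms: now.as_millis() as u64,
            action: action.to_owned(),
        }
    }
}

/// What a retry or a spool replay sends: the stored event, unchanged.
/// It deliberately does not read the clock again.
pub fn replay(stored: &Event) -> Event {
    stored.clone()
}
```

Because replay() never touches the clock, an event delivered hours later still reports when the action happened rather than when it arrived.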

Cold standby vs. live failover

Even with clients buffering, if the server takes a long time to recover the spool keeps growing. Cold standby keeps that window short.

| Approach | Description | Prerequisite |
|---|---|---|
| Cold standby | Keep a copy of the store root on another host ahead of time; when trouble strikes, start the daemon there | No 2nd writer: bring down the primary first, then start the standby |
| Live failover (Raft etc.) | Leader-follower replication, automatic leader election | Requires consensus over the chain head → outside Quipu-Log's scope |

The cold standby procedure is straightforward. Gracefully stop the primary with SIGTERM (or POST /v1/admin/flush) to stabilize the active segment, then copy the store root to the standby host. Sealed segments are immutable, so only the last active tail needs to be copied incrementally. Start the daemon on the standby — it acquires the lock, trims any torn tail, and is ready to serve.

Restart time is proportional to the active segment size. Keeping max_segment_bytes reasonably small makes both restarts and failovers faster because, as covered in Ch. 12 (crash recovery), sealed segments are not re-scanned on open.

Check yourself

① Why is the idempotent key "one per event, identical across all retries"? What would happen if you generated a fresh key on every attempt?
② If there were no full jitter and clients retried on a fixed schedule, what would happen right after the server recovered?
③ Even with a spool in place, there are cases where true "zero loss" cannot be guaranteed. When? (Hint: think about where the spool file itself lives.)