30 · Reliability: retries, backoff, DLQ, idempotency

An audit log needs at-least-once delivery, not merely best-effort. Events must not silently vanish when the disk fills up briefly, the server restarts, or an I/O error occurs. This chapter looks at how Quipu-Log keeps that promise.

In one sentence

Write failure → retry (exponential backoff + jitter) → if all retries fail, park the event in the DLQ on disk → replay later with redrive. If even the DLQ write fails, a fallback hook is called, so no event disappears silently.

Retry and exponential backoff

When the writer thread fails to write an event, it doesn't give up immediately. It retries up to max_retries times (default 3), as configured in PipelineConfig, waiting between attempts. Lengthening that wait by a constant factor on each attempt is exponential backoff.
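The retry loop described above can be sketched as follows. This is a minimal illustration, not Quipu-Log's writer code: the names write_with_retry and backoff_delay, the 100 ms base, the multiplier of 2, and the 5 s cap are all assumptions chosen for the example. The sleep function is passed in so the loop can be exercised without actually waiting.

```rust
use std::time::Duration;

/// Hypothetical helper: the un-jittered wait before retry number `attempt` (1-based).
fn backoff_delay(base: Duration, multiplier: f64, max_delay: Duration, attempt: u32) -> Duration {
    let exp = base.as_secs_f64() * multiplier.powi(attempt.saturating_sub(1) as i32);
    Duration::from_secs_f64(exp.min(max_delay.as_secs_f64()))
}

/// Runs `op` once, then retries it up to `max_retries` more times.
/// Returns the first success, or the last error once retries are exhausted.
fn write_with_retry<T, E>(
    max_retries: u32,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) if attempt >= max_retries => return Err(e),
            Err(_) => {
                attempt += 1;
                // Assumed parameters: 100 ms base, doubling each time, capped at 5 s.
                sleep(backoff_delay(
                    Duration::from_millis(100),
                    2.0,
                    Duration::from_secs(5),
                    attempt,
                ));
            }
        }
    }
}
```

With max_retries set to 3, an operation that always fails is attempted four times in total (one initial attempt plus three retries), with waits of 100 ms, 200 ms, and 400 ms in between.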

The client-side retry logic goes one step further: the Backoff struct applies full jitter on top of the exponential increase.

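crates/quipu-client/src/retry.rs

pub fn delay(&self, attempt: u32) -> Option<Duration> {
    // Attempts are 1-based; 0 or anything past max_retries means "stop retrying".
    if attempt == 0 || attempt > self.max_retries { return None; }
    // Exponential growth: base * multiplier^(attempt - 1), capped at max_delay.
    let exp = self.base.as_secs_f64() * self.multiplier.powi((attempt - 1) as i32);
    let capped = exp.min(self.max_delay.as_secs_f64());
    // Full jitter: pick the actual wait uniformly from [0, capped].
    let jittered = rand::Rng::gen_range(&mut rand::thread_rng(), 0.0..=capped);
    Some(Duration::from_secs_f64(jittered))
}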

Why does jitter matter? When a server goes down and comes back up, hundreds of clients that have been queuing events may all start retrying at the same moment — a thundering herd. Adding randomness to the wait time spreads those retries across different moments in time, distributing the load.
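To make the numbers concrete, here is a small deterministic variant of the delay excerpt. It is a sketch under stated assumptions: the field names follow the excerpt, but the struct layout, the ceiling helper, and the delay_with method are hypothetical, and the random draw is replaced by a caller-supplied sample u in [0, 1] so the bounds can be checked without the rand crate.

```rust
use std::time::Duration;

/// Field names follow the `delay` excerpt; the struct layout itself is assumed.
struct Backoff {
    base: Duration,
    multiplier: f64,
    max_delay: Duration,
    max_retries: u32,
}

impl Backoff {
    /// Upper bound of the jitter window for a 1-based `attempt`.
    fn ceiling(&self, attempt: u32) -> Option<Duration> {
        if attempt == 0 || attempt > self.max_retries {
            return None;
        }
        let exp = self.base.as_secs_f64() * self.multiplier.powi((attempt - 1) as i32);
        Some(Duration::from_secs_f64(exp.min(self.max_delay.as_secs_f64())))
    }

    /// Full jitter: scale the ceiling by a uniform sample `u`, clamped to [0, 1].
    fn delay_with(&self, attempt: u32, u: f64) -> Option<Duration> {
        self.ceiling(attempt).map(|c| c.mul_f64(u.clamp(0.0, 1.0)))
    }
}
```

With a 100 ms base, a multiplier of 2, and a 1 s cap (values picked for the example), the ceilings for attempts 1 through 5 are 100, 200, 400, 800, and 1000 ms. Each client's actual wait lands anywhere between zero and that ceiling, which is what spreads the retries out.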

Analogy

If everyone calls emergency services at the exact same second after an earthquake, the switchboard collapses. Telling people "call at a random time within the next 30 seconds" spreads the load evenly — that's full jitter.

Idempotency keys: keeping retries from creating duplicate records

Retries have a trap. If the server saved the event but the response was lost in transit, the client sees a failure and retries — and the same event gets recorded twice. Idempotency keys prevent this.

crates/quipu-client/src/retry.rs

// Generated once per event; the same key is used for every retry
pub fn new_idempotency_key() -> String {
    // Generates a UUIDv4 (RFC 4122 version 4, variant 1)
    // A second request carrying the same key is recognized as a duplicate and ignored
}

The key is generated when the event is first created and stored alongside it in the client-side spool. Even if the process restarts, the key survives to prevent duplicates.
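The receiving side of this scheme might look like the sketch below. The Receiver type and its ingest method are hypothetical and only illustrate the idea; how Quipu-Log actually tracks seen keys is not shown in this chapter. The point is that a retry carrying an already-seen key is acknowledged but not stored a second time.

```rust
use std::collections::HashSet;

/// Hypothetical receiver that remembers every idempotency key it has stored.
#[derive(Default)]
struct Receiver {
    seen: HashSet<String>,
    stored: Vec<String>,
}

impl Receiver {
    /// Returns true if the event was newly stored, false if the key was a duplicate.
    fn ingest(&mut self, idempotency_key: &str, payload: &str) -> bool {
        // `insert` returns false when the key was already present.
        if !self.seen.insert(idempotency_key.to_owned()) {
            return false; // retry of an event we already have: do not store again
        }
        self.stored.push(payload.to_owned());
        true
    }
}
```

Note that deduplication is by key, not by content: two distinct events with identical payloads but different keys are both stored, which is why the key must be created once per event and reused for every retry of that event.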

DLQ: when all retries are exhausted

What if every retry fails? Rather than dropping the event, Quipu-Log parks it in a Dead-Letter Queue (DLQ). The DLQ is another append-only segment file under <store root>/dlq/ — a separate store from the main log. That separation means that even if the main disk is full (ENOSPC), a DLQ write is still attempted, and if the process dies, the file remains on disk.

An event either succeeds, gets parked in the DLQ, or as a last resort is delivered to the fallback hook. It never disappears silently.
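Those three outcomes can be written down as a small decision function. This is a sketch, not the pipeline's real code: the Outcome enum and the deliver function are names invented for the example, and the writers are passed in as closures so the flow can be tested in isolation. Backoff between attempts is omitted for brevity.

```rust
/// The three ways an event can leave the pipeline, per the text above.
#[derive(Debug, PartialEq)]
enum Outcome {
    Written,
    DeadLettered,
    FallbackInvoked,
}

/// Tries the main log (one attempt plus `max_retries` retries), then the DLQ,
/// and finally hands the DLQ error to the fallback hook.
fn deliver<E>(
    max_retries: u32,
    mut write_main: impl FnMut() -> Result<(), E>,
    write_dlq: impl FnOnce() -> Result<(), E>,
    fallback: impl FnOnce(&E),
) -> Outcome {
    for _ in 0..=max_retries {
        if write_main().is_ok() {
            return Outcome::Written;
        }
        // A real pipeline would wait here (exponential backoff) before retrying.
    }
    match write_dlq() {
        Ok(()) => Outcome::DeadLettered,
        Err(e) => {
            fallback(&e);
            Outcome::FallbackInvoked
        }
    }
}
```

Every path through the function ends in one of the three variants, so there is no branch in which the event is simply dropped.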

redrive: replaying the DLQ

Events sitting in the DLQ are replayed by calling handle.redrive_dlq(&admin_role). redrive is designed to be crash-safe. It first writes the replay results (successes and further failures alike) to a staging directory and fsyncs everything, then deletes the old DLQ and renames staging into place. If the process dies mid-replay, events that already succeeded may be replayed again (at-least-once), but nothing is lost.
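The stage-then-swap ordering can be illustrated with plain std::fs calls. This is a simplified sketch under several assumptions: the events.log file name, the dlq.staging directory name, the one-event-per-line format, and the replay closure are all invented for the example and are not Quipu-Log's real layout. It stages only the events that still fail, and it leaves out startup recovery and directory fsyncs, which a production version would need.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Replays `<root>/dlq/events.log` line by line. Events that `replay` rejects
/// are staged, made durable, and only then swapped in for the old DLQ.
/// Returns the number of events replayed successfully.
fn redrive(root: &Path, mut replay: impl FnMut(&str) -> bool) -> io::Result<usize> {
    let dlq = root.join("dlq");
    let staging = root.join("dlq.staging");

    // A leftover staging directory means an earlier run died part-way through.
    // Deciding how to recover is out of scope here, so refuse rather than guess.
    if staging.exists() {
        return Err(io::Error::new(
            io::ErrorKind::AlreadyExists,
            "leftover staging directory; recover before redriving",
        ));
    }
    fs::create_dir(&staging)?;

    let pending = fs::read_to_string(dlq.join("events.log"))?;
    let mut out = File::create(staging.join("events.log"))?;
    let mut replayed = 0;
    for line in pending.lines() {
        if replay(line) {
            replayed += 1;
        } else {
            writeln!(out, "{}", line)?; // still failing: keep it for the next redrive
        }
    }
    out.sync_all()?; // staged data must be durable before the old DLQ is touched
    drop(out);

    fs::remove_dir_all(&dlq)?; // old DLQ goes away only after staging is safe
    fs::rename(&staging, &dlq)?;
    Ok(replayed)
}
```

If the process dies anywhere before remove_dir_all, the old DLQ is still intact, so a later run replays everything again. That is where the possible duplicates come from, and also why no event is lost.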

Caution

Because the guarantee is at-least-once, redrive can occasionally produce duplicate records. In an audit log, losing an event is far worse than recording it twice, so at-least-once is the right trade-off here.

The last line of defense: the fallback hook

If even the DLQ write fails (say, the disk is completely full), the event can no longer be persisted anywhere. Rather than dropping it silently, Quipu-Log calls the FallbackFn hook, a closure registered when the pipeline is started, where you can wire up an alert, write to stderr, or do whatever makes sense for your situation.

examples/axum-demo/src/main.rs

AuditPipeline::start(
    store, root,
    PermissionPolicy::allow_all(),
    PipelineConfig::default(),
    Some(Arc::new(|event, err| {
        eprintln!("AUDIT FALLBACK: {} {} failed: {err}", event.method, event.url);
    })),
)?;

Check yourself

① Explain why jitter is added on top of exponential backoff, using the term "thundering herd."
② Why is the idempotency key generated once per event and reused across every retry?
③ Why does redrive process events in the order "delete old DLQ → rename staging" rather than the other way around?