11 · Durability: fsync policy and group commit

If you've used a database, "commit once and it lives forever" probably felt like a given. But that promise doesn't appear by magic. Inside a DB, someone is calling fsync at just the right moments — and around the question of when to fsync, a constant tug-of-war plays out between durability and throughput. This chapter looks at how Quipu-Log surfaces that tug-of-war as a first-class policy at the file level.

In one sentence

SyncPolicy is a durability policy expressed in code — it answers "how often do we call fsync?" The safer you go, the slower things get; the faster you go, the wider the window of data you could lose on a power cut.

First, what you already know: commit durability in a DB

When you hit COMMIT in a relational DB, the engine flushes the WAL record to disk (fsync) and only then returns "success." That's the D in ACID — Durability. The catch is that calling fsync on every transaction is expensive. A single fsync on a spinning disk typically costs hundreds of microseconds to a few milliseconds; even an SSD usually takes tens of microseconds.

That's why high-performance databases have long relied on group commit. When several transactions request a commit at nearly the same time, the engine batches them and issues a single fsync for the whole group. "Write N records in one shot; hand success back to all N callers." Throughput climbs sharply, but there is a price. Either each caller waits for the shared fsync before it sees success, or the engine acknowledges early and anything acknowledged since the last fsync is gone if the power dies.
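To put rough numbers on the batching argument, here is a minimal cost model. It assumes every append has a fixed cost without fsync and that one fsync is shared by the whole batch. The function name and the sample costs are illustrative, not measurements from any engine.

```rust
/// Estimated appends per second when one fsync is shared by `batch` appends.
///
/// `write_us` is the cost of one append without fsync and `fsync_us` is the
/// cost of one fsync, both in microseconds. Both are illustrative inputs.
pub fn appends_per_sec(write_us: f64, fsync_us: f64, batch: u32) -> f64 {
    // A batch of zero makes no sense; treat it as one append per fsync.
    let batch = f64::from(batch.max(1));
    // Time for one whole batch: N writes plus a single shared fsync.
    let batch_time_us = batch * write_us + fsync_us;
    batch * 1_000_000.0 / batch_time_us
}
```

With a 10 µs write and a 1 ms fsync, a batch of 1 gives about 990 appends/s and a batch of 64 gives about 39,000 appends/s. The gain is large but smaller than 64×, because the per-append write cost does not shrink with batching.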

DB ↔ Filesystem

In a DB, commit durability and group commit are hidden inside the engine. Developers tune them through parameters like innodb_flush_log_at_trx_commit or PostgreSQL's synchronous_commit. In Quipu-Log, the same decision is exposed directly in code as the SyncPolicy enum — a deliberate design choice to keep nothing hidden.

Two buffers: BufWriter and the OS page cache

To make sense of the fsync policy you first need to see that "a write" is really two separate steps. Ch. 5 covers page cache and fsync in depth, but a quick recap is useful here.

A write has two stages. flush() hands the user-space buffer off to the OS; fsync() forces the OS to commit it to disk. A power cut erases anything still in the OS cache.

Quipu-Log writes segment files through a BufWriter. It holds a 256 KB user-space buffer and passes data to the OS page cache when the buffer fills or when you call flush() explicitly. One more step — calling fsync — is needed to guarantee the data has actually reached the disk.

crates/quipu-core/src/storage/segment.rs

pub fn flush(&mut self) -> Result<()> {
    self.writer.flush()?;        // user space → OS page cache
    Ok(())
}

pub fn sync(&mut self) -> Result<()> {
    self.writer.flush()?;
    self.writer.get_ref().sync_data()?; // OS cache → disk (fdatasync)
    Ok(())
}
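The two stages are easy to watch with nothing but the standard library. The sketch below is not Quipu-Log code: `write_two_stage` is a hypothetical helper that writes one small record through a `BufWriter` and reports the file length the OS sees before and after `flush()`. It assumes the record is smaller than the buffer, so the write stays in user space until the flush.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Writes `record` to a fresh file at `path` and returns the file length
/// reported by the OS (before flush, after flush).
pub fn write_two_stage(path: &Path, record: &[u8]) -> io::Result<(u64, u64)> {
    let file = File::create(path)?;
    // 256 KB user-space buffer, the size quoted in the text.
    let mut writer = BufWriter::with_capacity(256 * 1024, file);

    // The record sits in the user-space buffer; the OS has not seen it yet.
    writer.write_all(record)?;
    let before_flush = writer.get_ref().metadata()?.len();

    // Stage 1: user space -> OS page cache.
    writer.flush()?;
    let after_flush = writer.get_ref().metadata()?.len();

    // Stage 2: OS page cache -> disk (fdatasync on Linux).
    writer.get_ref().sync_data()?;

    Ok((before_flush, after_flush))
}
```

For an 8-byte record the helper returns `(0, 8)`: the OS knows nothing about the data until `flush()`, and only `sync_data()` asks it to push the data out to the device.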

Three policies: SyncPolicy

The decision of when to invoke each of those two steps is what SyncPolicy encodes.

crates/quipu-core/src/store.rs

pub enum SyncPolicy {
    Always,       // fsync after every append. Safest, slowest.
    EveryN(u32),  // fsync after every N appends; otherwise only flush.
    OsManaged,    // Never fsync explicitly; rely on the OS to write back. Fastest.
}

Let's take them one at a time.

Always — fsync on every append

Every call to append is followed immediately by sync(). Even if the power dies a moment later, everything up through the most recent append is guaranteed to have made it to disk. The exposure window is essentially zero. The trade-off: every append involves a round-trip to disk, so throughput tops out in the hundreds-to-thousands of events per second range.

EveryN(n) — fsync every N appends (default: 64)

This is group commit. For the first N − 1 appends the data goes only as far as the OS cache (flush only); on the Nth append, sync() drives everything to disk at once. A power cut can lose up to N − 1 appends since the last fsync. But fsync calls drop by a factor of N, so throughput rises sharply, approaching that same factor when fsync dominates the cost of an append.

crates/quipu-core/src/store.rs: apply_sync_policy()

match self.cfg.sync_policy {
    SyncPolicy::Always => self.sync_all()?,
    SyncPolicy::EveryN(n) => {
        self.appends_since_sync += 1;
        if self.appends_since_sync >= n {
            self.sync_all()?;   // Nth append: fsync
        } else {
            self.logs.flush()?;       // otherwise: flush only
            self.relations.flush()?;
        }
    }
    SyncPolicy::OsManaged => {
        self.logs.flush()?;
        self.relations.flush()?;
    }
}
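To see how often each branch fires, the match above can be replayed against a counter-only stand-in. This is a sketch, not the real store: `SimStore` and `simulate` are hypothetical, nothing touches disk, and it assumes that `sync_all()` resets `appends_since_sync`, which the excerpt does not show.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SyncPolicy {
    Always,
    EveryN(u32),
    OsManaged,
}

/// Hypothetical stand-in for the store: counts operations, touches no disk.
#[derive(Debug, Default)]
pub struct SimStore {
    pub appends_since_sync: u32,
    pub fsyncs: u32,
    pub flushes: u32,
}

impl SimStore {
    fn sync_all(&mut self) {
        self.fsyncs += 1;
        // Assumption: a sync restarts the count toward the next batch.
        self.appends_since_sync = 0;
    }

    /// Mirrors the branch structure of the excerpt, one call per append.
    pub fn apply_sync_policy(&mut self, policy: SyncPolicy) {
        match policy {
            SyncPolicy::Always => self.sync_all(),
            SyncPolicy::EveryN(n) => {
                self.appends_since_sync += 1;
                if self.appends_since_sync >= n {
                    self.sync_all(); // Nth append: fsync
                } else {
                    self.flushes += 1; // otherwise: flush only
                }
            }
            SyncPolicy::OsManaged => self.flushes += 1,
        }
    }
}

/// Runs `appends` appends under `policy` and returns the final counters.
pub fn simulate(policy: SyncPolicy, appends: u32) -> SimStore {
    let mut store = SimStore::default();
    for _ in 0..appends {
        store.apply_sync_policy(policy);
    }
    store
}
```

Running 200 appends under `EveryN(64)` yields 3 fsyncs (after appends 64, 128, and 192) and leaves 8 appends that have only been flushed. Those 8 are what a power cut at that moment would put at risk.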

sync_all() follows a specific fsync order: the registry first, then the log tables. The code comment explains why — you can't have a situation where a log record is on disk but the registry version it references isn't.
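The ordering rule can be sketched in a few lines. The real sync_all() is not shown in this chapter, so everything here is hypothetical: `Component` stands in for anything that can be forced to disk, and the `order` vector exists only so the sequence can be inspected.

```rust
use std::io;

/// Hypothetical stand-in for one on-disk component (the registry or a log table).
pub struct Component {
    pub name: &'static str,
    pub synced: bool,
}

impl Component {
    pub fn new(name: &'static str) -> Self {
        Self { name, synced: false }
    }

    /// Stand-in for flush() followed by fsync; records its name in `order`.
    pub fn sync(&mut self, order: &mut Vec<&'static str>) -> io::Result<()> {
        self.synced = true;
        order.push(self.name);
        Ok(())
    }
}

/// Syncs the registry first, then every log table, and returns the order used.
/// If a crash interrupts the sequence, the registry on disk is never older
/// than a log record that survived.
pub fn sync_all(
    registry: &mut Component,
    logs: &mut [Component],
) -> io::Result<Vec<&'static str>> {
    let mut order = Vec::new();
    registry.sync(&mut order)?; // must complete before any log is synced
    for log in logs.iter_mut() {
        log.sync(&mut order)?;
    }
    Ok(order)
}
```

Because `?` returns early on failure, an error while syncing the registry stops the function before any log table is synced, which preserves the same invariant on the error path.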

OsManaged — no explicit fsync

Only flush() is called, handing data to the OS cache; the OS decides when to write it to disk. A power cut loses everything still sitting in that cache. In exchange, there are no disk round-trips, so this is the fastest option. It's appropriate for development and testing environments where losing audit data is acceptable, or for servers protected by a UPS.
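The exposure window of the three policies fits in one small function. This is a sketch: the enum is redeclared so the block compiles on its own, and `max_lost_appends` is a hypothetical helper, not part of Quipu-Log.

```rust
pub enum SyncPolicy {
    Always,
    EveryN(u32),
    OsManaged,
}

/// Upper bound on the number of appends a power cut can lose,
/// or `None` when the policy sets no bound.
pub fn max_lost_appends(policy: &SyncPolicy) -> Option<u32> {
    match policy {
        // Every append is fsynced before the next one starts.
        SyncPolicy::Always => Some(0),
        // At most N - 1 appends have been flushed but not yet fsynced.
        SyncPolicy::EveryN(n) => Some(n.saturating_sub(1)),
        // The OS decides when to write back, so nothing bounds the loss.
        SyncPolicy::OsManaged => None,
    }
}
```

Returning `Option<u32>` rather than a sentinel value keeps "no bound" visibly different from "zero", which is the distinction between OsManaged and Always.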

The performance numbers: how much difference does the policy make?

These figures come directly from the README benchmark (Apple M4, NVMe SSD, rustc 1.96, release build).

SyncPolicy	Durable throughput	Exposure window
`OsManaged`	~56,000 events/s	Unbounded (until the OS writes back)
`EveryN(64)` (default)	~4,800 events/s	Up to 63 events
`Always`	(~hundreds–thousands events/s)	0 (guaranteed through last append)

The reason EveryN(64) is the default becomes clear. It's roughly 12× slower than OsManaged (56,000 / 4,800 ≈ 11.7), but the exposure window shrinks to at most 63 events, which in practice is about the best balance you can strike for an audit log. Always is the right choice for environments such as finance and healthcare, where not a single event can be lost.
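The benchmark figures also support a quick back-of-envelope check. The helpers below are hypothetical. The last one relies on a crude assumption: that the only difference between OsManaged and EveryN(64) is one sync_all() call per 64 events. Treat its result as an order-of-magnitude estimate rather than a measurement.

```rust
/// How many times faster `fast` is than `slow` (both in events per second).
pub fn speed_ratio(fast: f64, slow: f64) -> f64 {
    fast / slow
}

/// Average microseconds spent per event at a given throughput.
pub fn micros_per_event(events_per_sec: f64) -> f64 {
    1_000_000.0 / events_per_sec
}

/// Cost of one sync per batch implied by two throughputs, assuming the
/// synced run differs from the unsynced run only by one sync per `batch` events.
pub fn implied_sync_micros(unsynced_eps: f64, synced_eps: f64, batch: u32) -> f64 {
    let extra_per_event = micros_per_event(synced_eps) - micros_per_event(unsynced_eps);
    f64::from(batch) * extra_per_event
}
```

Plugging in the table's numbers, `speed_ratio(56_000.0, 4_800.0)` is about 11.7, and `implied_sync_micros(56_000.0, 4_800.0, 64)` comes out near 12,190 µs, or roughly 12 ms per sync_all() under that assumption.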

Analogy

Think of a food-order notepad. Always locks each order in the safe the moment it's written. EveryN(64) collects 64 orders and then locks them all at once. OsManaged leaves the notepad on the desk and only locks it when you leave for the day. If there's a fire (power cut), only what's in the safe survives.

Caution

OsManaged means "fast, but no durability guarantee." In a VM or container environment where the host is frequently interrupted, this policy is not safe. If you need high throughput without fsync overhead, a more honest choice is to use Always behind a battery-backed RAID controller instead.

Recap

The same durability–throughput trade-off as DB commit durability and group commit surfaces at the file level under the name SyncPolicy.
Always = fsync on every append (safest, slowest); EveryN(n) = group commit (default, balanced); OsManaged = no fsync (fastest, no durability guarantee).
The write path is BufWriter (user buffer) → flush() (OS cache) → fsync() (disk). The policy controls how often that last step fires.
sync_all() fsyncs the registry before the logs, preventing a crash from leaving a log record on disk whose registry entry doesn't exist yet.

Check yourself

① With EveryN(64), how many events can you lose at most in a power cut? Why is that an acceptable trade-off for an audit log?
② Describe the structural similarity between a DB's group commit and EveryN(n) in one sentence.
③ Why does sync_all() fsync the registry first and the logs second?