05 · When data really hits disk: the page cache and fsync

Calling file.write_all(b"hello") does not mean the data has landed on disk. There are buffers on both the Rust side and the OS side, and the data passes through two stages before it ever touches physical storage. If the power dies in between — the data is gone. This chapter is about understanding when data actually makes it to disk and how Quipu-Log controls that.

DB ↔ Filesystem

In a DB, the WAL is fsync'd at COMMIT time to guarantee no data loss. In Quipu-Log, we make that fsync call ourselves — using SyncPolicy to choose between "fsync on every append," "fsync every N appends," and "leave it to the OS."

Two-stage buffering: user space + kernel page cache

A write() call passes through two separate buffers on its way to storage: BufWriter (user space) → page cache (kernel) → disk (physical). flush() drains the first buffer; fsync() pushes the page cache all the way to disk.

Breaking it down:

BufWriter (user-space buffer): when you call write(), data first accumulates in BufWriter's internal buffer (256 KB in Quipu-Log). It only moves to the kernel when the buffer fills up or you explicitly call flush().
Kernel page cache (OS buffer): once flush() hands data to the kernel, the OS doesn't write it to disk immediately. It holds it as "dirty pages" in RAM and writes them out together later (write-back). If the power dies at this point — it's gone.
fsync(): calling fsync() forces the OS to write dirty pages to disk right now. Once this call returns, a power failure won't lose the data. The trade-off: it's slow because it involves a round-trip to the disk.
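
The three stages above can be observed from user space. The following is a minimal sketch using only the Rust standard library; staged_write is a hypothetical helper for illustration, not part of Quipu-Log, and the 256 KB capacity simply mirrors the figure quoted above.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Writes `payload` through a 256 KB BufWriter and reports the file size
/// seen by the OS after each stage: (after write, after flush, after sync).
/// Hypothetical helper for illustration only.
pub fn staged_write(path: &Path, payload: &[u8]) -> io::Result<(u64, u64, u64)> {
    let file = File::create(path)?;
    let mut writer = BufWriter::with_capacity(256 * 1024, file);

    // Stage 1: a payload smaller than the capacity stays in BufWriter's
    // user-space buffer, so the kernel has not seen it yet.
    writer.write_all(payload)?;
    let after_write = fs::metadata(path)?.len();

    // Stage 2: flush() hands the bytes to the kernel page cache.
    writer.flush()?;
    let after_flush = fs::metadata(path)?.len();

    // Stage 3: sync_data() asks the OS to push the dirty pages to the device.
    writer.get_ref().sync_data()?;
    let after_sync = fs::metadata(path)?.len();

    Ok((after_write, after_flush, after_sync))
}
```

For a 5-byte payload this returns (0, 5, 5). The file size cannot distinguish "in the page cache" from "on disk", which is exactly why the third stage is easy to forget: after flush() everything looks finished from user space.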

In Quipu-Log: BufWriter, flush, and sync

Segment uses a BufWriter<File>. append() writes to the BufWriter, flush() gets data to the page cache, and sync() gets it all the way to disk.

crates/quipu-core/src/storage/segment.rs

pub fn flush(&mut self) -> Result<()> {
    self.writer.flush()?;   // BufWriter → page cache (OS memory)
    Ok(())
}

pub fn sync(&mut self) -> Result<()> {
    self.writer.flush()?;
    self.writer.get_ref().sync_data()?;  // page cache → physical disk
    Ok(())
}

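sync_data() is Rust's counterpart to fdatasync(2), and on Linux it calls exactly that. It behaves like fsync(2) but skips metadata that isn't needed to read the data back (timestamps such as access and modification time), which makes it slightly faster; a changed file size is still written. On macOS, Rust's standard library issues fcntl(F_FULLFSYNC) for both sync_data() and sync_all(). For data durability, sync_data() is sufficient.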
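
The choice between the two calls can be made explicit in code. This is a small sketch, assuming a hypothetical SyncDepth enum and sync_file helper that do not exist in Quipu-Log; only File::sync_data() and File::sync_all() are real standard-library calls.

```rust
use std::fs::File;
use std::io;

/// Hypothetical switch between the two std sync calls.
pub enum SyncDepth {
    /// File contents plus whatever metadata is needed to read them back.
    DataOnly,
    /// File contents plus all metadata, timestamps included.
    DataAndMetadata,
}

/// Forces the file's dirty pages to storage at the requested depth.
pub fn sync_file(file: &File, depth: SyncDepth) -> io::Result<()> {
    match depth {
        SyncDepth::DataOnly => file.sync_data(),
        SyncDepth::DataAndMetadata => file.sync_all(),
    }
}
```

An append-only log only needs its bytes and its length to survive, so DataOnly is the natural fit, matching the sync_data() call in Segment::sync() above.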

SyncPolicy: the durability vs. throughput trade-off

fsync-ing after every append is safe but slow. Leaving it to the OS is fast but risks losing recent records if the power dies. Quipu-Log lets you pick your trade-off with SyncPolicy.

crates/quipu-core/src/store.rs

pub enum SyncPolicy {
    Always,          // fsync after every append. Safest, slowest.
    EveryN(u32),     // fsync every N appends. Middle ground.
    OsManaged,       // no explicit fsync. Leave it to the OS. Fastest.
}

Seeing where each option actually applies makes it concrete:

crates/quipu-core/src/store.rs: apply_sync_policy()

match self.cfg.sync_policy {
    SyncPolicy::Always => self.sync_all()?,
    SyncPolicy::EveryN(n) => {
        self.appends_since_sync += 1;
        if self.appends_since_sync >= n {
            self.sync_all()?;   // fsync on every Nth append
        } else {
            self.logs.flush()?; // all others: page cache only
            self.relations.flush()?;
        }
    }
    SyncPolicy::OsManaged => {
        self.logs.flush()?;    // only drain BufWriter, no fsync
    }
}
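
To see how often each policy actually reaches the disk, the dispatch can be modelled without any I/O. This is a sketch under stated assumptions: SyncCounter and simulate are hypothetical names, the enum is redeclared so the block stands alone, and the counter is assumed to reset to zero after each sync (the excerpt above does not show where the real code resets it).

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SyncPolicy {
    Always,
    EveryN(u32),
    OsManaged,
}

/// Hypothetical stand-in for the store: counts calls instead of doing I/O.
#[derive(Debug, Default)]
pub struct SyncCounter {
    pub appends_since_sync: u32,
    pub flush_only: u32, // appends that stopped at the page cache
    pub syncs: u32,      // appends that triggered an fsync
}

impl SyncCounter {
    /// Mirrors the shape of the match above for a single append.
    pub fn on_append(&mut self, policy: SyncPolicy) {
        match policy {
            SyncPolicy::Always => self.sync(),
            SyncPolicy::EveryN(n) => {
                self.appends_since_sync += 1;
                if self.appends_since_sync >= n {
                    self.sync();
                } else {
                    self.flush_only += 1;
                }
            }
            SyncPolicy::OsManaged => self.flush_only += 1,
        }
    }

    fn sync(&mut self) {
        self.syncs += 1;
        // Assumption: the counter restarts after every sync.
        self.appends_since_sync = 0;
    }
}

/// Runs `appends` appends under `policy` and returns the final counters.
pub fn simulate(policy: SyncPolicy, appends: u32) -> SyncCounter {
    let mut counter = SyncCounter::default();
    for _ in 0..appends {
        counter.on_append(policy);
    }
    counter
}
```

Under this model, 200 appends with EveryN(64) produce 3 syncs (at appends 64, 128, and 192) and leave 8 appends waiting in the page cache; Always produces 200 syncs and OsManaged none.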

The trade-off in numbers

The benchmark figures in the README show the difference (Apple M4, NVMe SSD):

SyncPolicy	Durable throughput	Max data loss on power failure
OsManaged	~56,000 events/s	Whatever was in the OS write-back window (tens of ms to seconds)
EveryN(64)	~4,800 events/s	Up to 63 events
Always	~750 events/s (estimated)	0 events (returns only after fsync)

EveryN(64) is the default. For most audit log workloads, losing up to 63 events in a power failure is an acceptable risk, and the throughput is plenty. In strictly regulated environments (those subject to HIPAA, for example), consider Always.
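
The loss column of the table follows directly from the policy. As a minimal sketch (max_unsynced_appends is a hypothetical helper, and the enum is redeclared so the block stands alone), the worst case can be computed like this:

```rust
pub enum SyncPolicy {
    Always,
    EveryN(u32),
    OsManaged,
}

/// Upper bound on appends that have returned to the caller but are not yet
/// fsync'd when the power fails. `None` means the bound is set by the OS
/// write-back timing, not by an event count.
pub fn max_unsynced_appends(policy: &SyncPolicy) -> Option<u32> {
    match policy {
        // Every append returns only after its fsync.
        SyncPolicy::Always => Some(0),
        // The Nth append triggers the fsync, so at most N - 1 are pending.
        SyncPolicy::EveryN(n) => Some(n.saturating_sub(1)),
        // No explicit fsync: the window is measured in time, not events.
        SyncPolicy::OsManaged => None,
    }
}
```

EveryN(64) yields Some(63): the 64th append is the one that triggers the sync, so no more than 63 acknowledged events can be waiting in the page cache at any moment.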

DB ↔ Filesystem

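PostgreSQL's synchronous_commit = on (the default) corresponds to Always: COMMIT returns only after the WAL has been flushed to disk. With synchronous_commit = off, COMMIT returns before the flush and a background WAL writer catches up shortly afterward, which is loosely comparable to OsManaged, except that the loss window is bounded by PostgreSQL rather than by the OS. MySQL InnoDB's innodb_flush_log_at_trx_commit = 2 is similar to OsManaged as well: the log is written to the OS at each commit and flushed to disk about once per second. The choices a DB engine makes internally are choices we make explicitly here, with SyncPolicy.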

Caution

OsManaged is safe against application crashes. Once flush() has handed the data to the kernel, the page cache belongs to the OS, so the data survives even if the process dies. Data is at risk only when the whole system goes down (power cut, OS crash, forced reset) before write-back has run. On cloud VMs the practical risk may be lower, but that depends on the provider's storage stack and is not guaranteed.
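
The application-crash half of this can be approximated in a single process. In the sketch below, append_then_crash is a hypothetical helper: it leaks the writer with std::mem::forget so that BufWriter's Drop never runs, which stands in for a process that is killed before it can clean up. A real power loss cannot be simulated this way.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Appends `payload`, optionally flushes, then "crashes" by leaking the
/// writer. Returns whatever the OS reports as the file's contents afterward.
/// Hypothetical helper for illustration only.
pub fn append_then_crash(path: &Path, payload: &[u8], flush_first: bool) -> io::Result<Vec<u8>> {
    let mut writer = BufWriter::with_capacity(256 * 1024, File::create(path)?);
    writer.write_all(payload)?;
    if flush_first {
        // Hand the bytes to the kernel; from here on the OS owns them.
        writer.flush()?;
    }
    // Skip Drop (and its implicit flush), as an abrupt process death would.
    std::mem::forget(writer);
    fs::read(path)
}
```

With flush_first set to false the file reads back empty, because the bytes died with the user-space buffer. With flush_first set to true the payload reads back intact even though nothing was fsync'd. That is the guarantee OsManaged relies on, and it holds only as long as the OS itself keeps running.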

Check yourself

① Explain the difference between flush() and fsync() in terms of "how far does the data travel?"
② If you're using SyncPolicy::EveryN(64) and the power dies, how many events can you lose at most?
③ Why does OsManaged behave differently for an "application crash" versus a "system power loss"?