"What format is this file, and what lives where?" In a DB, the system catalog (pg_catalog, information_schema) manages the location of tables, columns, and indexes. A storage engine built directly on files faces the same question. Quipu-Log's directory layout acts as the catalog, and the magic bytes plus FORMAT_VERSION handle format identification. The final story of this chapter is why the v1→v2 break happened.
The storage layout uses directory names as "table names," and the magic (ALOG) plus FORMAT_VERSION (2) in each segment header identify "what format this file is." A mismatch is rejected immediately — no mis-parsing.
What you already know: DB catalog and on-disk format version
Install PostgreSQL and initialize a cluster and you'll find a file called PG_VERSION containing the major version number ("16"). If a server binary from a different major version tries to open that directory, it fails immediately. Individual data pages also carry a layout version number in their header, and WAL pages carry a magic value, so the server can confirm it is reading Postgres-format data and not a file from a different engine.
The catalog (pg_class, pg_attribute) stores metadata about "what tables exist and what type and position each column has." You have to read the catalog before you can parse any data file.
DB catalog: dedicated system tables manage schemas, indexes, and file locations.
Quipu-Log: the directory name itself is the "table name" (logs/, registry/user/, and so on). There is no separate catalog file; the OS filesystem directory structure is the catalog. Type schemas (what fields each entity has) are the exception: they are stored as an append-only log in the meta/ table and replayed on restart.
Root directory layout
The comment in AuditStore::open() declares the layout.
crates/quipu-core/src/store.rs
/// root/
/// meta/ type schemas + custom column registry (replayed on open)
/// logs/ AuditLog rows
/// relations/ log -> target-entity-version relations
/// registry/<t>/ one versioned registry table per entity/actor type
/// access/ (optional) meta-audit table
/// checkpoints/ signed integrity checkpoints
/// LOCK advisory OS lock file
Open a store's root directory and you'll see exactly this layout.
A key property of this layout is that it is self-describing. Open a directory, run ls registry/, and you immediately know what entity types are registered. Replay meta/ and you have the schema for each type. No separate catalog file needed — the directory structure itself is the catalog.
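The "run ls registry/" idea can be sketched in Rust with nothing but the standard library. This is a minimal illustration, not the crate's own code: `list_registry_types` is a hypothetical helper, and the only thing it assumes from the layout above is that each entity type has its own subdirectory under `registry/`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Lists the entity types registered in a store by reading the names of the
/// subdirectories under `<root>/registry/`.
/// Hypothetical helper for illustration; not part of the crate's API.
pub fn list_registry_types(root: &Path) -> io::Result<Vec<String>> {
    let mut types = Vec::new();
    for entry in fs::read_dir(root.join("registry"))? {
        let entry = entry?;
        // Only directories count as tables; stray files are ignored.
        if entry.file_type()?.is_dir() {
            types.push(entry.file_name().to_string_lossy().into_owned());
        }
    }
    // read_dir order is platform-dependent, so sort for a stable result.
    types.sort();
    Ok(types)
}
```

The function needs no catalog file and no schema lookup: the directory names are the answer, which is the sense in which the layout is self-describing.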
magic + FORMAT_VERSION: file identification
Every segment file begins with a 13-byte header.
crates/quipu-core/src/storage/segment.rs
/// Segment file header: magic + format version + base index.
pub const MAGIC: [u8; 4] = *b"ALOG"; // magic: "this file is a Quipu-Log segment"
pub const FORMAT_VERSION: u8 = 2; // current format version
pub const SEGMENT_HEADER: usize = MAGIC.len() + 1 + 8;
// magic (4) + version (1) + base_index (8) = 13 bytes
When parsing the header, if the magic is not "ALOG" or the FORMAT_VERSION is not 2, an error is returned immediately. That is the mechanism that guarantees "no mis-parsing — rejected at once."
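The check described above could look roughly like the following. This is a sketch under stated assumptions rather than the actual parser: `HeaderError` and `parse_header` are hypothetical names, and the byte order of base_index (little-endian here) is an assumption the source does not confirm.

```rust
pub const MAGIC: [u8; 4] = *b"ALOG";
pub const FORMAT_VERSION: u8 = 2;
pub const SEGMENT_HEADER: usize = MAGIC.len() + 1 + 8; // 13 bytes

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    TooShort,
    BadMagic,
    UnsupportedVersion(u8),
}

/// Validates a segment header and returns its base_index.
/// Assumption: base_index is stored little-endian; the source does not say.
pub fn parse_header(bytes: &[u8]) -> Result<u64, HeaderError> {
    if bytes.len() < SEGMENT_HEADER {
        return Err(HeaderError::TooShort);
    }
    // Reject foreign files before looking at anything else.
    if bytes[..4] != MAGIC[..] {
        return Err(HeaderError::BadMagic);
    }
    // Reject other format versions instead of guessing at their layout.
    let version = bytes[4];
    if version != FORMAT_VERSION {
        return Err(HeaderError::UnsupportedVersion(version));
    }
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&bytes[5..SEGMENT_HEADER]);
    Ok(u64::from_le_bytes(raw))
}
```

Both checks run before any frame is read, so a v1 file or a file from another program never reaches the record parser.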
Imagine a letter envelope stamped with "Domestic Mail" and the expected postcode format. Before opening the envelope, the postal worker checks "is this domestic or international?" and returns it to sender if the format doesn't match. Magic and version play exactly that role — "can this file be opened?" is checked before reading the contents.
base_index: the absolute coordinate baked into the segment header
The base_index (8 bytes) in the header is the position of this segment's first record in the overall Merkle tree — which leaf number it corresponds to. The critical point is that this value is written once when the segment is created and never changes afterward.
Why bake it into the header? When retention deletes leading segments, you'd otherwise have to track "what overall position this segment's records start at" with a separate counter. Persisting that counter in a separate file means a crash between "counter update" and "segment deletion" leaves an inconsistency. If the absolute index is in the segment header itself, every segment always knows "my records start at spine leaf base_index + offset" — no matter which preceding segments have been deleted, the spine index for surviving records is always accurate as base_index + offset.
crates/quipu-core/src/storage/segment.rs (explanatory comment)
// `base_index` is the spine leaf index of this segment's first record,
// the count of all records appended to the table before this segment opened.
// Storing it per-segment (not as a mutable counter) is what makes the mapping
// crash-safe across a partial purge.
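The base_index + offset mapping is small enough to sketch directly. The types below are hypothetical stand-ins (the real segment type is not shown in the source); the point is only that each surviving segment can answer "which spine leaf is this record?" from its own header, with no shared counter.

```rust
/// Hypothetical in-memory view of a segment: the header's base_index and the
/// number of records the segment holds.
pub struct SegmentInfo {
    pub base_index: u64,
    pub records: u64,
}

/// Spine leaf index of the record at `offset` within `segment`,
/// or None if the offset is past the end of the segment.
pub fn spine_index(segment: &SegmentInfo, offset: u64) -> Option<u64> {
    if offset < segment.records {
        Some(segment.base_index + offset)
    } else {
        None
    }
}

/// Finds which surviving segment holds a given spine leaf.
/// Returns (position in `segments`, offset within that segment), or None if
/// the leaf belonged to a segment that retention has already deleted.
pub fn locate(segments: &[SegmentInfo], leaf: u64) -> Option<(usize, u64)> {
    segments.iter().enumerate().find_map(|(i, s)| {
        let end = s.base_index + s.records;
        (leaf >= s.base_index && leaf < end).then(|| (i, leaf - s.base_index))
    })
}
```

For example, if a segment covering leaves 0..100 has been purged and the survivors start at base_index 100 and 150, the record at offset 3 of the first survivor is still leaf 103, exactly as it was before the purge.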
Format evolution: v0 → v1 → v2 break
Quipu-Log's segment format has gone through three versions.
| Version | Header layout | Change | Compatibility |
|---|---|---|---|
| v0 | magic 4 + ver 1 + seed 8 | initial format | — |
| v1 | magic 4 + ver 1 + seed 8 | added 32-byte chain-hash per frame | incompatible with v0 |
| v2 | magic 4 + ver 1 + base_index 8 | removed chain-hash, moved tamper-evidence to Merkle spine; seed slot replaced by base_index | incompatible with v1 |
In v1, each frame carried a 32-byte "chain hash" incorporating the previous record's hash. Intuitively tamper-evident — but it had a problem: when retention deleted leading records, the chain broke and verification became impossible. Tamper-evidence that must survive retention requires a separate structure.
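The limitation can be reproduced in a few lines. This is a toy model, not the v1 format: the `Frame` type and function names are hypothetical, and `DefaultHasher` is used only as a dependency-free stand-in for the real 32-byte cryptographic hash (it is not tamper-resistant).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for a cryptographic hash over (previous hash, payload).
/// DefaultHasher is NOT tamper-resistant; it only keeps the sketch std-only.
fn link(prev: u64, payload: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    payload.hash(&mut h);
    h.finish()
}

/// A v1-style frame: the payload plus a chain hash covering every earlier frame.
pub struct Frame {
    pub payload: Vec<u8>,
    pub chain: u64,
}

/// Builds a chain starting from the header seed.
pub fn build_chain(seed: u64, payloads: &[&[u8]]) -> Vec<Frame> {
    let mut prev = seed;
    payloads
        .iter()
        .map(|p| {
            prev = link(prev, p);
            Frame { payload: p.to_vec(), chain: prev }
        })
        .collect()
}

/// Recomputes the chain from the seed; true only if every frame matches.
pub fn verify_chain(seed: u64, frames: &[Frame]) -> bool {
    let mut prev = seed;
    frames.iter().all(|f| {
        prev = link(prev, &f.payload);
        prev == f.chain
    })
}
```

A full chain verifies against its seed, but once the leading frames are removed the first surviving frame's hash can no longer be recomputed from the seed, so `verify_chain` fails even though no surviving record was altered.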
v2 drops per-frame chain-hashes entirely and instead accumulates leaf hashes in a separate Merkle spine file (merkle.spine) that retention never touches. The segment file is responsible only for "fast sequential reads + CRC accidental-corruption detection"; integrity is the spine's job. This separation means "even if a segment is deleted, the root hash and inclusion proofs remain valid."
crates/quipu-core/src/storage/segment.rs (explanatory comment)
// Format v2 dropped the per-record hash chain entirely
// (no header seed, no per-frame chain hash).
// Tamper-evidence now lives in the retention-independent Merkle spine;
// a segment carries only payloads plus a CRC for accidental-corruption detection.
Opening a v1 store with a v2 binary is rejected immediately — there is no migration tool, so a re-export is required. This is intentional. At the pre-1.0 stage, while real-world data is limited, accepting a format break costs little. Once broader adoption happens, a migration path will be provided.
Sidecars: adding metadata without a format break
There is a way to add metadata without touching the segment header — a sidecar file. seg-0000000000.meta holds the min/max timestamps, record count, and base_index for its segment.
crates/quipu-core/src/storage/table.rs
struct SegmentMeta {
    min_timestamp: u64, // minimum timestamp of records in the segment
    max_timestamp: u64, // maximum timestamp (used for retention + pruning)
    records: u64,
    base_index: u64,
}
The essential property of a sidecar is that it is hint-only. If it is missing or corrupted, it can be rebuilt by scanning its segment from the start. Merkle verification never consults the sidecar, so even if a sidecar is tampered with, the integrity evidence is unaffected. This is what made it possible to add new metadata without bumping FORMAT_VERSION and breaking existing segments.
Extending the header requires bumping FORMAT_VERSION, which breaks compatibility with existing segments. A sidecar is "optional" — the code never assumes it is present. The principle: information that can be recomputed (like min/max timestamps) goes in the sidecar; information that cannot be recomputed (like base_index) goes in the header. This rule is how new features can be added without format breaks.
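The hint-only rule can be sketched as a load path that never trusts the sidecar more than the segment itself. Everything beyond the `SegmentMeta` fields is an assumption for illustration: `rebuild_meta` and `load_meta` are hypothetical names, record timestamps are passed in as a plain slice instead of being read from frames, and treating a base_index mismatch as "corrupted" is one possible policy, not necessarily the crate's.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SegmentMeta {
    pub min_timestamp: u64,
    pub max_timestamp: u64,
    pub records: u64,
    pub base_index: u64,
}

/// Rebuilds the metadata by scanning every record timestamp in the segment.
/// `header_base_index` comes from the segment header, never from the sidecar.
/// Returns None for an empty segment, which has no min or max.
pub fn rebuild_meta(header_base_index: u64, timestamps: &[u64]) -> Option<SegmentMeta> {
    let min = *timestamps.iter().min()?;
    let max = *timestamps.iter().max()?;
    Some(SegmentMeta {
        min_timestamp: min,
        max_timestamp: max,
        records: timestamps.len() as u64,
        base_index: header_base_index,
    })
}

/// Uses the sidecar when it is present and agrees with the header;
/// otherwise falls back to a full scan of the segment.
pub fn load_meta(
    sidecar: Option<SegmentMeta>,
    header_base_index: u64,
    timestamps: &[u64],
) -> Option<SegmentMeta> {
    match sidecar {
        Some(meta) if meta.base_index == header_base_index => Some(meta),
        _ => rebuild_meta(header_base_index, timestamps),
    }
}
```

Because the fallback always exists, deleting every sidecar costs only a rescan; nothing that integrity verification depends on is lost.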
Recap
- Quipu-Log's directory layout acts as the DB catalog. meta/: schema event log, logs/: log rows, registry/<type>/: registries, relations/: log-entity mappings.
- The magic (ALOG) + FORMAT_VERSION (2) in each segment header handles file identification. A mismatch is rejected immediately.
- base_index is an absolute coordinate written once in the header at creation time. Even after leading segments are deleted by retention, surviving records' spine indexes are always accurate.
- v1 → v2 break: per-frame chain-hash removed; tamper-evidence responsibility transferred to the retention-safe, separate merkle.spine.
- Sidecars (.meta) are a pattern for adding recomputable metadata without bumping the format version. They are hint-only, so loss or tampering has no effect on integrity.
① What do the DB catalog (pg_class, etc.) and Quipu-Log's layout have in common? In Quipu-Log, how do you find out what tables exist?
② Explain why base_index belongs in the segment header, by describing the problem with a "separate counter file" approach.
③ Explain why the v1→v2 format break was necessary from the perspective of "retention + integrity proof." What was the specific limitation of per-frame chain-hashes?