20 · The Merkle history tree: committing the log to 32 bytes

How can you prove that millions of audit records haven't changed by a single character? Re-reading the entire log every time is too slow, and attaching a hash to each record individually leaves them open to piecemeal forgery. A Merkle tree solves both problems at once — it "commits" millions of records into a single 32-byte root, and changing even one byte flips that root.

In one sentence

A Merkle tree is a binary tree of recursively stacked hashes. Leaves hold record hashes, interior nodes hold hashes of their two children, and a single 32-byte root "locks" the entire log.

Intuition first: hashing the hashes

Recall the key property of a hash function (Ch. 19, SHA-256) — change even 1 bit of input and the output changes completely. What if we apply that property recursively?

Compute the hash of each of 4 records → H(R0), H(R1), H(R2), H(R3)
Pair adjacent hashes and hash again → H(H(R0)||H(R1)), H(H(R2)||H(R3))
Hash those two together → a single root

Now, flip 1 bit in R0 and H(R0) changes, its parent node changes, and the root changes. Comparing that one root tells you whether anything, anywhere, has changed.

n=4 Merkle tree. Leaves use SHA-256(0x00 || record); interior nodes use SHA-256(0x01 || left child || right child). The different prefixes make it impossible to confuse a leaf with an interior node (RFC 6962 §2.1).

RFC 6962 rules: prefixes matter

Quipu-Log follows the hashing rules from Google's Certificate Transparency standard, RFC 6962, exactly. The key rule is using different prefixes for leaves and interior nodes:

RFC 6962 §2.1 (excerpt from quipu-log/crates/quipu-core/src/merkle.rs docs)MTH({})        = SHA-256()               // empty tree
leaf hash      = SHA-256(0x00 || record)  // leaf: raw record bytes
interior node  = SHA-256(0x01 || left || right) // interior node

Why do we need prefixes? Without them, using SHA-256(H0 || H1) would make it impossible to distinguish "the hash of the concatenation of two leaves H0 and H1 treated as a single leaf" from "H0||H1 hashed as a new interior node." This ambiguity enables what's called a second-preimage confusion attack. Prepending 0x00 to leaves and 0x01 to interior nodes makes these cases mathematically impossible to collide.

Following the standard has another benefit: anyone familiar with RFC 6962 can write their own external verifier. This isn't Quipu-Log's own proprietary rule, so legal audits or third-party verification have a published reference document.

Quipu-Log's implementation: leaf_hash / node_hash / mth

Here is the code that directly translates the RFC's recursive definition:

crates/quipu-core/src/merkle.rspub fn leaf_hash(record: &[u8]) -> Hash {
    let mut h = Sha256::new();
    h.update([LEAF_PREFIX]);  // 0x00
    h.update(record);
    h.finalize().into()
}

pub fn node_hash(left: &Hash, right: &Hash) -> Hash {
    let mut h = Sha256::new();
    h.update([NODE_PREFIX]);  // 0x01
    h.update(left);
    h.update(right);
    h.finalize().into()
}

/// Recursive definition: root of D[n]
pub fn mth(leaves: &[Hash]) -> Hash {
    match leaves.len() {
        0 => empty_root(),     // SHA-256()
        1 => leaves[0],          // single leaf, return as-is
        n => {
            let k = split(n);  // largest power of 2 less than n (RFC split point)
            node_hash(&mth(&leaves[..k]), &mth(&leaves[k..]))
        }
    }
}

split(n) is worth noting. For n=4, k=2; for n=5, k=4; for n=6, k=4. RFC 6962 always uses the largest power of 2 less than n as the size of the left perfect subtree. This ensures the left subtree is always a perfect binary tree, which makes incremental updates (see Roots below) efficient.

Incremental root tracking: Roots

Recomputing the whole tree from scratch every time a new record arrives is O(n) — too slow. The Roots struct maintains only a list of "peaks" of perfect binary subtrees, making appends amortized O(1):

crates/quipu-core/src/merkle.rs — Roots::pushpub fn push(&mut self, leaf: Hash) {
    let mut carry = (0u32, leaf);
    while let Some(&(h, _)) = self.peaks.last() {
        if h != carry.0 { break; }
        let (_, left) = self.peaks.pop().unwrap();
        carry = (carry.0 + 1, node_hash(&left, &carry.1));
    }
    self.peaks.push(carry);
    self.size += 1;
}

When two peaks of the same height are found, they are merged into one peak one level up. This is exactly the carry propagation of binary addition. Even with a million records, the peak list has at most 20 entries (log₂(1,000,000) ≈ 20).

Analogy

Picture a Merkle tree as a stack of receipt bundles. First you bundle receipts two-by-two into "subtotal envelopes," then bundle pairs of those into "section envelopes," and finally the one remaining "grand-total envelope" is the root. Change any receipt and every envelope up the chain changes — and ultimately the grand-total envelope changes. An auditor only has to compare that one envelope.

The Merkle spine file: provable even after retention

Segment files are deleted according to the retention policy (Ch. 17, deletion and preservation). When records are deleted, do their leaf hashes disappear too?

That's why there is a separate spine file called MerkleLog (merkle.spine). It persists only the 32-byte leaf hash of each record, independently of retention. Even after a record's payload is gone, its hash remains:

crates/quipu-core/src/merkle_log.rs — file format// [ "QMKL" magic ][ u8 version ][ leaf_0 (32B) ][ leaf_1 (32B) ] ...
// The spine is not subject to retention — hashes carry no payload,
// so keeping them forever leaks no secrets, and inclusion/consistency
// proofs remain generatable even after a purge.

Security note

A hash cannot be reversed to recover the original (one-wayness), so keeping leaf hashes indefinitely does not leak confidential information. Keeping the records themselves indefinitely, on the other hand, could violate regulations like GDPR. Retaining only the hashes achieves the ideal balance: "provable, but not readable."

Recap

Merkle tree: leaves (record hashes) → interior nodes (hash of two child hashes) → one root (32 bytes) that commits the entire log.
RFC 6962 rules: leaves use SHA-256(0x00 || record), interior nodes use SHA-256(0x01 || left || right). The prefixes block confusion attacks.
Roots keeps only peaks, making appends amortized O(1) with O(log n) space.
The merkle.spine file preserves leaf hashes separately → proofs remain possible after retention.

Check yourself

① In a Merkle tree with n=5, split(5)=4. The left subtree has 4 leaves and the right has 1. What is the root of the right subtree?
② What attack becomes possible if you remove the leaf prefix 0x00 and interior node prefix 0x01? Explain.
③ Why can the Roots struct compute the current root in O(log n) space and time while maintaining only a list of peaks?