15 · Indexing: in-memory index + on-disk tokens

When an audit log tops ten million entries, finding "all of Alice's actions" without an index means reading the whole thing. Just as a relational DB builds a secondary B-tree index, Quipu-Log builds one too — only instead of writing a separate B-tree file to disk, it holds a HashMap in memory and rebuilds it by replaying the log on restart. This chapter looks at why that choice makes sense, and at how a disk-persisted "token" makes searching possible even for protected fields.

In one sentence

Quipu-Log's index is two layers: an in-memory HashMap (rebuilt by replaying the log on every restart) and disk-persisted blind tokens (for searching protected fields). Cryptography is Ch. 26's job; this chapter focuses on the data structures and the persist-and-rebuild mechanism.

What you already know: DB secondary indexes

In a relational DB, SELECT * FROM users WHERE email = 'alice@example.com' does a full table scan unless there is an index on email. Add one — CREATE INDEX ON users(email) — and the DB maintains a separate B-tree structure that maps email values to row addresses (heap tuple pointers), bringing lookups down to O(log n).

That B-tree index lives on disk in its own file. Every time a new row is inserted the index is updated alongside it, and it survives a DB restart just as the tables do.

DB ↔ Filesystem

In a DB, a secondary index exists as a separate B-tree file; every write atomically updates two places (table + index). In Quipu-Log, the index exists only as an in-memory HashMap and is rebuilt by scanning the registry log from the beginning (replay) on restart. Because there is no separate index file, there is no atomicity problem to solve — and because registry records are far fewer than log records, the rebuild cost is negligible.

The registry: the table that holds entity metadata

A log row records only "who (actor) did what." "Who Alice is" — her name, role, department, and other metadata — lives separately in the registry. The registry is also an append-only log, but what it stores is a version history of each entity.

The central data structure of the registry is RegistryIndex. That is the "index."

crates/quipu-core/src/registry.rsstruct RegistryIndex {
    schema: TypeSchema,
    /// entity_id -> list of version uids (oldest first)
    versions: HashMap<String, Vec<Uid>>,
    /// uid -> the actual record
    by_uid: HashMap<Uid, RegistryRecord>,
    /// index field -> search key -> set of entity_ids (primary lookup)
    index: HashMap<String, HashMap<Vec<u8>, HashSet<String>>>,
    /// blind token index: field -> token digest -> set of entity_ids
    token_index: HashMap<String, HashMap<String, HashSet<String>>>,
    // ...in-memory plaintext cache (opt-in)
}

Two layers that actually make lookups fast stand out. index is the primary lookup — it searches by plaintext value (or SHA-256/HMAC digest). token_index is the blind token index for prefix and substring searches.

When and how the index gets populated

RegistryIndex is filled through two paths.

1. At write time — absorb(): right after a new entity version is written to the on-disk registry log, it is reflected in the in-memory index.

crates/quipu-core/src/registry.rsfn absorb(&mut self, rec: RegistryRecord) {
    for (name, stored) in &rec.fields {
        if let Some(key) = index_key(stored) {
            self.index.entry(name.clone()).or_default()
                     .entry(key).or_default()
                     .insert(rec.entity_id.clone());
        }
    }
    for (field, tokens) in &rec.tokens {
        let posting = self.token_index.entry(field.clone()).or_default();
        for d in &tokens.digests {
            posting.entry(d.clone()).or_default()
                   .insert(rec.entity_id.clone());
        }
    }
    // versions / by_uid update...
}

2. At restart — rebuild: when the process starts fresh, TypeRegistry::open() scans the entire registry log and calls absorb() on every record.

crates/quipu-core/src/registry.rspub fn open(dir: &Path, schema: TypeSchema, ...) -> Result<Self> {
    let mut table = Table::open(dir, max_segment_bytes)?;
    let mut idx = RegistryIndex::new(schema, cache_plaintexts);
    let records: Vec<RegistryRecord> = table.scan()?.collect::<Result<Vec<_>>>()?;
    for rec in records { idx.absorb(rec); }
    Ok(Self { table, idx, fingerprints: HashMap::new() })
}

Registry records are entity versions, so a thousand users at an average of three versions each produces about 3,000 records. Scanning that once costs almost nothing — which is why "rebuild from scratch on every restart" beats "manage a separate index file" in terms of complexity.

The registry log is append-only, and the in-memory RegistryIndex is rebuilt by replaying the log on every restart. index enables plaintext lookups; token_index uses token digests to enable prefix and substring searches.

What persists to disk: FieldTokens

The index itself lives in memory. But protected fields — fields hashed with SHA-256, HMAC, or similar — present a special problem. Hashing is one-way: after a restart there is no way to recover the original value. Without the original, you cannot recompute the token digests needed for prefix and substring searches.

The solution is to compute the tokens at write time and persist them inside the record itself. That is exactly what the RegistryRecord.tokens field is for.

crates/quipu-core/src/registry.rspub struct RegistryRecord {
    pub uid: Uid,
    pub entity_id: String,
    pub fields: BTreeMap<String, StoredValue>,  // protected values
    /// Blind index token digests, computed and persisted at write time.
    /// Enables searching protected fields (whose plaintexts are not on disk)
    /// even after a restart.
    pub tokens: BTreeMap<String, FieldTokens>,
}

pub struct FieldTokens {
    pub key_version: u32,  // which HMAC key produced these digests
    pub digests: Vec<String>,   // e.g. ["digest(ali)", "digest(lic)", "digest(ice)"]
}

For example: if a user's name "Alice" is protected with HMAC-SHA256 and the schema declares a trigram (n=3) index, then at write time the digests of "ali", "lic", and "ice" are computed and stored with the record. When absorb() reads that record after a restart, it pulls the digests from tokens and populates token_index — no plaintext needed.

Why this design?

The reason not to persist tokens in a separate index file is atomicity. Keeping "record file update + index file update" atomic requires additional locking or another WAL. Embedding the tokens inside the record means "append the record and the index data comes along for free" — no separate atomicity mechanism needed, and the rebuild is always consistent.

Controlling index behavior through the schema: FieldIndex

Which fields get indexed, and in what form, is decided by the schema declaration. The FieldIndex enum defines three search shapes.

crates/quipu-core/src/schema.rspub enum FieldIndex {
    None,                // no index (default)
    Exact,               // one token: the lowercased full value (case-insensitive exact match)
    Prefix(usize),     // prefix tokens of length 1..=n (Prefix(4) → "a","al","ali","alic")
    Ngram(usize),      // sliding-window tokens of width n (Ngram(3) → "ali","lic","ice")
}

Prefix(n) serves prefix searches up to n characters purely from the index, with no scan. Ngram(n) narrows substring searches to a candidate set for queries of n or more characters. In both cases the token digests are persisted with the record and remain usable after a restart.

Caution

A FieldIndex declaration is permanent. The tokens baked into existing records match the shape declared at the time they were written. If you later change Prefix(4) to Prefix(8), past records will not respond to 5–8 character prefix searches. The code enforces this: define_type() returns an error if a field's search setting changes.

Recap

Unlike a DB's B-tree secondary index, Quipu-Log uses an in-memory HashMap as its index, rebuilt on every restart by replaying the registry log.
Token digests for prefix and substring searches on protected fields are persisted inside the record at write time (RegistryRecord.tokens), so they can be rebuilt without the original plaintext.
The FieldIndex::Exact / Prefix(n) / Ngram(n) schema declaration determines what tokens are generated at write time.
The cryptographic details of blind indexes — how token digests are generated, information-leakage analysis — are covered in Ch. 26 Blind Indexes.

Check yourself

① A DB persists its index to a separate file; why does Quipu-Log rebuild the index from scratch on every restart instead? What precondition makes this practical?
② If you declare FieldIndex::Prefix(4) on a name field protected with SHA-256, what happens at write time? Why can you still search for "Alic" after a restart?
③ Explain why a FieldIndex cannot be changed later, using "what data does a past record actually contain?" as your starting point.