The Quipu-Log Book
Part 6 · Confidentiality: searching while keeping secrets (Security II)

26 · Blind indexes: searching without plaintext

You've stored an encrypted SSN. Now you need to find all records belonging to a patient with a specific SSN. With no plaintext on disk, a conventional search is impossible. But decrypting every record to compare is slow — and the moment you do, plaintext surfaces in server memory, which defeats the point. The answer to this dilemma is the blind index. Let's look at how it makes searching possible without leaving plaintext on disk.

In one sentence

A blind index derives search tokens from the plaintext and converts them to digests at write time. A server that never sees the plaintext can apply the same computation to a query term and check for a match.

What databases do — the encryption-and-search dilemma

In a relational DB, encrypting a column breaks WHERE ssn = '123-45-6789'. The DB needs to compare stored values against the query, but either the encrypted values change on every write (non-deterministic encryption) or decryption is required before comparison — which forces the DB server to see plaintext, breaking insider-threat protection.

Some DB extensions (pgcrypto, Always Encrypted) use deterministic encryption to allow indexing: same input → same ciphertext. But deterministic means identical values produce identical ciphertexts, which opens the door to frequency analysis. Quipu-Log's blind index was designed with that attack surface in mind too.

DB ↔ Filesystem

In a DB, column encryption and a search index are hard to have at the same time — it's a choice between deterministic encryption (= frequency leakage) and non-deterministic encryption (= no search). Quipu-Log's blind index tokenizes the plaintext and stores digests of those tokens, allowing partial-match search while minimizing the surface area for frequency analysis.

The idea behind blind indexes

The core insight is simple. Comparisons at search time are done between digests of tokens, not original plaintexts. Apply the same transformation at write time and at search time, and you can confirm a match without ever having the plaintext.

Write path (upsert) plaintext "Alice Kim" normalize "alice kim" tokenize "ali","lic","ice"... HMAC digest "a3f9…","2b7c…"... store tokens RegistryRecord.tokens disk no plaintext ✓ Search path (query) query term "Alice" normalize "alice" prefix tokens "a","al","ali","alic","alice" HMAC digest same key, same computation posting list intersect → candidates compare digests same normalize → tokenize → HMAC applied at write time and query time = match without plaintext domain separation: "idx:" + field name + NUL + token → never collides with body digests
On write: plaintext → normalize → tokenize → HMAC digest stored. On search: apply the same pipeline to the query term, then compare digests. Plaintext never touches disk.

Three tokenization modes — Exact / Prefix(n) / Ngram(n)

The type of token you produce determines what kinds of search you can support. Quipu-Log's FieldIndex offers three.

crates/quipu-core/src/schema.rs — FieldIndexpub enum FieldIndex {
    None,
    Exact,       // lowercased whole value as one token → case-insensitive exact match
    Prefix(usize), // all prefixes of length 1..=n → prefix search up to n chars
    Ngram(usize),  // sliding window of n chars → substring search (n=3 is typical)
}

The tokenization logic works on char boundaries, so multi-byte characters like Korean and Japanese are handled correctly.

crates/quipu-core/src/tokens.rs — value_tokenspub(crate) fn value_tokens(text: &str, idx: FieldIndex) -> Vec<String> {
    let norm = normalize(text); // to_lowercase()
    let chars: Vec<char> = norm.chars().collect(); // char-based — multi-byte safe
    match idx {
        FieldIndex::Exact => { out.insert(norm); }
        FieldIndex::Prefix(n) => {
            for len in 1..=n.min(chars.len()) { out.insert(chars[..len].iter().collect()); }
        }
        FieldIndex::Ngram(n) => {
            for w in chars.windows(n) { out.insert(w.iter().collect()); }
        }
        _ => {}
    }
}

For example, indexing "Alice Kim" with Prefix(3) produces three tokens: ["a", "al", "ali"]. Indexing "김철수" with Ngram(2) produces ["김철", "철수"].

Domain separation — keeping index tokens from colliding with body digests

When HMAC-ing an index token, the input is prefixed with "idx:" + field name + NUL. This is domain separation.

Why this design?

Without a domain prefix in the HMAC input, the HMAC of the value "123-45-6789" for the "ssn" field and the HMAC of the index token "123-45-6789" for that same field would be identical. An attacker could use an index token to impersonate a field body digest, or vice versa. Domain separation guarantees that the two kinds of digests can never collide.

crates/quipu-core/src/crypto.rs — index_token_digest_withlet mut data = Vec::with_capacity(4 + field.len() + 1 + token.len());
data.extend_from_slice(b"idx:");           // domain separation
data.extend_from_slice(field.as_bytes());   // field name
data.push(0);                               // NUL separator
data.extend_from_slice(token.as_bytes());   // the actual token

Comparing the three index modes

FieldIndex Tokens produced (e.g. "carol") Search supported False positives?
Exact "carol" — one token case-insensitive exact match (ExactCi) none
Prefix(3) "c", "ca", "car" prefix search (up to 3 chars) none (token = exact prefix)
Ngram(3) "car", "aro", "rol" substring search possible on Hmac/Sha256 fields
Caution

Using an n-gram index on a Sha256 or Hmac field can produce false positives. Consider indexing "abcd" and "cdef" with 3-grams — both share the token "bcd". Searching for "abcdef" returns both as candidates. But since there's no plaintext, there's no way to confirm that "abcd" doesn't actually contain "abcdef". With an Rsa field and n-gram, you can decrypt and verify, so false positives are eliminated.

Where are blind indexes stored?

Token digests are persisted in RegistryRecord.tokens. This is why search still works after a server restart — because the plaintext is gone, tokens can't be regenerated on startup, so storing them at write time is essential.

crates/quipu-core/src/registry.rs — RegistryRecordpub struct RegistryRecord {
    // ...other fields...
    pub tokens: BTreeMap<String, FieldTokens>,  // field name → list of token digests
}

pub struct FieldTokens {
    pub key_version: u32,    // which HMAC key version produced these
    pub digests: Vec<String>, // per-token HMAC digests
}

At search time, an in-memory posting list (token_index) narrows the candidates quickly. For an n-gram search, the candidates are entity IDs whose posting list entries contain all of the query's n-gram digests. The posting list is rebuilt from the registry records on server startup.

Offline attacks are impossible without the key

When a field's protection is Hmac or Rsa, index token digests are also produced with the HMAC key. So even if an attacker takes the token digests off disk, they can't reverse-engineer which token produced a given digest without the HMAC key.

Security note

The information leakage that a blind index adds is structural — it reveals things like "these two records have the same first three characters of their name field." What those characters actually are remains hidden without the key. Whether that structural leakage is acceptable is a design-time decision — declaring a FieldIndex is an explicit contract that says "this field will expose this much structure in exchange for searchability."

Recap

  • A blind index derives plaintext → tokens → digests at write time and stores them. Plaintext never reaches disk.
  • At search time, the same transformation is applied to the query term and digests are compared — match confirmed without plaintext.
  • Exact for case-insensitive exact match, Prefix(n) for prefix search, Ngram(n) for substring search.
  • Hmac/Sha256 + Ngram can produce false positives — Rsa + Ngram can confirm via decryption.
  • Index tokens are also protected by domain-separated HMAC — offline attacks without the key are impossible.
Check yourself

① What tokens are produced when you index "alice" with Prefix(3)? How many token digests does a search for "Ali" look up?
② Explain with a concrete example why false positives can occur with a Ngram(3) index.
③ When you declare a blind index on a field, what structural information are you choosing to expose? Think it through.