10 · Serialization: turning structs into bytes

If frames handle boundaries and integrity, what goes into the payload they carry? The job of turning a Rust struct into a byte array — serialization — is what this chapter is about. We'll cover which format to use and why not JSON, and then explain why a separate "canonical byte representation for hashing and index keys" is necessary.

In one sentence

Struct → bytes uses bincode; the canonical representation for hashing and index keys is managed separately via canonical_bytes(). The reason JSON-shaped values are stored wrapped in a String comes down to bincode's non-self-describing nature.

What you already know: per-type encoding in a DB

When a relational database executes INSERT INTO logs (id, timestamp, message) VALUES (1, now(), 'hello'), the engine encodes INTEGER as fixed bytes (4 bytes), TIMESTAMP in an internal time format, and TEXT as length + data — all written directly to a page. That encoding is buried inside the engine; you never have to think about it.

We don't have a DB engine. We have to decide "struct → bytes" ourselves.

DB ↔ Filesystem

In a DB, on-disk type encoding is baked into the engine (Postgres's heap tuple format, MySQL's InnoDB row format, etc.). In Quipu-Log, a serde + bincode combination converts Rust structs to bytes — this is the part we have to own when we're working with files instead of an engine.

Why bincode — comparison with JSON

JSON is the first thing that comes to mind for serialization. It's human-readable, familiar, and well-supported. Yet Quipu-Log uses bincode. Why?

Aspect	JSON	bincode
Human-readable	Yes	No (binary)
Size	Large (field names repeated, quotes, commas)	Small (no field names; types come from the struct definition)
Speed	Slower (text parsing overhead)	Faster (compact binary, little parsing)
Type information	Self-describing (key-value)	Non-self-describing (position-based)
serde integration	Yes	Yes

Audit logs accumulate hundreds of thousands of records. If each record is 100 bytes smaller, the impact on disk usage and throughput is real and measurable. The downside of bincode (non-readability) isn't a big deal here — segment files aren't meant to be opened directly; you read them through the query API.

Analogy

Think of shipping boxes. Labeling every item inside the box on the outside (JSON) is great for human inspection but wastes space. Putting items into numbered slots in a defined order (bincode) works because both sender and receiver have agreed "slot 3 always holds the timestamp" — no label needed.
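To make the slot idea concrete, here is a small hand-written sketch. It does not use bincode: the Record struct and all three functions are hypothetical, and the fixed-width little-endian layout with a u64 length prefix is only meant to resemble what a position-based format writes.

```rust
use std::convert::TryInto;

/// Hypothetical record, loosely modelled on the `logs` table above.
#[derive(Debug, PartialEq)]
pub struct Record {
    pub id: u32,
    pub timestamp: u64,
    pub message: String,
}

/// Labeled encoding: every field carries its name, JSON-style.
/// (No string escaping; good enough for a size comparison.)
pub fn encode_labeled(r: &Record) -> Vec<u8> {
    format!(
        r#"{{"id":{},"timestamp":{},"message":"{}"}}"#,
        r.id, r.timestamp, r.message
    )
    .into_bytes()
}

/// Positional encoding: fields in declaration order, no names.
pub fn encode_positional(r: &Record) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&r.id.to_le_bytes());
    out.extend_from_slice(&r.timestamp.to_le_bytes());
    out.extend_from_slice(&(r.message.len() as u64).to_le_bytes());
    out.extend_from_slice(r.message.as_bytes());
    out
}

/// Decoding only works because the reader assumes the same slot order.
pub fn decode_positional(bytes: &[u8]) -> Option<Record> {
    let id = u32::from_le_bytes(bytes.get(0..4)?.try_into().ok()?);
    let timestamp = u64::from_le_bytes(bytes.get(4..12)?.try_into().ok()?);
    let len = u64::from_le_bytes(bytes.get(12..20)?.try_into().ok()?) as usize;
    let end = 20usize.checked_add(len)?;
    let message = String::from_utf8(bytes.get(20..end)?.to_vec()).ok()?;
    Some(Record { id, timestamp, message })
}
```

For a record with id 1, timestamp 1700000000, and message "hello", the labeled form is 49 bytes and the positional form is 25. Nothing in the 25 bytes says which slot is which; the decoder has to know.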

serde: the glue layer

bincode itself doesn't know how to serialize any particular struct. That's serde's job. Attaching #[derive(Serialize, Deserialize)] makes serde's derive macros generate the Serialize and Deserialize implementations at compile time. bincode drives those implementations to write and read binary.

crates/quipu-core/src/model.rs
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AuditLog {
    pub log_id: Uid,
    pub timestamp: u64,
    pub actor: Uid,
    pub actor_type: String,
    pub method: String,
    pub url: String,
    pub content: Content,
    pub custom: BTreeMap<String, Value>,
}

Thanks to the derive macro, a single bincode::serialize(&log) call yields the record as a Vec<u8> (wrapped in a Result, since encoding can fail). Going the other way, bincode::deserialize(&bytes) reconstructs the struct. That's all the storage engine needs to do.

The non-self-describing trap: Json is stored as a String

bincode doesn't record field names — it relies on the struct definition's field order instead. That's the source of its size and speed advantage, but it comes with one trap.

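serde_json::Value is a type that can hold any JSON value. It might be {"key": 1}, or it might be [1,2,3]. To read one back, the deserializer has to be told by the data itself what comes next: an integer, an array, a map? serde_json::Value asks for exactly that through serde's deserialize_any, and a non-self-describing format cannot answer. bincode decides what to read from the struct definition alone, but the shape of a serde_json::Value is only known at runtime. Writing such a value with bincode succeeds; reading it back returns an error.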

Quipu-Log's solution is elegant.

crates/quipu-core/src/model.rs
// Public type: callers pass Value::Json(serde_json::Value)
#[serde(try_from = "ValueRepr", into = "ValueRepr")]
pub enum Value {
    Text(String),
    Number(f64),
    Json(serde_json::Value),
}

// On-disk representation: Json is converted to a String for storage
enum ValueRepr {
    Text(String),
    Number(f64),
    Json(String), // <-- String instead of serde_json::Value
}

serde(try_from, into) separates the public type (Value) from the on-disk representation (ValueRepr). On write, a JSON value is stringified with .to_string(); on read, it is parsed back with serde_json::from_str. The read direction is try_from rather than from because parsing a stored string can fail. The part bincode struggles with is reduced to a plain String.
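The two conversions that serde(try_from, into) relies on can be sketched with the standard library alone. This is an illustration under assumptions, not the code from model.rs: JsonArray is a hypothetical stand-in for serde_json::Value that only handles arrays of integers, and the error type is a plain String.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for serde_json::Value: arrays of integers only.
#[derive(Debug, Clone, PartialEq)]
pub struct JsonArray(pub Vec<i64>);

impl JsonArray {
    /// Render as JSON text, e.g. "[1,2,3]".
    fn to_text(&self) -> String {
        let items: Vec<String> = self.0.iter().map(|n| n.to_string()).collect();
        format!("[{}]", items.join(","))
    }

    /// Parse JSON text back; fails on anything that is not an integer array.
    fn parse(text: &str) -> Result<Self, String> {
        let inner = text
            .trim()
            .strip_prefix('[')
            .and_then(|s| s.strip_suffix(']'))
            .ok_or_else(|| format!("not an array: {text}"))?;
        if inner.trim().is_empty() {
            return Ok(JsonArray(Vec::new()));
        }
        inner
            .split(',')
            .map(|p| p.trim().parse::<i64>().map_err(|e| e.to_string()))
            .collect::<Result<Vec<i64>, String>>()
            .map(JsonArray)
    }
}

#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Text(String),
    Number(f64),
    Json(JsonArray),
}

#[derive(Debug, Clone, PartialEq)]
pub enum ValueRepr {
    Text(String),
    Number(f64),
    Json(String),
}

/// Write path (`into`): always succeeds, JSON becomes text.
impl From<Value> for ValueRepr {
    fn from(v: Value) -> Self {
        match v {
            Value::Text(s) => ValueRepr::Text(s),
            Value::Number(n) => ValueRepr::Number(n),
            Value::Json(j) => ValueRepr::Json(j.to_text()),
        }
    }
}

/// Read path (`try_from`): the stored text may not parse.
impl TryFrom<ValueRepr> for Value {
    type Error = String;

    fn try_from(r: ValueRepr) -> Result<Self, Self::Error> {
        Ok(match r {
            ValueRepr::Text(s) => Value::Text(s),
            ValueRepr::Number(n) => Value::Number(n),
            ValueRepr::Json(s) => Value::Json(JsonArray::parse(&s)?),
        })
    }
}
```

The asymmetry is the point of the pattern: going to disk cannot fail, while coming back from disk has to account for corrupt or unexpected text.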

The same pattern applies to Content (the log body).

canonical_bytes: the canonical representation for hashing and index keys

Separate from storage serialization, there's a need for a "byte representation of a value for comparison or hashing." For example, when a field is protected by SHA-256, a query checks "SHA-256 of probe value == stored SHA-256." Both the probe value and the stored value must be converted to bytes by the same method to produce the same hash.

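Can we reuse the bincode serialization? There's a problem. bincode's output for a value is more than the value: it includes framing such as the enum variant index and a length prefix, and it depends on the type definition and on how bincode is configured. Reorder the variants of Value or change the encoding options and every previously computed hash stops matching. For hashing we want bytes that depend only on the field's value.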

So Value has a dedicated method.

crates/quipu-core/src/model.rs
impl Value {
    /// Canonical byte representation used for hashing and index keys.
    pub fn canonical_bytes(&self) -> Vec<u8> {
        match self {
            Value::Text(s)   => s.as_bytes().to_vec(),
            Value::Number(n) => format!("{n}").into_bytes(),
            Value::Json(v)   => v.to_string().into_bytes(),
        }
    }
}

The method is simple, but it matters as a contract. A Text value of "hello" is always exactly 5 UTF-8 bytes; a Number of 42.0 is always the UTF-8 bytes of "42", because Rust's Display for f64 omits the trailing ".0". As long as that contract holds, searching a SHA-256-protected field works correctly. Ch. 26, blind indexes covers in detail how canonical_bytes feeds into token generation.
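Here is a rough sketch of the probe-versus-stored comparison. Two things are assumptions for illustration: Value is trimmed to two variants, and toy_digest uses the standard library's DefaultHasher as a stand-in because std has no SHA-256. The real engine hashes with SHA-256; only the shape of the check carries over.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Trimmed copy of Value (the Json variant is left out of this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Text(String),
    Number(f64),
}

impl Value {
    /// Same rules as the method shown above.
    pub fn canonical_bytes(&self) -> Vec<u8> {
        match self {
            Value::Text(s) => s.as_bytes().to_vec(),
            Value::Number(n) => format!("{n}").into_bytes(),
        }
    }
}

/// Stand-in digest. NOT SHA-256 and not cryptographic; hypothetical helper.
pub fn toy_digest(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(bytes);
    hasher.finish()
}

/// A query matches when the probe's digest equals the stored digest.
/// Both sides must go through canonical_bytes for this to work.
pub fn probe_matches(probe: &Value, stored_digest: u64) -> bool {
    toy_digest(&probe.canonical_bytes()) == stored_digest
}
```

At write time the engine would keep only the digest of Value::Text("alice").canonical_bytes(). At query time a probe of Value::Text("alice") goes through the same function and produces the same digest, so the match succeeds without the plaintext ever being stored.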

The same Value produces two distinct byte representations. Storage (bincode) and hashing/indexing (canonical_bytes) have different purposes, so they stay separate.
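The difference is easy to see side by side. The sketch below is written by hand and makes an assumption about layout: a 4-byte variant index followed by an 8-byte length prefix, little-endian, which resembles a fixed-width position-based encoding. The exact bytes bincode writes depend on its version and configuration, and both function names are hypothetical.

```rust
/// Roughly what a position-based encoder writes for an enum variant
/// that holds a string: variant index, length prefix, then the bytes.
pub fn storage_like_text(variant_index: u32, s: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(12 + s.len());
    out.extend_from_slice(&variant_index.to_le_bytes());
    out.extend_from_slice(&(s.len() as u64).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
    out
}

/// What canonical_bytes produces for a Text value: the UTF-8 bytes only.
pub fn canonical_text(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}
```

Under these assumptions "hello" takes 17 bytes in the storage-like form and 5 in the canonical form. Moving Text from variant 0 to variant 1 changes the storage-like bytes but leaves the canonical bytes, and therefore every hash built on them, untouched.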

StoredValue: the on-disk representation with protection applied

Registry fields can be stored as plaintext, a SHA-256 digest, an HMAC, or RSA ciphertext, depending on the schema. The type that holds all of these is StoredValue.

crates/quipu-core/src/model.rs
pub enum StoredValue {
    Plain(Value),
    Sha256(String),
    Hmac { key_version: u32, digest: String },
    Rsa  { key_version: u32, wrapped_key: String, nonce: String, ciphertext: String },
}

This also implements Serialize + Deserialize, so it gets stored in segments via bincode. Which protection scheme was used is encoded in the enum variant, so deserialization doesn't need to look up the schema separately. Part 6, confidentiality covers the cryptographic meaning of each variant in depth.
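A short sketch shows what "encoded in the enum variant" buys the reader of a segment. The enum mirrors the one above, with Value trimmed to a stand-in; the two helper functions are hypothetical and exist only to show that a match on the variant is enough, with no schema lookup.

```rust
/// Trimmed stand-in for the real Value type.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Text(String),
    Number(f64),
}

#[derive(Debug, Clone, PartialEq)]
pub enum StoredValue {
    Plain(Value),
    Sha256(String),
    Hmac { key_version: u32, digest: String },
    Rsa { key_version: u32, wrapped_key: String, nonce: String, ciphertext: String },
}

/// Hypothetical helper: name the protection scheme from the variant alone.
pub fn protection(v: &StoredValue) -> &'static str {
    match v {
        StoredValue::Plain(_) => "plain",
        StoredValue::Sha256(_) => "sha256",
        StoredValue::Hmac { .. } => "hmac",
        StoredValue::Rsa { .. } => "rsa",
    }
}

/// Hypothetical helper: keyed schemes carry the key version they used.
pub fn key_version(v: &StoredValue) -> Option<u32> {
    match v {
        StoredValue::Hmac { key_version, .. } | StoredValue::Rsa { key_version, .. } => {
            Some(*key_version)
        }
        StoredValue::Plain(_) | StoredValue::Sha256(_) => None,
    }
}
```

Because the match is exhaustive, adding a fifth protection scheme later makes the compiler point at every place that needs to handle it.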

Format stability: what happens when you change a struct?

Bincode's non-self-describing nature is both a strength and a hazard. Add, remove, or reorder fields in a struct and existing segments can no longer be read correctly: deserialization either fails or, worse, succeeds with values in the wrong fields. That's one of the reasons Quipu-Log embeds FORMAT_VERSION in the segment header. When the format changes, the version is bumped so that older files are clearly marked as "unreadable." See Ch. 18, storage layout and format versioning.
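A version gate of this kind can be sketched in a few lines. Everything specific here is an assumption for illustration: the value 1 for FORMAT_VERSION, the header layout (version in the first four bytes, little-endian), and the HeaderError type. The actual segment header is described in Ch. 18.

```rust
use std::convert::TryInto;

/// Assumed value; the real constant lives in the storage layer.
pub const FORMAT_VERSION: u32 = 1;

#[derive(Debug, PartialEq)]
pub enum HeaderError {
    TooShort,
    UnsupportedVersion { found: u32, expected: u32 },
}

/// Refuse to decode a segment whose header carries a different version.
/// Assumed layout: the first 4 bytes are the version, little-endian.
pub fn check_format_version(header: &[u8]) -> Result<(), HeaderError> {
    let raw: [u8; 4] = header
        .get(0..4)
        .ok_or(HeaderError::TooShort)?
        .try_into()
        .map_err(|_| HeaderError::TooShort)?;
    let found = u32::from_le_bytes(raw);
    if found == FORMAT_VERSION {
        Ok(())
    } else {
        Err(HeaderError::UnsupportedVersion { found, expected: FORMAT_VERSION })
    }
}
```

Checking the version before handing any payload bytes to the deserializer turns a silent misread into an explicit error.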

Why this design?

Self-describing formats (MessagePack, CBOR) can store field names alongside the data, which allows partial deserialization even when the struct changes. Choosing bincode was a size-and-speed-first decision. Audit logs are write-heavy with high record counts, so saving tens of bytes per record has a meaningful impact on total disk usage and throughput. Format stability is managed via FORMAT_VERSION, a deliberate acceptance of the trade-off.

Check yourself

① What does it mean to say bincode is "non-self-describing"? How does that differ from JSON?
② Why isn't Value::Json(serde_json::Value) stored directly with bincode, but instead converted to a String first?
③ Do canonical_bytes() and bincode::serialize always produce the same bytes for the same value? Explain the difference in their purposes.