Meri Leeworthy

ATProto

Bluesky and the AT Protocol - Usable Decentralized Social Media

System Architecture

Merkle Search Tree

Merkle Search Trees

Are used for…

1. Representing Repos

Each ATProto user has a personal repo, a log of their content (posts, profile info, likes, etc.). This repo is:

So the MST represents the current state of the repo, mapping keys like:

app.bsky.feed.post/3jkxga4k3...
app.bsky.actor.profile/self

to content records.

2. Efficient Sync

When two peers (or a client and PDS) want to sync:

This is much faster and lighter than sending full snapshots.

3. Inclusion Proofs

Because the MST is a Merkle tree, you can do:

This is critical for verifiable data hosting, especially if PDSs or clients don’t trust each other fully.


🧠 Implementation Details


🧪 Practical Example

When you post something:

  1. The post is assigned a URI like: at://did:plc:xyz/app.bsky.feed.post/abc123
  2. A record is created and added to the repo
  3. The MST is updated to include this new key
  4. A new commit is created with a pointer to the new MST root
  5. The commit is pushed to your PDS or other peers

📦 Summary

FeatureRole of Merkle Search Tree in ATProto
Data organisationMaps record keys to data efficiently
IntegrityEach commit signs the root hash of the MST
SyncAllows diffing and fetching only changed data
ProofsEnables verifiable inclusion/exclusion
If you’re interested in the actual implementation, it’s written in TypeScript here:
👉 atproto/packages/repo/src/mst

is this taking into account changes as of sync v1.1 https://github.com/bluesky-social/proposals/tree/main/0006-sync-iteration ?

Good catch. The explanation I gave was mostly based on Sync v1. The new Sync v1.1 proposal (from the ATProto proposal repo) does change the role of Merkle Search Trees in a few key ways. Here’s a concise breakdown of what has changed and what remains:


🔧 What’s changed in Sync v1.1 (ATProto Proposal 0006)?

1. ==MSTs are no longer used for sync diffs==

Quote from proposal:
“Rather than performing a diff between MSTs, we instead stream the log of records, with the minimal set of records needed to represent the difference between two commits.”

2. Sync now uses a new syncGetRepoUpdates endpoint

3. No longer strictly ordering by key


🪵 So what’s the role of MSTs now?

✅ Still used:

❌ No longer used:


🧠 Summary of Sync v1.1 Changes

FeatureSync v1Sync v1.1
Sync diffingMST diffCausal stream of records
OrderingBy keyBy causal write order
Inclusion proofsMST-basedStill MST-based
State verificationMST root hash in commitStill true
Key data structureMSTStill used, but not for syncing

💡 Implication

If you’re building anything that consumes or syncs repos:

Let me know if you want help parsing the syncGetRepoUpdates implementation or wiring it up to a CAR file stream parser.

Q

and the streaming is from PDSs, right? is this endpoint defined on the PDS?

Yes — the syncGetRepoUpdates endpoint is implemented on the PDS (Personal Data Server).

✅ Confirmed in Sync v1.1

The syncGetRepoUpdates endpoint:

From the spec (proposal 0006):
“The PDS exposes a syncGetRepoUpdates endpoint that streams a minimal CAR file representing the causal difference between two commits.”

This means:


🧱 So in short:

ComponentRole
syncGetRepoUpdatesXRPC method defined on PDS
ResponseCAR stream of DAG nodes (records, blocks)
Diffing logicMoved from client-side MST diffing → to causal commit traversal in the PDS
MSTStill in each commit (for verifiability), but not used for diffing

If you’re writing a sync client or mirror service, you now primarily need to:

  1. Track heads (commit CIDs),

  2. Call syncGetRepoUpdates,

  3. Parse the returned CAR file and apply changes.

Let me know if you want an example of using this endpoint or building a minimal sync client.

Q

and how do they calculate the set difference?

Great question — calculating the set difference (i.e. “what changed between commit A and commit B?”) in Sync v1.1 no longer involves walking Merkle Search Trees. Instead, it’s done by traversing the commit DAG between the two commits.

Here’s how the PDS calculates the difference:


🔁 How the PDS Calculates the Set Difference

Given two commit CIDs: from and to, the PDS:

  1. Finds the common ancestor commit of from and to in the repo’s DAG.

    • The repo is a causal DAG (similar to Git’s commit graph).

    • Finding the ancestor is similar to a lowest common ancestor problem.

  2. Traverses the DAG forward from the ancestor to to.

    • This traversal gives a topological order of all commits that occurred after the ancestor.

    • This guarantees causal ordering of record mutations.

  3. Accumulates record-level changes from those commits.

    • Each commit contains a set of record mutations (creates, updates, deletes).

    • These are bundled into a CAR file stream in the correct order.

  4. Returns a CAR stream that:

    • Includes the blocks (records, commits) needed to reconstruct the repo state at to,

    • But only includes the minimal necessary changes from from to to.


🧱 Data involved

Each commit block contains:

These mutations are used to build the final stream of changes.


💡 So the “set difference” is:


🧪 Example

Say we have:

A --- B --- C --- D
           \
            E --- F

If you request updates from C to F:


✳️ Bonus: Why this is better than MST diffing


Let me know if you want a diagram or code walkthrough of this traversal logic — it’s conceptually close to a topological sort on a DAG plus collecting node payloads.

I live and work on the land of the Wurundjeri people of the Kulin Nation. I pay respect to their elders past and present and acknowledge that sovereignty was never ceded. Always was, always will be Aboriginal land.

This site uses open source typefaces, including Sligoil by Ariel Martín Pérez, and Vercetti by Filippos Fragkogiannis