Welcome back to the permissioned data diary! Quick recap of where we’ve been: post one was about why we’re not doing E2EE. Post two was about why permissioned data needs a shared context with a perimeter, a new protocol-level concept that we’re calling a “bucket”.
I gave a teaser at the end of the last post:
a bucket doesn’t necessarily imply a physical container sitting on one PDS
This post is going to dive into exactly that: where the data in a bucket actually lives and where applications in the network fetch it from.
Some setup
Let’s carry on our example from last time. AtmoBoards is a forum app, and Alice wants to create a private community forum on there called “protocol nerds”. Alice creates a bucket to represent the forum and invites a few thousand other users, including Bob and Carol.
There are two fundamentally different ways you could design how this bucket is stored and synced. For this post I’ll be calling them colocated and partitioned.
A colocated bucket lives on the owner’s PDS. All writes from all members get sent to Alice’s PDS. All reads go through her PDS, and all apps sync from her PDS. Everything flows through one place. In other words everything in the bucket gets “colocated” on Alice’s PDS. Bob and Carol might keep copies of their own data on their PDSes, but those are essentially just backups. The authoritative source of truth for bucket contents is Alice’s PDS.
A partitioned bucket on the other hand exists as an abstraction rather than being localized on a single server. It exists across all of its members’ PDSes. Alice’s PDS (being the owner) may still play a special role in administering the bucket. But Alice, Bob, and Carol each store their partition of data for the bucket on their own PDS. Apps that want to sync the bucket have to go talk to each member’s PDS and stitch it together. The intuition for this is to think of how a post thread in Bluesky works. The posts are not colocated at the protocol level; they are constructed into a “thread” by virtue of the fact that they all reference the same root post.
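To make that thread analogy concrete, here’s a toy sketch. Everything in it (the record shape, the `bucket://` scheme, the example URIs) is made up for illustration, not actual protocol. The point is that records living on different PDSes form a bucket only by virtue of referencing the same bucket identifier, not by sitting in the same place:

```typescript
// Illustrative only: a record carries a reference to the bucket it belongs to,
// the same way a Bluesky reply carries a reference to its root post.
interface PostRecord {
  uri: string;    // where the record lives (the author's own PDS)
  bucket: string; // the shared bucket reference
  text: string;
}

// Records scattered across three different members' PDSes.
const allRecords: PostRecord[] = [
  { uri: "at://alice.example/post/1", bucket: "bucket://protocol-nerds", text: "welcome!" },
  { uri: "at://bob.example/post/1",   bucket: "bucket://protocol-nerds", text: "hi all" },
  { uri: "at://carol.example/post/1", bucket: "bucket://other-forum",    text: "unrelated" },
];

// The bucket is an abstraction: membership in the view is determined by the
// shared reference, not by physical colocation.
function bucketContents(records: PostRecord[], bucket: string): PostRecord[] {
  return records.filter((r) => r.bucket === bucket);
}
```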
The wrong kind of simple
When you first think of a bucket, or even a group or forum for that matter, it feels like a thing that exists somewhere. The model of a colocated bucket leans into that intuition. It’s the first instinct & null hypothesis.
Colocated buckets are also simpler technically. There’s one place to go to sync a bucket; you don’t have to run around to every single member’s PDS and sync each of their partitions of the bucket separately. We can have a single logical clock or cryptographic commitment on a bucket to help with sync rather than splintering that state over all members. There’s a single point of enforcement for access control. All reads and all writes go to the bucket owner who knows exactly who has access to the bucket. Without the bucket owner as authorization enforcer, this role gets diffused throughout all participating member PDSes and applications.
However, this technical simplicity masks a deeper confusion and complexity around the roles in the network and their responsibilities. Specifically, the role of Alice’s PDS changes from being responsible for Alice’s data to being responsible for the entire bucket’s data. Hosting data on behalf of every person on her forum is a completely different kind of commitment.
This grates against our atproto ethos. But it isn’t just philosophically awkward, it cascades into some pretty gnarly real-world problems.
Resource usage
To start with, Alice’s PDS has to store everything for the forum! Every post from every member. For a small group of friends that’s fine. For a community of 100k people, Alice’s PDS is now running infrastructure and storing data at a scale that has nothing to do with Alice’s own usage.
This problem is even worse if Alice is a self-hoster. Her PDS’s job just radically changed. She now has people she doesn’t know, posting content that she can’t control directly into her PDS. She didn’t sign up for this! She doesn’t want to be on the hook for thousands of users’ posts and likes and images. She just wanted to create a forum for her protocol nerd friends.
Moderation
In atproto, applications generally take on the responsibility for network-wide moderation. PDSes may do some moderation as well, but PDSes never need to moderate anything on the network not generated by an account hosted on that PDS. With colocated buckets, Alice’s PDS is suddenly on the hook for hosting and serving data for people that it has no relationship with - no account, no email or method of contact, and no Terms of Service.
Blobs/Media
As a special case of resource usage & moderation, consider blobs/media. If Alice’s PDS is hosting the bucket, it’s hosting Bob’s uploaded images too. Now Alice’s PDS has copyright exposure, CSAM risk, etc. Bob can upload giant files that count against Alice’s storage quota, not his.
We could say colocated buckets are only for record data and blobs are always stored on the author’s PDS. However, this gets us back to the same difficulties we encounter with partitioned buckets, though perhaps at a lower magnitude.
Takedowns
When Bob gets taken down at the hosting layer, his PDS’s job right now is simple: stop serving his content. For colocated buckets, it becomes: fan out a removal request to every bucket Bob participates in, have those PDSes honor it, and coordinate reinstatement if Bob migrates to a new PDS. Not only can that get complicated, but it’s a strange and awkward thing for Bob’s PDS to do. Its responsibility transitions from “deciding whether to provide a service to Bob” to “actively coordinating data removal across the network”.
Operational dependency
Colocated buckets introduce an operational dependency on the owner’s PDS. When Bob posts, he has to send that post to Alice’s PDS. If Alice’s PDS is down, what happens? One option is that the post fails. Now Bob is experiencing failures because of a service that he has no real relationship with. The other option is that the post buffers on Bob’s PDS and is sent to Alice’s PDS when it comes back online. This requires a more complicated PDS<>PDS sync mechanism (at the least, an outbox). Even in this case, the group will receive no updates until Alice’s PDS comes back online.
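A toy sketch of that buffering option, with `deliver` standing in for the real PDS-to-PDS call (nothing here is actual protocol, just the shape of the extra machinery an outbox implies):

```typescript
type Post = { id: string; text: string };

// A minimal outbox: queue writes when the owner's PDS is unreachable,
// flush them in order once it comes back.
class Outbox {
  private queue: Post[] = [];

  send(post: Post, deliver: (p: Post) => boolean): void {
    // try immediate delivery; on failure, buffer the post for later
    if (!deliver(post)) this.queue.push(post);
  }

  flush(deliver: (p: Post) => boolean): void {
    // retry buffered posts in order, stopping at the first failure
    // so ordering is preserved
    while (this.queue.length > 0) {
      if (!deliver(this.queue[0])) return;
      this.queue.shift();
    }
  }

  pending(): number {
    return this.queue.length;
  }
}
```

Note that even this sketch only covers Bob’s side; the group still sees nothing new until the owner’s PDS is back.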
This is all difficult to communicate to users. A user has a relationship with their app and their PDS, but they have no real relationship with a bucket owner’s PDS, and now their experience of using their app is downstream of that service.
Trust
Say Carol wants to read a post from Bob on the forum. In the colocated model, her app fetches it from Alice's PDS, a service she has no relationship with. Carol trusts her app to some degree, and she probably trusts Bob's PDS to accurately represent Bob's content. But Alice's PDS is a total stranger in this transaction. Colocated buckets introduce an extra “hop” on the way from the author’s PDS to the application, specifically a hop that a reader doesn’t necessarily trust.
The way we handle these untrusted hops in the public protocol is to asymmetrically sign data. However, I strongly consider asymmetric signatures on permissioned data to be an anti-pattern (about which I’ll have more to say in a later post). But even if we were to sign it, signing data and storing it on a wide variety of other PDSes has the downstream effect of dramatically increasing the complexity of key rotation and therefore account migration.
The right kind of complex
(does that header inspire confidence?)
In contrast, partitioned buckets are more complex technically but also more honest in how they divvy up responsibilities. They are much closer to our “atproto ethos”. Users own their data. It lives on their PDS. Applications crawl it.
The whole model of the atmosphere is that apps construct views by aggregating records from a fluid and universally addressed hosting layer made up of many PDSes.
Partitioned buckets are that same model, with access control layered on top. An app that wants to sync the Protocol Nerds forum goes and talks to Alice’s PDS, Bob’s PDS, Carol’s PDS, and all the others and stitches together the view. This is more work for the app. But it also mirrors how atproto works for public data.
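Here’s a hypothetical sketch of what that stitching could look like, with an in-memory map standing in for per-PDS requests. All the names and DIDs are illustrative; the point is just the shape of the work: fetch each member’s partition, merge, and order into one view:

```typescript
type BucketRecord = { author: string; createdAt: string; text: string };

// Stand-in for the network: each member's PDS holds only that member's
// partition of the bucket.
const pdsData: { [did: string]: BucketRecord[] } = {
  "did:example:alice": [
    { author: "did:example:alice", createdAt: "2025-01-01T10:00:00Z", text: "first" },
  ],
  "did:example:bob": [
    { author: "did:example:bob", createdAt: "2025-01-01T11:00:00Z", text: "second" },
  ],
  "did:example:carol": [], // a member who hasn't posted yet
};

function syncBucket(members: string[]): BucketRecord[] {
  const merged: BucketRecord[] = [];
  for (const did of members) {
    // in reality this would be a request to that member's PDS
    merged.push(...(pdsData[did] ?? []));
  }
  // stitch the partitions into a single chronological view
  return merged.sort((a, b) => a.createdAt.localeCompare(b.createdAt));
}
```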
The tradeoff is the complexity. Apps need to maintain relationships with many PDSes. Sync state is harder to reason about. And access control starts to rear its ugly head.
These are hard problems. But they are also engineering problems. They’re the kind of hard that a protocol can hopefully solve. To my eye, the problems with colocated buckets are architectural. They’re breakdowns in who’s responsible for what and violations of trust assumptions at the core of atproto. I don’t think they’re things we can engineer our way out of.
What's next?
What’s next is going to be about trying to wrangle this complexity! I have a few topics in mind:
Bucket & record addressing: How do we name and locate buckets? How do we resolve a permissioned data URI to a record?
Access control: How do buckets track who can read and write? How do apps prove they’re allowed to read?
Sync protocol: How does the data actually get from a PDS to the app? If we’re not signing it, what are we doing? If we’re not using an MST, what are we using?
Just on a personal note, I’ve been enjoying putting these out and hearing thoughts and feedback from everyone. Big thanks to everyone who’s been following along. If it seems like we don’t have all the answers yet, well… we don’t. I’m writing these more or less as we gain confidence in decisions. But things are slowly starting to shape up, and I think we’re starting to circle around a pretty complete picture for a permissioned data protocol.
Next week (ahead of AtmosphereConf), I’ll put out a special edition of the permissioned data diary that includes a low-resolution sketch of that complete picture. It won’t be as in-depth as the previous posts or try to explain how we reached certain decisions. Instead, it’ll serve as a high-level signpost and should give us something to chat about in Vancouver.
Stay tuned!