Title: “FROST: Federated Registry Of Scientific Things”
Speaker: Tom Nicholas (ORCID: 0000-0002-2176-0530)
When: Wednesday, February 12, 2025 at 4 PM EST
Context: The easiest way to store and provide access to big scientific datasets is via ARCO (Analysis-Ready, Cloud-Optimized) data in S3-compatible cloud object storage. We now have scalable cloud-optimised formats that are version-controlled at rest in object storage (particularly Icechunk for arrays and Iceberg for tables). This is huge, as even dynamically-updated datasets can now be distributed via raw S3, with no other server needed. All the data providers who are paying attention are about to put their data in these formats, but they will then try to advertise the S3 URLs to the world via ad-hoc data catalogs.
Problem: Everyone’s catalogs are disconnected from everyone else’s.
This means:
- No cross-org discoverability (e.g. NASA catalog users won’t see NOAA datasets or vice versa).
- No cross-org tracking of updates (e.g. NOAA datasets derived directly from NASA datasets won’t automatically know if the NASA datasets have been updated upstream).
- Risk of “catalog wars” where platform services compete to make more and more comprehensive “meta-catalogs” which merely track (outdated) links to other orgs’ data.
- Risk that if one platform does win, everyone might feel locked into it via the social network effect.
Solution: A federated catalog protocol with a cross-org publish-subscribe model.
- Cross-org discoverability enabled via displaying the contents of the dataset entries being broadcast,
- Cross-org tracking of updates to datasets enabled the same way,
- No need to compete to make a better catalog, as anyone can easily consume and display the entire global catalog, including updates,
- Federated trust model allows proliferation of high-quality centralized services, whilst also guarding against platform lock-in.
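To make the publish-subscribe idea concrete, here is a minimal in-memory sketch in Python. It is not a proposed design: the names `DatasetEntry`, `Feed`, and `Catalog`, the entry fields, the bucket URLs, and the version labels are all hypothetical, and a real protocol would need networking, identity, and trust, none of which is modelled here. It only illustrates how one org's catalog could both display another org's entries and notice upstream updates by following that org's broadcast feed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Key = Tuple[str, str]  # (org, dataset name)


@dataclass(frozen=True)
class DatasetEntry:
    """A catalog entry broadcast by an org (hypothetical schema)."""
    org: str
    name: str
    url: str                          # e.g. an S3 URL for the dataset
    version: str                      # e.g. a snapshot identifier
    derived_from: Tuple[Key, ...] = ()  # upstream datasets, if any


class Feed:
    """An org's outgoing feed; every subscriber receives each published entry."""

    def __init__(self, org: str) -> None:
        self.org = org
        self._subscribers: List[Callable[[DatasetEntry], None]] = []

    def subscribe(self, callback: Callable[[DatasetEntry], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, entry: DatasetEntry) -> None:
        for callback in self._subscribers:
            callback(entry)


class Catalog:
    """A local view of the shared catalog, built only from followed feeds."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self.entries: Dict[Key, DatasetEntry] = {}
        self.stale: Set[Key] = set()  # own datasets whose upstream has changed

    def follow(self, feed: Feed) -> None:
        feed.subscribe(self.receive)

    def receive(self, entry: DatasetEntry) -> None:
        key = (entry.org, entry.name)
        previous = self.entries.get(key)
        self.entries[key] = entry
        if entry.org == self.owner:
            # Republishing our own dataset clears its stale flag.
            self.stale.discard(key)
        if previous is not None and previous.version != entry.version:
            # Flag our own datasets that were derived from the updated one.
            for k, e in self.entries.items():
                if e.org == self.owner and key in e.derived_from:
                    self.stale.add(k)


# Usage with placeholder orgs, URLs, and versions.
nasa, noaa = Feed("NASA"), Feed("NOAA")
noaa_catalog = Catalog("NOAA")
noaa_catalog.follow(nasa)
noaa_catalog.follow(noaa)

nasa.publish(DatasetEntry("NASA", "sst", "s3://example-nasa/sst", "v1"))
noaa.publish(DatasetEntry("NOAA", "sst-anomaly", "s3://example-noaa/sst-anomaly",
                          "v1", derived_from=(("NASA", "sst"),)))
nasa.publish(DatasetEntry("NASA", "sst", "s3://example-nasa/sst", "v2"))

print(sorted(noaa_catalog.entries))  # both orgs' datasets are visible
print(noaa_catalog.stale)            # the derived dataset is flagged
```

In this sketch the NOAA catalog lists the NASA dataset without anyone copying links by hand, and the derived NOAA dataset is flagged as soon as NASA publishes a new version.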
How do we build it?: Not sure exactly, but the problem is analogous to creating federated alternatives to centralized social media (i.e. Bluesky/Mastodon vs Twitter). Perhaps we can piggyback off Bluesky’s AT Protocol (atproto) or the ActivityPub protocol used by Mastodon?
This showcase will take the form of a short talk followed by a moderated community discussion.