Do you remember the good old times of Twitter? You could fetch data through the API in real time, allowing people to build tools on top of it. These times are back. Now, with Bluesky, you can do the same.
What is Bluesky? Bluesky is a social network like Twitter and Threads, but unlike them, it is fully open source. At the time of writing, it is growing by about 1 million new users daily, and we can all follow along with the numbers and create new tools. In this article, we do exactly that and look at what the community is building.
What is Bluesky
Not only is Bluesky, the social app for web, Android, and iOS, open source, but so is the protocol that everything is based on, called ATProto. If Bluesky goes down, the protocol and your posts/data stay, and a new UI can be built. Two alternative apps are already built on top of ATProto: Frontpage, a Hacker News alternative, and Smoke Signal, an RSVP management app.
These don’t use all the features ATProto provides, only specific information about the user and whatever helps the app serve its particular purpose. You can also start cross-using or displaying information from the protocol. For example, for each meetup you could show posts with a specific hashtag or people from a particular area. The use cases are endless.
How does it work?
Another feature of Bluesky and ATProto is decentralization. Bluesky revolutionized this with ATProto. Although, by default, your content is hosted on a Bluesky Personal Data Server (PDS), everyone can host their content on their own server, and your handle is the interface, just as a domain name is on the web.
Interestingly, this approach is a return to the old web, with more power to the people and less to the big social media companies that control everything. Dan illustrates this best in his video about Web Without Walls, showcasing it with blogs you own that link to other blogs and websites, from your server to someone else's. Today, centralized social media platforms host and own all your content on their servers; without them, your content is lost, too.
Illustration going from websites to centralized social media platforms to a decentralized AT Protocol.
Decentralization and self-hosting are achieved with the so-called Personal Data Server (PDS), which is also open source. Interestingly, each user’s data is stored in its own SQLite database. This means there are around 19 million of them as of now, but when you run your own, you could implement it with any backend, e.g., DuckDB.
You can check all your artifacts, such as posts, likes, the events mentioned above, or Frontpage interactions, in the ATProto Browser. E.g., for my handle, this looks like this:
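The same records can also be listed from code. Here is a minimal Python sketch using only the standard library. It assumes the `com.atproto.repo.listRecords` XRPC endpoint and that accounts hosted by Bluesky can be read through `https://bsky.social` (a self-hosted PDS would use its own base URL). The function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: repos hosted by Bluesky are readable through bsky.social;
# pass a different base URL for a self-hosted PDS.
DEFAULT_PDS = "https://bsky.social"


def list_records_url(repo, collection, limit=10, pds=DEFAULT_PDS):
    """Build the listRecords URL for one collection of a repo (handle or DID)."""
    query = urlencode({"repo": repo, "collection": collection, "limit": limit})
    return f"{pds}/xrpc/com.atproto.repo.listRecords?{query}"


def list_records(repo, collection, limit=10, pds=DEFAULT_PDS):
    """Fetch records (needs network access) and return (uri, value) pairs."""
    with urlopen(list_records_url(repo, collection, limit, pds)) as response:
        payload = json.load(response)
    return [(record["uri"], record["value"]) for record in payload.get("records", [])]
```

Calling `list_records("ssp.sh", "app.bsky.feed.post")` should return the latest post records; swapping the collection for `app.bsky.feed.like` should list likes instead.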
Philosophy and Working Without a Massive Algorithm
Before we get into some code examples, here is a quick note on the philosophy behind Bluesky and how it differs from Twitter, Instagram, and LinkedIn. Instead of one colossal algorithm deciding what we see and what we don't, Bluesky works based on people and feeds. Feeds are either created by someone in the community or provided by Bluesky (e.g., popular with friends, quiet posters, likes of likes, etc.), and you can also create your own.
This way, you are in control of what you see. The “Discover” feed is closest to other social media algorithms.
Coding Time: Discover the Open APIs and Streams
Let’s have some fun.
Not only is everything open-source but the APIs and Jetstreams (streams of posts, likes, etc.) can also be queried for free. Let’s explore some hands-on examples.
Reading Posts with DuckDB Directly
To illustrate, you can read posts directly with DuckDB, e.g., my last 5 posts:
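For readers who prefer Python, the same kind of request can be sketched with the standard library. This is a stand-in, not the DuckDB query: it assumes the public `app.bsky.feed.getAuthorFeed` endpoint and a response shaped like `feed[].post.record`, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PUBLIC_API = "https://public.api.bsky.app/xrpc"


def author_feed_url(actor, limit=5):
    """Build the getAuthorFeed URL; `actor` is a DID or a handle."""
    query = urlencode({"actor": actor, "limit": limit})
    return f"{PUBLIC_API}/app.bsky.feed.getAuthorFeed?{query}"


def post_summaries(payload):
    """Reduce a getAuthorFeed response to (created_at, text) tuples."""
    summaries = []
    for item in payload.get("feed", []):
        record = item["post"]["record"]
        summaries.append((record.get("createdAt"), record.get("text", "")))
    return summaries


def last_posts(actor, limit=5):
    """Fetch and summarize the latest posts (needs network access)."""
    with urlopen(author_feed_url(actor, limit)) as response:
        return post_summaries(json.load(response))
```

`last_posts("ssp.sh")` should then return up to five `(created_at, text)` pairs.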
To find the unique Bluesky ID (the DID; the handle is just a friendly name) that you need for the above query, we can open this GET request (change the handle in the URL), or we can do it with DuckDB and the community extension http_client:
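As a plain-Python alternative, here is a small sketch of the lookup. It assumes the public `com.atproto.identity.resolveHandle` endpoint, which returns a JSON object with a `did` field; the function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PUBLIC_API = "https://public.api.bsky.app/xrpc"


def resolve_handle_url(handle):
    """Build the resolveHandle URL for a handle such as 'ssp.sh'."""
    return f"{PUBLIC_API}/com.atproto.identity.resolveHandle?{urlencode({'handle': handle})}"


def resolve_handle(handle):
    """Return the DID behind a handle (needs network access)."""
    with urlopen(resolve_handle_url(handle)) as response:
        return json.load(response)["did"]


# Example call (commented out so the snippet runs offline):
# print(resolve_handle("ssp.sh"))
```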
Just replace my handle (ssp.sh) with yours.
Most Engagement with the Latest 100 Posts
Or find the posts with the most engagement with this query (full query here) and plot a little bar chart, which comes included with DuckDB:
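To make the idea concrete without the SQL, here is a rough Python sketch that ranks posts and draws a text bar chart. It is not the query from the article. It assumes engagement means likes + reposts + replies + quotes (my definition) and that each feed item carries `likeCount`, `repostCount`, `replyCount`, and `quoteCount` next to `record`, as in a `getAuthorFeed` response.

```python
COUNT_FIELDS = ("likeCount", "repostCount", "replyCount", "quoteCount")


def engagement(post):
    """Sum the interaction counters of one post view (missing counters count as 0)."""
    return sum(post.get(field) or 0 for field in COUNT_FIELDS)


def top_posts(feed, n=5):
    """Return the n most engaging posts as (text snippet, score) pairs."""
    posts = [item["post"] for item in feed]
    ranked = sorted(posts, key=engagement, reverse=True)[:n]
    return [(post["record"].get("text", "")[:40], engagement(post)) for post in ranked]


def bar_chart(rows, width=30):
    """Render (label, score) pairs as a text bar chart scaled to the top score."""
    top = max((score for _, score in rows), default=0) or 1
    lines = []
    for label, score in rows:
        bar = "#" * round(score / top * width)
        lines.append(f"{label:<40} {bar} {score}")
    return "\n".join(lines)
```

Feeding the `feed` list of an API response into `top_posts` and printing `bar_chart(...)` gives a quick overview in the terminal.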
For the full query, see Querying Bluesky with DuckDB and SQL.
That looks something like this:
Note: The API returns at most around 100 posts per request, so if you want more than 100, you’ll need to paginate or write code.
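Pagination could look roughly like the following Python sketch. It assumes the endpoint returns a `cursor` field that can be passed back to get the next page, as `getAuthorFeed` does. `paginate` and `author_feed_fetcher` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PUBLIC_API = "https://public.api.bsky.app/xrpc"


def paginate(fetch_page, max_items=250, page_size=100):
    """Collect up to max_items feed items.

    fetch_page(cursor, limit) must return a dict with a 'feed' list and,
    when more data is available, a 'cursor' string.
    """
    items, cursor = [], None
    while len(items) < max_items:
        limit = min(page_size, max_items - len(items))
        page = fetch_page(cursor, limit)
        feed = page.get("feed", [])
        items.extend(feed)
        cursor = page.get("cursor")
        if not cursor or not feed:
            break  # no more pages
    return items[:max_items]


def author_feed_fetcher(actor):
    """Return a fetch_page function for one author's feed (needs network access)."""
    def fetch_page(cursor, limit):
        params = {"actor": actor, "limit": limit}
        if cursor:
            params["cursor"] = cursor
        url = f"{PUBLIC_API}/app.bsky.feed.getAuthorFeed?{urlencode(params)}"
        with urlopen(url) as response:
            return json.load(response)
    return fetch_page
```

`paginate(author_feed_fetcher("ssp.sh"), max_items=250)` would then walk through three requests. Keeping the fetcher separate makes the loop easy to test with fake pages.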
Using Python for interacting with the AT Protocol
Data people have gathered around the hashtags #datasky and #databs. If we want to read or write to these streams, we can use Python and the Python SDK. For example, publishing a “Post” can be done simply with:
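A minimal sketch of publishing with the `atproto` Python SDK could look like this. It relies on `Client.login` and `Client.send_post` from the SDK. The environment variable names `BSKY_HANDLE` and `BSKY_APP_PASSWORD` and the `check_post_text` helper are my own inventions, and the length check uses `len()`, which only approximates Bluesky's 300-grapheme limit.

```python
import os

MAX_POST_LENGTH = 300  # Bluesky's limit is counted in graphemes


def check_post_text(text):
    """Reject empty or over-long text; len() only approximates the grapheme count."""
    text = text.strip()
    if not text:
        raise ValueError("post text is empty")
    if len(text) > MAX_POST_LENGTH:
        raise ValueError(f"post text is {len(text)} characters, limit is {MAX_POST_LENGTH}")
    return text


def publish(text):
    """Log in and publish one post (needs `pip install atproto` and credentials)."""
    from atproto import Client  # imported lazily so check_post_text works without the SDK

    client = Client()
    # Hypothetical variable names; use an app password rather than your main password.
    client.login(os.environ["BSKY_HANDLE"], os.environ["BSKY_APP_PASSWORD"])
    return client.send_post(text=check_post_text(text))
```

With both variables set, `publish("Hello #databs from Python")` sends the post.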
It looks like this in real life - see the post here - try it, it’s fun.
A Firehose or Live Stream of Posts
If you want all messages, you can subscribe to the stream with this Python code: firehose.py. It will stream everything and looks like this:
If you want a stream dedicated to the two data hashtags, #datasky and #databs, use hashtag_databs.py, which catches all posts sent with either hashtag, e.g., below my test post:
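The filtering idea can be sketched as follows. This is not the hashtag_databs.py script: it is a simplified variant that reads Jetstream's JSON events instead of the raw firehose. It assumes the public instance `jetstream2.us-east.bsky.network`, events shaped like `{"did", "commit": {"operation", "collection", "record"}}`, hashtags stored as `app.bsky.richtext.facet#tag` features, and the third-party `websockets` package for the live part.

```python
import json

# Assumption: one of the public Jetstream instances, limited to post records.
JETSTREAM_URL = (
    "wss://jetstream2.us-east.bsky.network/subscribe"
    "?wantedCollections=app.bsky.feed.post"
)
WANTED_TAGS = {"datasky", "databs"}


def hashtags(event):
    """Return the lowercase hashtags of a post-creation event, else an empty set."""
    commit = event.get("commit") or {}
    if commit.get("operation") != "create" or commit.get("collection") != "app.bsky.feed.post":
        return set()
    found = set()
    for facet in (commit.get("record") or {}).get("facets") or []:
        for feature in facet.get("features") or []:
            if feature.get("$type") == "app.bsky.richtext.facet#tag":
                found.add(str(feature.get("tag", "")).lower())
    return found


def is_data_post(event, wanted=WANTED_TAGS):
    """True if the event is a new post carrying one of the wanted hashtags."""
    return bool(hashtags(event) & set(wanted))


def stream_data_posts():
    """Print matching posts until interrupted (needs network and `pip install websockets`)."""
    from websockets.sync.client import connect  # lazy import, optional dependency

    with connect(JETSTREAM_URL) as socket:
        for message in socket:
            event = json.loads(message)
            if is_data_post(event):
                print(event["did"], event["commit"]["record"].get("text", ""))
```

Filtering on facets rather than on the raw text avoids false matches on words that merely contain the tag name.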
Streaming and Uploading to #databs to MotherDuck
I also created streaming_into_motherduckdb.py, which listens for both hashtags, writes the posts to Parquet files, and uploads them to a public DuckDB database hosted on MotherDuck. If you create a free account, you can attach my shared DuckDB database with ATTACH 'md:_share/bsky/c07e1ca0-6b51-4906-96cd-b310ec35e562' as md_bsky and query a couple of posts I uploaded as a test:
You could do the same within MotherDuck’s platform and make use of the visualization features and the benefits of the collaborative notebook approach.
You can also use Jake’s great collection, where he shares the Jetstream data on Cloudflare R2 so that anyone can query it openly with DuckDB:
It also works in the browser - check it here: DuckDB Wasm - DuckDB:
Image by Jake
What are people building?
There are currently many collaboration efforts going on, and new things are shared hourly among the new, friendly Bluesky community. Many people try to help each other and build the best data tooling around Bluesky and ATProto. Here are the ones I came across lately (I’m sorry if I forgot anyone):
- David is building atproto-data-tools: 🦋 Small scripts and tools to do data stuff with the AT Protocol.
- JavaScript implementation: Consuming the firehose for less than $2.50/mo
- Jake Thomas providing the first R2 catalog, see his post
- Victoriano is visualizing the posts in a network graph with Graphext. David did a subset for #databs and #datasky here
- Bluesky examples with Python: atproto/examples
I hope we can work together collaboratively and build the best Bluesky tools for data people. If not us, then who?
Also, there are many more tools around charting, browsing starter packs, etc. I currently collect and update them regularly here:
- Charting, user stats, network analyzer, directory, and many more: Bluesky Tools
- ATProto-related tools, like exporting posts to S3, a TUI, and many more: AT Proto Tools
Full article published at MotherDuck.com - written as part of my services
