Building with Bluesky: Inside the New Open Social Network

How to Extract Analytics from Bluesky based on AT Protocol

This article was written as part of my services

Do you remember the good old times of Twitter? You could fetch data through the API in real time, allowing people to build tools on top of it. These times are back. Now, with Bluesky, you can do the same.

What is Bluesky? Bluesky is a social network like Twitter and Threads, but unlike them, it is fully open source. At the time of writing, it is growing by 1 million new users daily, and we can all follow along with the numbers and create new tools. In this article, we do exactly that and look at what the community is building.

Live post visualized in 3D, made with Bluesky Firehose

What is Bluesky

Not only is Bluesky, the social app for web, Android, and iOS, open source, but so is the protocol everything is built on, called ATProto. If Bluesky goes down, the protocol and your posts and data stay, and a new UI can be built. Two alternative apps are already built on top of ATProto: Frontpage, a Hacker News alternative, and Smoke Signal, an RSVP management app.

These apps don’t use every feature ATProto provides, only specific information about the user and the data that helps the app serve its particular purpose. You can also start cross-using or displaying information from the protocol. For example, a meetup app could show posts with a specific hashtag or people from a particular area for each meetup. The use cases are endless.

How does it work?

Another feature of Bluesky and ATProto is decentralization. Bluesky revolutionized this with ATProto. By default, your content is hosted on a Personal Data Server (PDS) run by Bluesky, but everyone can host their content on their own server, and the interface is your handle, just as it was with the web.

Interestingly, this approach is a way back to the old web, with more power for the people and less for the prominent social media companies that control everything. Dan illustrates this best in his video about Web Without Walls, showcasing it with blogs you own, interlinked with other blogs and websites from one server to another. Today, centralized social media platforms host and own all your content on their servers; without them, your content is lost, too.


Illustration going from websites to centralized social media platforms to a decentralized AT Protocol.

Decentralization and self-hosting are made possible by the so-called Personal Data Server (PDS), which is also open source. Interestingly, each user’s data is stored in its own SQLite database, which means there are around 19 million of them as of now. When you run your own PDS, though, you could implement it with any backend, e.g., DuckDB. 😉
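To make the link between an account and its server concrete, here is a minimal Python sketch of how a client could find out which PDS hosts an account. It assumes the account uses a did:plc identifier (the unique ID behind a handle), that its DID document is served by the public plc.directory, and that the document lists the PDS as a service of type AtprotoPersonalDataServer. The function names are my own and not part of any SDK.

```python
import json
import urllib.request
from typing import Optional

# Assumption: did:plc identifiers are looked up in the public PLC directory.
PLC_DIRECTORY = "https://plc.directory"


def fetch_did_document(did: str) -> dict:
    """Download the DID document for a did:plc identifier (network call)."""
    with urllib.request.urlopen(f"{PLC_DIRECTORY}/{did}", timeout=10) as resp:
        return json.load(resp)


def pds_endpoint(did_document: dict) -> Optional[str]:
    """Return the PDS URL listed in a DID document, or None if there is none."""
    for service in did_document.get("service", []):
        if service.get("type") == "AtprotoPersonalDataServer":
            return service.get("serviceEndpoint")
    return None
```

Calling pds_endpoint(fetch_did_document(did)) with an account's DID should return the server that hosts that account's data, whether it is run by Bluesky or self-hosted.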

Check out ATProto Browser to see all artifacts attached to the protocol

On the ATProto Browser, you can check all the artifacts attached to your account, such as posts and likes, as well as the events or Frontpage interactions mentioned above. For my handle, it looks like this:

Philosophy and Working Without a Massive Algorithm

Before we get into some code examples, here is a quick note on the philosophy behind Bluesky and how it differs from Twitter, Instagram, and LinkedIn. Instead of one colossal algorithm deciding what we see and what we don’t, Bluesky works based on people and feeds. Feeds are either provided by Bluesky (e.g., popular with friends, quiet posters, likes of likes) or created by someone in the community, and you can also create your own.

This way, you are in control of what you see. The “Discover” feed is closest to other social media algorithms.

Coding Time: Discover the Open APIs and Streams

Let’s have some fun.

Not only is everything open source, but the APIs and Jetstreams (streams of posts, likes, etc.) can also be queried for free. Let’s explore some hands-on examples.

Reading Posts with DuckDB Directly

To illustrate, you can read posts directly with DuckDB, e.g., my latest 10 posts:

SELECT * FROM read_json_auto('https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed?actor=did:plc:edglm4muiyzty2snc55ysuqx&limit=10')

To find the unique Bluesky ID, the DID (the handle is just a friendly name), that you need for the query above, you can open this GET request in your browser (change the handle in the URL), or do it in DuckDB with the community extension http_client:

INSTALL http_client FROM community;
LOAD http_client;
WITH __input AS (
  SELECT
    http_get('https://public.api.bsky.app/xrpc/com.atproto.identity.resolveHandle?handle=ssp.sh') AS res
)
SELECT
  res::json->>'body' AS identity_json
FROM __input;

identity_json                             
------------------------------------------
{"did":"did:plc:edglm4muiyzty2snc55ysuqx"}

Just replace my handle (ssp.sh) with yours.
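The same lookup works from Python with nothing but the standard library. This is a rough sketch rather than the SDK way of doing it: the helper names are my own, and only the endpoint and the shape of the response are taken from the query above.

```python
import json
import urllib.parse
import urllib.request

API = "https://public.api.bsky.app/xrpc"


def resolve_handle_url(handle: str) -> str:
    """Build the resolveHandle URL for a handle such as 'ssp.sh'."""
    query = urllib.parse.urlencode({"handle": handle})
    return f"{API}/com.atproto.identity.resolveHandle?{query}"


def parse_did(body: str) -> str:
    """Extract the DID from a response body like {"did": "did:plc:..."}."""
    return json.loads(body)["did"]


def resolve_handle(handle: str) -> str:
    """Resolve a handle to its DID via the public API (network call)."""
    with urllib.request.urlopen(resolve_handle_url(handle), timeout=10) as resp:
        return parse_did(resp.read().decode("utf-8"))
```

With network access, resolve_handle('ssp.sh') should give the same DID as the DuckDB output above.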

Most Engagement with the Latest 100 Posts

Or find your most engaging posts with this query (full query here) and plot a little bar chart with the bar function that comes included with DuckDB:

...
SELECT 
  post_uri,
  created_at,
  total_engagement,
  bar(total_engagement, 0, 
      (SELECT MAX(total_engagement) FROM engagement_data), 
      30) as engagement_chart,
  replies, reposts, likes, quotes,
  post_text,

FROM engagement_data
WHERE handle = 'ssp.sh'
ORDER BY total_engagement DESC
LIMIT 30;

For the full query, see Querying Bluesky with DuckDB and SQL.

That looks something like this:

Note: The API returns at most 100 items per request, so if you want more than 100, you’ll need to paginate with the cursor returned in each response, which means writing a bit of code.
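Pagination with getAuthorFeed works through a cursor: each response carries a cursor value that you pass back to get the next page. Here is a minimal Python sketch of that loop. The function names and the max_posts parameter are my own, and total_engagement assumes that the engagement number in the SQL above is simply replies + reposts + likes + quotes, since the full query is not shown here.

```python
import json
import urllib.parse
import urllib.request

API = "https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed"


def fetch_page(actor, cursor=None, limit=100):
    """Fetch one page of an author feed from the public API (network call)."""
    params = {"actor": actor, "limit": limit}
    if cursor:
        params["cursor"] = cursor
    url = f"{API}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def fetch_author_feed(actor, max_posts=300, get_page=fetch_page):
    """Follow the cursor page by page until max_posts items or the feed ends."""
    items, cursor = [], None
    while len(items) < max_posts:
        page = get_page(actor, cursor)
        items.extend(page.get("feed", []))
        cursor = page.get("cursor")
        if not cursor or not page.get("feed"):
            break  # no further pages available
    return items[:max_posts]


def total_engagement(post):
    """Sum the four counters of a post view (assumed definition, see above)."""
    counters = ("replyCount", "repostCount", "likeCount", "quoteCount")
    return sum(post.get(name, 0) for name in counters)
```

Passing get_page as a parameter keeps the loop easy to try out with canned pages before hitting the real API. Each item in the result has a post key, so sorting by total_engagement(item["post"]) gives a ranking similar to the SQL version.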

Using Python for interacting with the AT Protocol

Data people have gathered around the hashtags #datasky and #databs. If we want to read these streams, we can use Python and the Python SDK. The SDK covers writing as well; publishing a post, for example, is as simple as:

import os

from atproto import Client, client_utils

def main():
    client = Client()
    profile = client.login(os.getenv('BSKY_USERNAME'), os.getenv('BSKY_PASSWORD'))
    print('Welcome,', profile.display_name)

    text = client_utils.TextBuilder().text('Hello World from ').link('Python SDK', 'https://atproto.blue')
    post = client.send_post(text)
    client.like(post.uri, post.cid)

if __name__ == '__main__':
    main()

It looks like this in real life - see the post here - try it, it’s fun 🙂

A Firehose or Live Stream of Posts

If you want all messages, you can subscribe to the stream with this Python code: firehose.py. It will stream everything and looks like this:
If you want a stream dedicated to the two data hashtags, #datasky and #databs, use hashtag_databs.py, which catches all posts sent with either hashtag, including my test post.
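The core of such a hashtag filter is small. The sketch below is not the code from hashtag_databs.py. It assumes events shaped like the JSON messages Jetstream emits (kind, commit.operation, commit.collection, commit.record) and that hashtags show up as app.bsky.richtext.facet#tag facets on the post record, with a plain-text fallback. The function names are hypothetical.

```python
import json
import re

WANTED = {"datasky", "databs"}
TAG_FACET = "app.bsky.richtext.facet#tag"


def hashtags(record):
    """Collect lowercase hashtags from a post record's facets, else from its text."""
    tags = set()
    for facet in record.get("facets", []):
        for feature in facet.get("features", []):
            tag = feature.get("tag")
            if feature.get("$type") == TAG_FACET and tag:
                tags.add(tag.lower())
    if not tags:
        # Fallback for posts without facets: scan the text for #words.
        tags = {t.lower() for t in re.findall(r"#(\w+)", record.get("text", ""))}
    return tags


def match_event(raw, wanted=WANTED):
    """Return (did, text, matched tags) for a new post with a wanted hashtag, else None."""
    event = json.loads(raw)
    commit = event.get("commit", {})
    if event.get("kind") != "commit" or commit.get("operation") != "create":
        return None
    if commit.get("collection") != "app.bsky.feed.post":
        return None
    record = commit.get("record", {})
    matched = hashtags(record) & set(wanted)
    if not matched:
        return None
    return event.get("did"), record.get("text", ""), matched
```

To use it, pass every text message received from a Jetstream WebSocket subscription to match_event and keep the results that are not None.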

Streaming and Uploading #databs to MotherDuck

I also created streaming_into_motherduckdb.py, which listens for both hashtags, writes the posts to Parquet files, and uploads them to a public DuckDB database hosted on MotherDuck. If you create a free account, you can attach my shared DuckDB database with ATTACH 'md:_share/bsky/c07e1ca0-6b51-4906-96cd-b310ec35e562' as md_bsky and query a couple of posts I uploaded as a test:

❯ duckdb
D ATTACH 'md:_share/bsky/c07e1ca0-6b51-4906-96cd-b310ec35e562' as md_bsky;
D from md_bsky.posts limit 5;
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         uri          β”‚         cid          β”‚        author        β”‚         text         β”‚      created_at      β”‚      indexed_at      β”‚ hashtag β”‚  langs  β”‚
β”‚       varchar        β”‚       varchar        β”‚       varchar        β”‚       varchar        β”‚       varchar        β”‚       varchar        β”‚ varchar β”‚ varchar β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ at://did:plc:6czr5…  β”‚ bafyreiddu2muv2yo5…  β”‚ bramz.bsky.social    β”‚ #databs, what Pyth…  β”‚ 2024-11-18T08:52:4…  β”‚ 2024-11-18T08:52:4…  β”‚ databs  β”‚ en      β”‚
β”‚ at://did:plc:edglm…  β”‚ bafyreiebsxxsgtzba…  β”‚ ssp.sh               β”‚ #databs test :)      β”‚ 2024-11-18T08:31:5…  β”‚ 2024-11-18T08:31:5…  β”‚ databs  β”‚ en      β”‚
β”‚ at://did:plc:jfda6…  β”‚ bafyreifizd4lxahgq…  β”‚ victorsothervector…  β”‚ (last thing before…  β”‚ 2024-11-18T07:48:1…  β”‚ 2024-11-18T07:48:1…  β”‚ databs  β”‚ en      β”‚
β”‚ at://did:plc:iyv5h…  β”‚ bafyreifieocd3grqb…  β”‚ rkv2401.bsky.social  β”‚ Does anyone know o…  β”‚ 2024-11-18T06:59:0…  β”‚ 2024-11-18T06:59:0…  β”‚ databs  β”‚ en      β”‚
β”‚ at://did:plc:je4jm…  β”‚ bafyreics4cctwgzw6…  β”‚ maninekkalapudi.io   β”‚ Entering the dark …  β”‚ 2024-11-18T03:51:5…  β”‚ 2024-11-18T03:51:5…  β”‚ databs  β”‚ en      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

You could do the same within MotherDuck’s platform and make use of the visualization features and the benefits of the collaborative notebook approach.

You can also use Jake’s great collection, where he shares the Jetstream data on Cloudflare R2 so that it can be queried openly with DuckDB:

❯ duckdb
D attach 'https://hive.buz.dev/bluesky/catalog' as bsky;
select count(*) from bsky.jetstream;

100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–
D select count(*) from bsky.jetstream;

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ count_star() β”‚
β”‚    int64     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚       500000 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

It also works in the browser - check it out in DuckDB Wasm:

Image by Jake

What are people building?

There are currently many collaboration efforts going on, and new things are shared hourly among the new, friendly Bluesky community. Many people try to help each other and build the best data tooling around Bluesky and ATProto. Here are the ones I came across lately (I’m sorry if I forgot anyone):

I hope we can work together collaboratively and build the best Bluesky tools for data people. If not us, then who? 😀

Also, there are many more tools around charting, browsing starter packs, etc. Currently, I collect and update them regularly on :


Full article published at MotherDuck.com - written as part of my services