How to scrape Bluesky with Python

Bluesky is an emerging social network developed by former members of the Twitter (now X) development team. The platform has been showing significant growth recently, reaching 140.3 million visits according to SimilarWeb. Like X, Bluesky generates a vast amount of data that can be used for analysis. In this article, we’ll explore how to collect this data using Crawlee for Python.

Note: One of our community members wrote this blog as a contribution to the Crawlee Blog. If you’d like to contribute articles like these, please reach out to us on our Discord channel.

Key steps we will cover:

  1. Project setup
  2. Development of the Bluesky crawler in Python
  3. Creating an Apify Actor for the Bluesky crawler
  4. Conclusion and repository access

Prerequisites

  • Basic understanding of web scraping concepts
  • Python 3.9 or higher
  • UV version 0.6.0 or higher
  • Crawlee for Python v0.6.5 or higher
  • Bluesky account for API access

Project setup

In this project, we’ll use UV for package management, along with a standalone Python version installed through UV. UV is a fast, modern package manager written in Rust.

  1. If you don’t have UV installed yet, follow the guide or use this command:

    curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Install standalone Python using UV:

    uv python install 3.13
  3. Create a new project and install Crawlee for Python:

    uv init bluesky-crawlee --package
    cd bluesky-crawlee
    uv add crawlee

We’ve created a new isolated Python project with all the necessary dependencies for Crawlee.
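If you’re curious what the --package flag generates, recent versions of UV produce roughly this layout (details may vary by version):

bluesky-crawlee/
├── pyproject.toml        # project metadata, dependencies, and a [project.scripts] entry
├── README.md
├── .python-version
└── src/
    └── bluesky_crawlee/
        └── __init__.py   # stub main() registered as a console script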

Development of the Bluesky crawler in Python

Note: Before going ahead with the project, I’d like to ask you to star Crawlee for Python on GitHub; it helps us spread the word to fellow scraper developers.

1. Identifying the data source

When accessing the search page, you’ll see data displayed, but be aware of a key limitation: the site only lets you view the first page of results.

Fortunately, Bluesky provides a well-documented API that is accessible to any registered user without additional permissions. This is what we’ll use for data collection.

2. Creating a session for API interaction

For secure API interaction, you need to create a dedicated app password instead of using your main account password.

Go to Settings -> Privacy and Security -> App Passwords and click Add App Password.
Important: Save the generated password, as it won’t be visible after creation.

Next, create environment variables to store your credentials:

  • Your application password
  • Your user identifier (found in your profile and Bluesky URL, for example: mantisus.bsky.social)
export BLUESKY_APP_PASSWORD=your_app_password
export BLUESKY_IDENTIFIER=your_identifier


Using the createSession and deleteSession endpoints together with httpx, we can create a session for API interaction.

Let’s create a class with the necessary methods:

import asyncio
import json
import os
import traceback

import httpx
from yarl import URL

from crawlee import ConcurrencySettings, Request
from crawlee.configuration import Configuration
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.storages import Dataset

# Environment variables for authentication
# BLUESKY_APP_PASSWORD: App-specific password generated from Bluesky settings
# BLUESKY_IDENTIFIER: Your Bluesky handle (e.g., username.bsky.social)
BLUESKY_APP_PASSWORD = os.getenv('BLUESKY_APP_PASSWORD')
BLUESKY_IDENTIFIER = os.getenv('BLUESKY_IDENTIFIER')


class BlueskyApiScraper:
    """A scraper class for extracting data from the Bluesky social network using their official API.

    This scraper manages authentication, concurrent requests, and data collection for both posts
    and user profiles. It uses separate datasets for storing post and user information.
    """

    def __init__(self) -> None:
        self._crawler: HttpCrawler | None = None

        self._users: Dataset | None = None
        self._posts: Dataset | None = None

        # Variables for storing session data
        self._service_endpoint: str | None = None
        self._user_did: str | None = None
        self._access_token: str | None = None
        self._refresh_token: str | None = None
        self._handle: str | None = None

    def create_session(self) -> None:
        """Create credentials for the session."""
        url = 'https://bsky.social/xrpc/com.atproto.server.createSession'
        headers = {
            'Content-Type': 'application/json',
        }
        data = {'identifier': BLUESKY_IDENTIFIER, 'password': BLUESKY_APP_PASSWORD}

        response = httpx.post(url, headers=headers, json=data)
        response.raise_for_status()

        data = response.json()

        self._service_endpoint = data['didDoc']['service'][0]['serviceEndpoint']
        self._user_did = data['didDoc']['id']
        self._access_token = data['accessJwt']
        self._refresh_token = data['refreshJwt']
        self._handle = data['handle']

    def delete_session(self) -> None:
        """Delete the current session."""
        url = f'{self._service_endpoint}/xrpc/com.atproto.server.deleteSession'
        headers = {'Content-Type': 'application/json', 'authorization': f'Bearer {self._refresh_token}'}

        response = httpx.post(url, headers=headers)
        response.raise_for_status()


The session expires after 2 hours, so if you expect your crawler to run longer, you should also add a method that refreshes the tokens.
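As a starting point, here’s a minimal sketch of such a method, using the refreshSession endpoint in the same style as create_session and delete_session above. Note that refreshSession expects the refresh token, not the access token:

def refresh_session(self) -> None:
    """Exchange the refresh token for a new token pair."""
    url = f'{self._service_endpoint}/xrpc/com.atproto.server.refreshSession'
    headers = {'Content-Type': 'application/json', 'authorization': f'Bearer {self._refresh_token}'}

    response = httpx.post(url, headers=headers)
    response.raise_for_status()

    data = response.json()

    # The endpoint returns a fresh access/refresh token pair
    self._access_token = data['accessJwt']
    self._refresh_token = data['refreshJwt']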

3. Configuring Crawlee for Python for data collection

Since we’ll be using the official API, we don’t need to worry about being blocked by Bluesky. However, we should be careful with the number of requests to avoid overloading Bluesky’s servers, so we’ll configure ConcurrencySettings accordingly. We’ll also configure HttpxHttpClient to use custom headers carrying the current session’s Authorization token.

We’ll use two endpoints for data collection: searchPosts for posts and getProfile for user data. If you plan to scale the crawler, you can use getProfiles to fetch user data in batches, but in that case you’ll need to implement deduplication logic yourself, as sketched below. As long as each request URL is unique, Crawlee for Python handles deduplication for you.
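For illustration, deduplication with batched getProfiles requests could look like the following sketch. The seen_dids set and the helper function are hypothetical, not part of the final crawler; getProfiles accepts up to 25 actors per call:

from yarl import URL

seen_dids: set[str] = set()  # DIDs already requested (hypothetical helper state)


def build_profiles_requests(service_endpoint: str, author_dids: list[str]) -> list[str]:
    """Build batched getProfiles URLs, skipping DIDs we've already seen."""
    new_dids = [did for did in author_dids if did not in seen_dids]
    seen_dids.update(new_dids)

    base = URL(f'{service_endpoint}/xrpc/app.bsky.actor.getProfiles')
    # The endpoint accepts up to 25 actors per request
    return [
        str(base.with_query([('actors', did) for did in new_dids[i : i + 25]]))
        for i in range(0, len(new_dids), 25)
    ]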

When collecting data, I’d like to keep user and post data separate, so we’ll use different Dataset instances for storage.

async def init_crawler(self) -> None:
    """Initialize the crawler."""
    if not self._user_did:
        raise ValueError('Session not created.')

    # Initialize the datasets, purging previous data if they are not empty
    self._users = await Dataset.open(name='users', configuration=Configuration(purge_on_start=True))
    self._posts = await Dataset.open(name='posts', configuration=Configuration(purge_on_start=True))

    # Initialize the crawler
    self._crawler = HttpCrawler(
        max_requests_per_crawl=100,
        http_client=HttpxHttpClient(
            # Set headers for API requests
            headers={
                'Content-Type': 'application/json',
                'Authorization': f'Bearer {self._access_token}',
                'Connection': 'Keep-Alive',
                'accept-encoding': 'gzip, deflate, br, zstd',
            }
        ),
        # Configure concurrency of crawling requests
        concurrency_settings=ConcurrencySettings(
            min_concurrency=10,
            desired_concurrency=10,
            max_concurrency=30,
            max_tasks_per_minute=200,
        ),
    )

    self._crawler.router.default_handler(self._search_handler)  # Handler for search requests
    self._crawler.router.handler(label='user')(self._user_handler)  # Handler for user requests


4. Implementing handlers for data collection

Now we can implement the handler for searching posts. We’ll save the retrieved posts in self._posts and create requests for user data, placing them in the crawler’s queue. We also need to handle pagination by forming the link to the next search page.

async def _search_handler(self, context: HttpCrawlingContext) -> None:
    context.log.info(f'Processing search {context.request.url} ...')

    data = json.loads(context.http_response.read())

    if 'posts' not in data:
        context.log.warning(f'No posts found in response: {context.request.url}')
        return

    user_requests = {}
    posts = []

    profile_url = URL(f'{self._service_endpoint}/xrpc/app.bsky.actor.getProfile')

    for post in data['posts']:
        # Add a user request if not already added in the current context
        if post['author']['did'] not in user_requests:
            user_requests[post['author']['did']] = Request.from_url(
                url=str(profile_url.with_query(actor=post['author']['did'])),
                user_data={'label': 'user'},
            )

        posts.append(
            {
                'uri': post['uri'],
                'cid': post['cid'],
                'author_did': post['author']['did'],
                'created': post['record']['createdAt'],
                'indexed': post['indexedAt'],
                'reply_count': post['replyCount'],
                'repost_count': post['repostCount'],
                'like_count': post['likeCount'],
                'quote_count': post['quoteCount'],
                'text': post['record']['text'],
                'langs': '; '.join(post['record'].get('langs', [])),
                'reply_parent': post['record'].get('reply', {}).get('parent', {}).get('uri'),
                'reply_root': post['record'].get('reply', {}).get('root', {}).get('uri'),
            }
        )

    await self._posts.push_data(posts)  # Push a batch of posts to the dataset
    await context.add_requests(list(user_requests.values()))

    if cursor := data.get('cursor'):
        # Use yarl to update the query string with the next-page cursor
        next_url = URL(context.request.url).update_query({'cursor': cursor})
        await context.add_requests([str(next_url)])


When receiving user data, we’ll store it in the corresponding Dataset, self._users:

async def _user_handler(self, context: HttpCrawlingContext) -> None:
    context.log.info(f'Processing user {context.request.url} ...')

    data = json.loads(context.http_response.read())

    user_item = {
        'did': data['did'],
        'created': data['createdAt'],
        'avatar': data.get('avatar'),
        'description': data.get('description'),
        'display_name': data.get('displayName'),
        'handle': data['handle'],
        'indexed': data.get('indexedAt'),
        'posts_count': data['postsCount'],
        'followers_count': data['followersCount'],
        'follows_count': data['followsCount'],
    }

    await self._users.push_data(user_item)


5. Saving data to files

For saving results, we’ll use Dataset’s write_to_json method.

async def save_data(self) -> None:
    """Save the data."""
    if not self._users or not self._posts:
        raise ValueError('Datasets not initialized.')

    with open('users.json', 'w') as f:
        await self._users.write_to_json(f, indent=4)

    with open('posts.json', 'w') as f:
        await self._posts.write_to_json(f, indent=4)


6. Running the crawler

We have everything needed to complete the crawler. We just need a method to execute the crawling; let’s call it crawl:

async def crawl(self, queries: list[str]) -> None:
    """Crawl the given search queries."""
    if not self._crawler:
        raise ValueError('Crawler not initialized.')

    search_url = URL(f'{self._service_endpoint}/xrpc/app.bsky.feed.searchPosts')

    await self._crawler.run([str(search_url.with_query(q=query)) for query in queries])

Enter fullscreen mode Exit fullscreen mode

Let’s finalize the code:

```python
async def run() -> None:
    """Main execution function that orchestrates the crawling process.

    Creates a scraper instance, manages the session, and handles the complete
    crawling lifecycle including proper cleanup on completion or error.
    """
    scraper = BlueskyApiScraper()
    scraper.create_session()
    try:
        await scraper.init_crawler()
        await scraper.crawl(['python', 'apify', 'crawlee'])
        await scraper.save_data()
    except Exception:
        traceback.print_exc()
    finally:
        scraper.delete_session()


def main() -> None:
    """Entry point for the crawler application."""
    asyncio.run(run())
```

If you check your `pyproject.toml`, you'll see that UV created an entry point, `bluesky-crawlee = "bluesky_crawlee:main"`, so we can run our crawler simply by executing:

```bash
uv run bluesky-crawlee
```

Let’s look at sample results:

Posts

Users
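In posts mode, each record carries the fields extracted in the search handler. An illustrative record (all values below are made up for demonstration) looks like this:

```json
{
    "uri": "at://did:plc:abc123/app.bsky.feed.post/3xyz789",
    "cid": "bafyreib2rxk3rh6kzwq...",
    "author_did": "did:plc:abc123",
    "created": "2025-01-15T10:30:00.000Z",
    "indexed": "2025-01-15T10:30:05.000Z",
    "reply_count": 2,
    "repost_count": 5,
    "like_count": 17,
    "quote_count": 1,
    "text": "Web scraping with Crawlee for Python!",
    "langs": "en",
    "reply_parent": null,
    "reply_root": null
}
```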

Create Apify Actor for Bluesky crawler

We already have a fully functional implementation for local execution. Let's explore how to adapt it for running on the Apify Platform and transform it into an Apify Actor.

An Actor is a simple and efficient way to deploy your code to cloud infrastructure on the Apify Platform. You can interact with an Actor flexibly, schedule regular runs to monitor data, or integrate it with other tools to build data processing flows.

First, create an .actor directory with platform configuration files:

```bash
mkdir .actor && touch .actor/{actor.json,Dockerfile,input_schema.json}
```

Then add the Apify SDK for Python as a project dependency:

```bash
uv add apify
```

Configure Dockerfile

We’ll use the official Apify Docker image along with recommended UV practices for Docker:

```dockerfile
FROM apify/actor-python:3.13

ENV PATH='/app/.venv/bin:$PATH'

WORKDIR /app

COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

COPY pyproject.toml uv.lock ./

RUN uv sync --frozen --no-install-project --no-editable -q --no-dev

COPY . .

RUN uv sync --frozen --no-editable -q --no-dev

CMD ["bluesky-crawlee"]
```
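Note the two `uv sync` steps: the first installs only the locked dependencies (`--no-install-project`), so Docker can cache them as a separate layer, while the second, after the source is copied in, installs the project itself.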

Here, `bluesky-crawlee` refers to the entry point specified in `pyproject.toml`.
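For reference, the entry point lives in the `[project.scripts]` section of `pyproject.toml` and should look roughly like this:

```toml
[project.scripts]
bluesky-crawlee = "bluesky_crawlee:main"
```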

Define project metadata in actor.json

The actor.json file contains project metadata for the Apify Platform. Follow the documentation for proper configuration:

```json
{
    "actorSpecification": 1,
    "name": "Bluesky-Crawlee",
    "title": "Bluesky - Crawlee",
    "minMemoryMbytes": 128,
    "maxMemoryMbytes": 2048,
    "description": "Scrape data products from bluesky",
    "version": "0.1",
    "meta": {
        "templateId": "bluesky-crawlee"
    },
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile"
}
```

Define Actor input parameters

Our crawler requires several external parameters. Let’s define them:

  • identifier: User’s Bluesky identifier (encrypted for security)
  • appPassword: Bluesky app password (encrypted)
  • queries: List of search queries for crawling
  • maxRequestsPerCrawl: Optional limit for testing
  • mode: Choose between collecting posts or data about the users who post on specific topics

Configure the input schema following the specification:

```json
{
    "title": "Bluesky - Crawlee",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "identifier": {
            "title": "Bluesky identifier",
            "description": "Bluesky identifier for API login",
            "type": "string",
            "editor": "textfield",
            "isSecret": true
        },
        "appPassword": {
            "title": "Bluesky app password",
            "description": "Bluesky app password for API",
            "type": "string",
            "editor": "textfield",
            "isSecret": true
        },
        "maxRequestsPerCrawl": {
            "title": "Max requests per crawl",
            "description": "Maximum number of requests for crawling",
            "type": "integer"
        },
        "queries": {
            "title": "Queries",
            "type": "array",
            "description": "Search queries",
            "editor": "stringList",
            "prefill": ["apify"],
            "example": ["apify", "crawlee"]
        },
        "mode": {
            "title": "Mode",
            "type": "string",
            "description": "Collect posts or users who post on a topic",
            "enum": ["posts", "users"],
            "default": "posts"
        }
    },
    "required": ["identifier", "appPassword", "queries", "mode"]
}
```

Update project code

Remove environment variables and parameterize the code according to the Actor input parameters. Replace named datasets with the default dataset.
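To illustrate the dataset change: the local version pushed results into named datasets that `save_data` exported to JSON, while the Actor version writes straight to the run's default dataset. A minimal sketch of the contrast (assuming the named datasets were opened with Crawlee's `Dataset.open`):

```python
from crawlee.storages import Dataset

# Local version: explicit named datasets, exported manually in save_data().
posts_dataset = await Dataset.open(name='posts')
await posts_dataset.push_data(posts)

# Actor version: the default dataset, written via the crawling context.
# The Apify Platform persists it and exposes it in the run's storage.
await context.push_data(posts)
```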

Add Actor logging:

```python
# __init__.py
import logging

from apify.log import ActorLogFormatter

handler = logging.StreamHandler()
handler.setFormatter(ActorLogFormatter())

apify_client_logger = logging.getLogger('apify_client')
apify_client_logger.setLevel(logging.INFO)
apify_client_logger.addHandler(handler)

apify_logger = logging.getLogger('apify')
apify_logger.setLevel(logging.DEBUG)
apify_logger.addHandler(handler)
```

Update imports and entry point code:

```python
import asyncio
import json
import traceback
from dataclasses import dataclass

import httpx
from apify import Actor
from yarl import URL

from crawlee import ConcurrencySettings, Request
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.http_clients import HttpxHttpClient


@dataclass
class ActorInput:
    """Actor input schema."""

    identifier: str
    app_password: str
    queries: list[str]
    mode: str
    max_requests_per_crawl: int | None = None


async def run() -> None:
    """Main execution function that orchestrates the crawling process.

    Creates a scraper instance, manages the session, and handles the complete
    crawling lifecycle including proper cleanup on completion or error.
    """
    async with Actor:
        raw_input = await Actor.get_input()
        actor_input = ActorInput(
            identifier=raw_input.get('identifier', ''),
            app_password=raw_input.get('appPassword', ''),
            queries=raw_input.get('queries', []),
            mode=raw_input.get('mode', 'posts'),
            max_requests_per_crawl=raw_input.get('maxRequestsPerCrawl'),
        )
        scraper = BlueskyApiScraper(actor_input.mode, actor_input.max_requests_per_crawl)
        try:
            scraper.create_session(actor_input.identifier, actor_input.app_password)

            await scraper.init_crawler()
            await scraper.crawl(actor_input.queries)
        except httpx.HTTPError as e:
            Actor.log.error(f'HTTP error occurred: {e}')
            raise
        except Exception as e:
            Actor.log.error(f'Unexpected error: {e}')
            traceback.print_exc()
        finally:
            scraper.delete_session()


def main() -> None:
    """Entry point for the scraper application."""
    asyncio.run(run())
```

Update methods with Actor input parameters:

```python
class BlueskyApiScraper:
    """A scraper class for extracting data from the Bluesky social network using their official API.

    This scraper manages authentication, concurrent requests, and data collection
    for both posts and user profiles.
    """

    def __init__(self, mode: str, max_request: int | None) -> None:
        self._crawler: HttpCrawler | None = None

        self.mode = mode
        self.max_request = max_request

        # Variables for storing session data
        self._service_endpoint: str | None = None
        self._user_did: str | None = None
        self._access_token: str | None = None
        self._refresh_token: str | None = None
        self._handle: str | None = None

    def create_session(self, identifier: str, password: str) -> None:
        """Create credentials for the session."""
        url = 'https://bsky.social/xrpc/com.atproto.server.createSession'
        headers = {
            'Content-Type': 'application/json',
        }
        data = {'identifier': identifier, 'password': password}

        response = httpx.post(url, headers=headers, json=data)
        response.raise_for_status()

        data = response.json()

        self._service_endpoint = data['didDoc']['service'][0]['serviceEndpoint']
        self._user_did = data['didDoc']['id']
        self._access_token = data['accessJwt']
        self._refresh_token = data['refreshJwt']
        self._handle = data['handle']
```
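One design note: `create_session` deliberately uses a plain synchronous `httpx.post` call. It runs exactly once, before the crawler starts, so there is nothing to gain from an async client here.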

Implement mode-aware data collection logic:

```python
async def _search_handler(self, context: HttpCrawlingContext) -> None:
    """Handle search requests based on mode."""
    context.log.info(f'Processing search {context.request.url} ...')

    data = json.loads(context.http_response.read())

    if 'posts' not in data:
        context.log.warning(f'No posts found in response: {context.request.url}')
        return

    user_requests = {}
    posts = []

    profile_url = URL(f'{self._service_endpoint}/xrpc/app.bsky.actor.getProfile')

    for post in data['posts']:
        if self.mode == 'users' and post['author']['did'] not in user_requests:
            user_requests[post['author']['did']] = Request.from_url(
                url=str(profile_url.with_query(actor=post['author']['did'])),
                user_data={'label': 'user'},
            )
        elif self.mode == 'posts':
            posts.append(
                {
                    'uri': post['uri'],
                    'cid': post['cid'],
                    'author_did': post['author']['did'],
                    'created': post['record']['createdAt'],
                    'indexed': post['indexedAt'],
                    'reply_count': post['replyCount'],
                    'repost_count': post['repostCount'],
                    'like_count': post['likeCount'],
                    'quote_count': post['quoteCount'],
                    'text': post['record']['text'],
                    'langs': '; '.join(post['record'].get('langs', [])),
                    'reply_parent': post['record'].get('reply', {}).get('parent', {}).get('uri'),
                    'reply_root': post['record'].get('reply', {}).get('root', {}).get('uri'),
                }
            )

    if self.mode == 'posts':
        await context.push_data(posts)
    else:
        await context.add_requests(list(user_requests.values()))

    if cursor := data.get('cursor'):
        next_url = URL(context.request.url).update_query({'cursor': cursor})
        await context.add_requests([str(next_url)])
```

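If the cursor logic at the end of the handler looks terse, here's how yarl's update_query behaves in isolation. This is a standalone sketch with illustrative values, not part of the crawler itself:

from yarl import URL

# Illustrative: a search URL as the crawler might build it.
url = URL('https://bsky.social/xrpc/app.bsky.feed.searchPosts').with_query(q='python', limit=100)
print(url)  # .../app.bsky.feed.searchPosts?q=python&limit=100

# update_query adds or replaces the cursor without touching other params,
# so each follow-up request continues from where the previous page ended.
next_url = url.update_query({'cursor': '100'})
print(next_url)  # .../app.bsky.feed.searchPosts?q=python&limit=100&cursor=100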

Update the user handler for the default dataset:

async def _user_handler(self, context: HttpCrawlingContext) -> None:
    """Handle user profile requests."""
    context.log.info(f'Processing user {context.request.url} ...')

    data = json.loads(context.http_response.read())

    # Optional profile fields use .get() so a missing key doesn't fail the run.
    user_item = {
        'did': data['did'],
        'created': data['createdAt'],
        'avatar': data.get('avatar'),
        'description': data.get('description'),
        'display_name': data.get('displayName'),
        'handle': data['handle'],
        'indexed': data.get('indexedAt'),
        'posts_count': data['postsCount'],
        'followers_count': data['followersCount'],
        'follows_count': data['followsCount'],
    }

    await context.push_data(user_item)
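Everything pushed with push_data lands in the default dataset. As a quick sanity check during local development, you can dump that dataset to a file after the crawl. A minimal sketch, assuming your crawler instance is named crawler and your Crawlee version provides the export_data helper:

# After crawler.run([...]) completes, export the default dataset.
# 'users.csv' is an illustrative file name; a .json path works the same way.
await crawler.export_data('users.csv')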


Deploy

Use the official Apify CLI to upload your code:

Authenticate using your API token from Apify Console:

apify login


Choose “Enter API token manually” and paste your token.

Push the project to the platform:

apify push


Now you can configure runs on the Apify platform, either from the Console UI or programmatically; a sketch of the programmatic route follows the checklist below.

Let's perform a test run:

1. Fill in the input parameters.
2. Check that logging works correctly.
3. View results in the dataset.
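For the programmatic route, the apify-client package can start a run and read back its dataset. A minimal sketch — the Actor ID and the input keys are hypothetical placeholders, so match them to your own Actor's actual ID and input schema:

from apify_client import ApifyClient

# Token from Apify Console -> Settings -> Integrations.
client = ApifyClient(token='YOUR_APIFY_TOKEN')

# 'your-username/bluesky-crawlee' and the run_input keys are illustrative;
# replace them with your Actor's real ID and input fields.
run = client.actor('your-username/bluesky-crawlee').call(
    run_input={'mode': 'posts', 'queries': ['apify']},
)

# Iterate over the items the run pushed to its default dataset.
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    print(item)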

If you want to make your Actor public and available to other users, and potentially earn income from it, follow this publishing guide for Apify Store.

Conclusion and repository access

We’ve created an efficient crawler for Bluesky using the official API. If you want to go deeper into regular data extraction from Bluesky, I recommend exploring custom feed generation – I think it opens up some interesting possibilities.

And if you need to quickly create a crawler that can retrieve data for various queries, you now have everything you need.

You can find the complete code in the repository.

If you enjoyed this blog, feel free to support Crawlee for Python by starring the repository or joining the maintainer team.

Have questions or want to discuss implementation details? Join our Discord – our community of 10,000+ developers is there to help.
