Building an Async E-Commerce Web Scraper with Pydantic, Crawl4ai & Gemini

TLDR:

Learn how to build an E-commerce scraper using crawl4ai’s LLM-based extraction and Pydantic models. The scraper fetches both listing data (names, prices) and detailed product information (specs, reviews) asynchronously.

Try the full code in Google Colab


Ever wanted to analyze E-commerce product data but found traditional web scraping too complex? In this guide, I’ll show you how to build a reliable scraper using modern Python tools. We’ll use crawl4ai for intelligent extraction and Pydantic for clean data modeling.

Why Crawl4AI and Pydantic?

  • Crawl4AI: A robust library that simplifies web crawling and scraping by leveraging AI-based extraction strategies.
  • Pydantic: A Python library for data validation and settings management, ensuring the scraped data adheres to predefined schemas.

Why Scrape Tokopedia?

Tokopedia is one of Indonesia’s largest e-commerce platforms. I’m a native Indonesian and use the platform a lot, but I’m not an employee or affiliated with them :). You can apply the same approach to any e-commerce site you wish. If you’re a developer interested in e-commerce analytics, market research, or automated data gathering, scraping these listings can be quite useful.

What Makes This Approach Different?

Instead of wrestling with complex CSS selectors or XPath queries, we’re using crawl4ai’s LLM-based extraction. This means:

  • More resilient to website changes
  • Cleaner, structured data output
  • Less maintenance headache

Setting Up Your Environment

First, let’s install our required packages:

%pip install -U crawl4ai
%pip install nest_asyncio
%pip install pydantic


We’ll also need nest_asyncio to run async code inside notebooks, along with the crawl4ai classes and standard-library modules used throughout the scraper:

import os
import json
import asyncio
import nest_asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

nest_asyncio.apply()

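The extraction strategies below authenticate with Gemini through an API key read from the environment. One way to set it in a notebook (a minimal sketch; adjust to however you manage secrets) is:

import os
from getpass import getpass

# Prompt for the key so it never lands in the notebook or version control
os.environ["GEMINI_API_KEY"] = getpass("Enter your Gemini API key: ")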

Defining Our Data Models

We’ll use Pydantic to define exactly what data we want to extract. Here are our two main models:

from pydantic import BaseModel, Field
from typing import List, Optional

class TokopediaListingItem(BaseModel):
    product_name: str = Field(..., description="Name of the product in the listing.")
    product_url: str = Field(..., description="URL link to the product detail page.")
    price: Optional[str] = Field(None, description="Price displayed in the listing.")
    store_name: Optional[str] = Field(None, description="Store name from the listing.")
    rating: Optional[str] = Field(None, description="Rating displayed in the listing.")
    image_url: Optional[str] = Field(None, description="Primary image from the listing.")

class TokopediaProductDetail(BaseModel):
    product_name: str = Field(..., description="Name of the product on the detail page.")
    all_images: List[str] = Field(default_factory=list, description="List of all product image URLs.")
    specs: Optional[str] = Field(None, description="Technical specifications or short info.")
    description: Optional[str] = Field(None, description="Long product description.")
    variants: List[str] = Field(default_factory=list, description="List of variants or color options.")
    satisfaction_percentage: Optional[str] = Field(None, description="Percentage of satisfied customers.")
    total_ratings: Optional[str] = Field(None, description="Number of ratings.")
    total_reviews: Optional[str] = Field(None, description="Number of reviews.")
    stock: Optional[str] = Field(None, description="Stock availability.")


These models act as a contract for what data we expect to extract. They also provide automatic validation and clear documentation.
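Because Pydantic validates on instantiation, malformed extraction results fail loudly instead of slipping through silently. A quick illustration (the values here are made up for demonstration):

from pydantic import ValidationError

# Hypothetical example values, not real scraped data
item = TokopediaListingItem(
    product_name="Wireless Mouse XYZ",
    product_url="https://www.tokopedia.com/some-store/wireless-mouse-xyz",
    price="Rp150.000",
)
print(item.price)  # "Rp150.000"; optional fields we omitted default to None

try:
    TokopediaListingItem(product_url="https://example.com")  # missing product_name
except ValidationError as err:
    print(err)  # reports which required field is missing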

The Scraping Process

Our scraper works in two stages:

1. Crawling Product Listings

First, we fetch search results pages:

async def crawl_tokopedia_listings(query: str = "mouse-wireless", max_pages: int = 1):
    listing_strategy = LLMExtractionStrategy(
        provider="gemini/gemini-1.5-pro",
        api_token=os.getenv("GEMINI_API_KEY"),
        schema=TokopediaListingItem.model_json_schema(),
        instruction=(
            "Extract structured data for each product in the listing. "
            "Each product should have: product_name, product_url, price,"
            "store_name, rating (scale 1-5), image_url."
        ),
        verbose=True,
    )

    all_results = []

    async with AsyncWebCrawler(verbose=True) as crawler:
        for page in range(1, max_pages + 1):
            url = f"https://www.tokopedia.com/find/{query}?page={page}"
            result = await crawler.arun(
                url=url,
                extraction_strategy=listing_strategy,
                word_count_threshold=1,
                cache_mode=CacheMode.DISABLED,
            )
            data = json.loads(result.extracted_content)
            all_results.extend(data)

    return all_results


2. Fetching Product Details

Then, for each product URL we found, we fetch its detailed information:

async def crawl_tokopedia_detail(product_url: str):
    detail_strategy = LLMExtractionStrategy(
        provider="gemini/gemini-1.5-pro",
        api_token=os.getenv("GEMINI_API_KEY"),
        schema=TokopediaProductDetail.model_json_schema(),
        instruction=(
            "Extract fields like product_name, all_images (list), specs,"
            "description, variants (list), satisfaction_percentage,"
            "total_ratings, total_reviews, stock availability."
        ),
        verbose=False,
    )

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=product_url,
            extraction_strategy=detail_strategy,
            word_count_threshold=1,
            cache_mode=CacheMode.DISABLED,
        )

        parsed_data = json.loads(result.extracted_content)
        # The LLM strategy may return a list of extracted blocks; take the first one if so
        if isinstance(parsed_data, list):
            parsed_data = parsed_data[0]
        return TokopediaProductDetail(**parsed_data)


Putting It All Together

Finally, we combine both stages into a single function:

async def run_full_scrape(query="mouse-wireless", max_pages=2, limit=15):
    listings = await crawl_tokopedia_listings(query=query, max_pages=max_pages)
    listings_subset = listings[:limit]

    all_data = []
    for i, item in enumerate(listings_subset, start=1):
        detail_data = await crawl_tokopedia_detail(item["product_url"])
        combined_data = {
            "listing_data": item,
            "detail_data": detail_data.dict(),
        }
        all_data.append(combined_data)
        print(f"[Detail] Scraped {i}/{len(listings_subset)}")

    return all_data


Running the Scraper

Here’s how to use it:

# Scrape the first 5 products from page 1
results = await run_full_scrape("mouse-wireless", max_pages=1, limit=5)

# Print the results nicely formatted
for result in results:
    print(json.dumps(result, indent=4))


Pro Tips

  1. Rate Limiting: Be respectful of Tokopedia’s servers. Add delays between requests if scraping many pages (see the sketch after this list).

  2. Caching: Enable crawl4ai’s cache during development:

cache_mode=CacheMode.ENABLED


  3. Error Handling: The code includes only basic error handling, so you may want to add retries for production use (the sketch below includes a simple retry wrapper).

  4. API Keys: Store your Gemini API key in environment variables, not in the code.
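As a rough illustration of the rate-limiting and retry tips, here’s a minimal sketch of a delay-and-retry wrapper around the detail crawl. The delay value and retry count are arbitrary choices of mine, not recommendations from crawl4ai:

import asyncio

async def crawl_detail_with_retry(product_url: str, retries: int = 3, delay_seconds: float = 2.0):
    """Fetch product detail with a polite delay and simple retry logic."""
    for attempt in range(1, retries + 1):
        try:
            # Wait before each request to avoid hammering the server
            await asyncio.sleep(delay_seconds)
            return await crawl_tokopedia_detail(product_url)
        except Exception as exc:
            print(f"Attempt {attempt}/{retries} failed for {product_url}: {exc}")
            if attempt == retries:
                raise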

What’s Next?

You could extend this scraper to:

  • Save data to a database (see the sketch after this list)
  • Track price changes over time
  • Analyze product trends
  • Compare prices across stores
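For instance, persisting the combined results to SQLite could look roughly like this (a minimal sketch assuming the run_full_scrape output shape shown above; the table and column names are my own choices):

import json
import sqlite3

def save_results(results, db_path="tokopedia_products.db"):
    """Store each scraped product as a row of JSON blobs in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(product_url TEXT PRIMARY KEY, listing_json TEXT, detail_json TEXT)"
    )
    for row in results:
        conn.execute(
            "INSERT OR REPLACE INTO products VALUES (?, ?, ?)",
            (
                row["listing_data"]["product_url"],
                json.dumps(row["listing_data"]),
                json.dumps(row["detail_data"]),
            ),
        )
    conn.commit()
    conn.close()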

Wrapping up

Using crawl4ai with LLM-based extraction makes web scraping much more maintainable than traditional methods. The combination with Pydantic ensures your data is well-structured and validated.

Remember to always check a website’s robots.txt and terms of service before scraping. Happy coding!


Important links:

Crawl4AI

Pydantic


Note: The complete code is available in the Colab notebook. Feel free to try it out and adapt it for your needs.
