Building an Async E-Commerce Web Scraper with Pydantic, Crawl4ai & Gemini

TLDR:

Learn how to build an E-commerce scraper using crawl4ai’s LLM-based extraction and Pydantic models. The scraper fetches both listing data (names, prices) and detailed product information (specs, reviews) asynchronously.

Try the full code in Google Colab


Ever wanted to analyze E-commerce product data but found traditional web scraping too complex? In this guide, I’ll show you how to build a reliable scraper using modern Python tools. We’ll use crawl4ai for intelligent extraction and Pydantic for clean data modeling.

Why Crawl4AI and Pydantic?

  • Crawl4AI: A robust library that simplifies web crawling and scraping by leveraging AI-based extraction strategies.
  • Pydantic: A Python library for data validation and settings management, ensuring the scraped data adheres to predefined schemas.

Why Scrape Tokopedia?

Tokopedia is one of Indonesia’s largest e-commerce platforms. I’m a native Indonesian and use the platform a lot, but I’m not an employee or affiliated with them :). You can apply the same approach to any e-commerce site you wish. If you’re a developer interested in e-commerce analytics, market research, or automated data gathering, scraping these listings can be quite useful.

What Makes This Approach Different?

Instead of wrestling with complex CSS selectors or XPath queries, we’re using crawl4ai’s LLM-based extraction. This means:

  • More resilient to website changes
  • Cleaner, structured data output
  • Less maintenance headache

Setting Up Your Environment

First, let’s install our required packages:

%pip install -U crawl4ai
%pip install nest_asyncio
%pip install pydantic


We’ll also need nest_asyncio to run async code inside notebooks, along with the crawl4ai classes and standard-library modules used throughout the scraper:

import os
import json
import asyncio
import nest_asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

nest_asyncio.apply()

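The extraction strategies below authenticate with Gemini through an API key read from the environment. One way to set it in a notebook (a minimal sketch; adjust to however you manage secrets) is:

import os
from getpass import getpass

# Prompt for the key so it never lands in the notebook or version control
os.environ["GEMINI_API_KEY"] = getpass("Enter your Gemini API key: ")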

Defining Our Data Models

We’ll use Pydantic to define exactly what data we want to extract. Here are our two main models:

from pydantic import BaseModel, Field
from typing import List, Optional

class TokopediaListingItem(BaseModel):
    product_name: str = Field(..., description="Name of the product in the listing.")
    product_url: str = Field(..., description="URL link to the product detail page.")
    price: Optional[str] = Field(None, description="Price displayed in the listing.")
    store_name: Optional[str] = Field(None, description="Store name from the listing.")
    rating: Optional[str] = Field(None, description="Rating displayed in the listing.")
    image_url: Optional[str] = Field(None, description="Primary image from the listing.")

class TokopediaProductDetail(BaseModel):
    product_name: str = Field(..., description="Name of the product on the detail page.")
    all_images: List[str] = Field(default_factory=list, description="List of all product image URLs.")
    specs: Optional[str] = Field(None, description="Technical specifications or short info.")
    description: Optional[str] = Field(None, description="Long product description.")
    variants: List[str] = Field(default_factory=list, description="List of variants or color options.")
    satisfaction_percentage: Optional[str] = Field(None, description="Percentage of satisfied customers.")
    total_ratings: Optional[str] = Field(None, description="Number of ratings.")
    total_reviews: Optional[str] = Field(None, description="Number of reviews.")
    stock: Optional[str] = Field(None, description="Stock availability.")


These models act as a contract for what data we expect to extract. They also provide automatic validation and clear documentation.
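Because Pydantic validates on instantiation, malformed extraction results fail loudly instead of slipping through silently. A quick illustration (the values here are made up for demonstration):

from pydantic import ValidationError

# Hypothetical example values, not real scraped data
item = TokopediaListingItem(
    product_name="Wireless Mouse XYZ",
    product_url="https://www.tokopedia.com/some-store/wireless-mouse-xyz",
    price="Rp150.000",
)
print(item.price)  # "Rp150.000"; optional fields we omitted default to None

try:
    TokopediaListingItem(product_url="https://example.com")  # missing product_name
except ValidationError as err:
    print(err)  # reports which required field is missing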

The Scraping Process

Our scraper works in two stages:

1. Crawling Product Listings

First, we fetch search results pages:

async def crawl_tokopedia_listings(query: str = "mouse-wireless", max_pages: int = 1):
    listing_strategy = LLMExtractionStrategy(
        provider="gemini/gemini-1.5-pro",
        api_token=os.getenv("GEMINI_API_KEY"),
        schema=TokopediaListingItem.model_json_schema(),
        instruction=(
            "Extract structured data for each product in the listing. "
            "Each product should have: product_name, product_url, price,"
            "store_name, rating (scale 1-5), image_url."
        ),
        verbose=True,
    )

    all_results = []

    async with AsyncWebCrawler(verbose=True) as crawler:
        for page in range(1, max_pages + 1):
            url = f"https://www.tokopedia.com/find/{query}?page={page}"
            result = await crawler.arun(
                url=url,
                extraction_strategy=listing_strategy,
                word_count_threshold=1,
                cache_mode=CacheMode.DISABLED,
            )
            data = json.loads(result.extracted_content)
            all_results.extend(data)

    return all_results


2. Fetching Product Details

Then, for each product URL we found, we fetch its detailed information:

async def crawl_tokopedia_detail(product_url: str):
    detail_strategy = LLMExtractionStrategy(
        provider="gemini/gemini-1.5-pro",
        api_token=os.getenv("GEMINI_API_KEY"),
        schema=TokopediaProductDetail.model_json_schema(),
        instruction=(
            "Extract fields like product_name, all_images (list), specs,"
            "description, variants (list), satisfaction_percentage,"
            "total_ratings, total_reviews, stock availability."
        ),
        verbose=False,
    )

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=product_url,
            extraction_strategy=detail_strategy,
            word_count_threshold=1,
            cache_mode=CacheMode.DISABLED,
        )

        parsed_data = json.loads(result.extracted_content)
        # The LLM strategy may return a list of extracted blocks; take the first one if so
        if isinstance(parsed_data, list):
            parsed_data = parsed_data[0]
        return TokopediaProductDetail(**parsed_data)


Putting It All Together

Finally, we combine both stages into a single function:

async def run_full_scrape(query="mouse-wireless", max_pages=2, limit=15):
    listings = await crawl_tokopedia_listings(query=query, max_pages=max_pages)
    listings_subset = listings[:limit]

    all_data = []
    for i, item in enumerate(listings_subset, start=1):
        detail_data = await crawl_tokopedia_detail(item["product_url"])
        combined_data = {
            "listing_data": item,
            "detail_data": detail_data.dict(),
        }
        all_data.append(combined_data)
        print(f"[Detail] Scraped {i}/{len(listings_subset)}")

    return all_data


Running the Scraper

Here’s how to use it:

# Scrape the first 5 products from page 1
results = await run_full_scrape("mouse-wireless", max_pages=1, limit=5)

# Print the results nicely formatted
for result in results:
    print(json.dumps(result, indent=4))


Pro Tips

  1. Rate Limiting: Be respectful of Tokopedia’s servers. Add delays between requests if scraping many pages (see the sketch after this list).

  2. Caching: Enable crawl4ai’s cache during development:

cache_mode=CacheMode.ENABLED


  3. Error Handling: The code includes only basic error handling, so you may want to add retries for production use (the sketch below includes a simple retry wrapper).

  4. API Keys: Store your Gemini API key in environment variables, not in the code.
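As a rough illustration of the rate-limiting and retry tips, here’s a minimal sketch of a delay-and-retry wrapper around the detail crawl. The delay value and retry count are arbitrary choices of mine, not recommendations from crawl4ai:

import asyncio

async def crawl_detail_with_retry(product_url: str, retries: int = 3, delay_seconds: float = 2.0):
    """Fetch product detail with a polite delay and simple retry logic."""
    for attempt in range(1, retries + 1):
        try:
            # Wait before each request to avoid hammering the server
            await asyncio.sleep(delay_seconds)
            return await crawl_tokopedia_detail(product_url)
        except Exception as exc:
            print(f"Attempt {attempt}/{retries} failed for {product_url}: {exc}")
            if attempt == retries:
                raise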

What’s Next?

You could extend this scraper to:

  • Save data to a database (see the sketch after this list)
  • Track price changes over time
  • Analyze product trends
  • Compare prices across stores
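For instance, persisting the combined results to SQLite could look roughly like this (a minimal sketch assuming the run_full_scrape output shape shown above; the table and column names are my own choices):

import json
import sqlite3

def save_results(results, db_path="tokopedia_products.db"):
    """Store each scraped product as a row of JSON blobs in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(product_url TEXT PRIMARY KEY, listing_json TEXT, detail_json TEXT)"
    )
    for row in results:
        conn.execute(
            "INSERT OR REPLACE INTO products VALUES (?, ?, ?)",
            (
                row["listing_data"]["product_url"],
                json.dumps(row["listing_data"]),
                json.dumps(row["detail_data"]),
            ),
        )
    conn.commit()
    conn.close()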

Wrapping up

Using crawl4ai with LLM-based extraction makes web scraping much more maintainable than traditional methods. The combination with Pydantic ensures your data is well-structured and validated.

Remember to always check a website’s robots.txt and terms of service before scraping. Happy coding!


Important links:

Crawl4AI

Pydantic


Note: The complete code is available in the Colab notebook. Feel free to try it out and adapt it for your needs.
