TLDR:
Learn how to build an e-commerce scraper using crawl4ai’s LLM-based extraction and Pydantic models. The scraper fetches both listing data (names, prices) and detailed product information (specs, reviews) asynchronously.
Try the full code in Google Colab
Ever wanted to analyze e-commerce product data but found traditional web scraping too complex? In this guide, I’ll show you how to build a reliable scraper using modern Python tools. We’ll use crawl4ai for intelligent extraction and Pydantic for clean data modeling.
Why Crawl4AI and Pydantic?
- Crawl4AI: A robust library that simplifies web crawling and scraping by leveraging AI-based extraction strategies.
- Pydantic: A Python library for data validation and settings management, ensuring the scraped data adheres to predefined schemas.
Why Scrape Tokopedia?
Tokopedia is one of Indonesia’s largest e-commerce platforms. I live in Indonesia and use the platform a lot, though I am not an employee and have no affiliation with them :). The same approach should work for any e-commerce site you like. If you’re a developer intrigued by e-commerce analytics, market research, or automated data gathering, scraping these listings can be quite useful.
What Makes This Approach Different?
Instead of wrestling with complex CSS selectors or XPath queries, we’re using crawl4ai’s LLM-based extraction. This means:
- More resilient to website changes
- Cleaner, structured data output
- Less maintenance headache
Setting Up Your Environment
First, let’s install our required packages:
%pip install -U crawl4ai
%pip install nest_asyncio
%pip install pydantic
We’ll also need nest_asyncio for running async code in notebooks:
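import asyncio
import json
import os
import nest_asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
# Allow nested event loops (needed inside Jupyter/Colab)
nest_asyncio.apply()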
Defining Our Data Models
We’ll use Pydantic to define exactly what data we want to extract. Here are our two main models:
from pydantic import BaseModel, Field
from typing import List, Optional

class TokopediaListingItem(BaseModel):
    # Required fields: every listing must have a name and a URL
    product_name: str = Field(..., description="Name of the product in listing.")
    product_url: str = Field(..., description="URL link to product detail.")
    # Optional fields: the LLM may return null when a value is missing
    price: Optional[str] = Field(None, description="Price displayed in listing.")
    store_name: Optional[str] = Field(None, description="Store name from listing.")
    rating: Optional[str] = Field(None, description="Rating displayed in listing.")
    image_url: Optional[str] = Field(None, description="Primary image from listing.")

class TokopediaProductDetail(BaseModel):
    product_name: str = Field(..., description="Name of product on detail page.")
    all_images: List[str] = Field(default_factory=list, description="List of all product image URLs.")
    specs: Optional[str] = Field(None, description="Technical specifications or short info.")
    description: Optional[str] = Field(None, description="Long product description.")
    variants: List[str] = Field(default_factory=list, description="List of variants or color options.")
    satisfaction_percentage: Optional[str] = Field(None, description="Percentage of satisfied customers.")
    total_ratings: Optional[str] = Field(None, description="Number of ratings.")
    total_reviews: Optional[str] = Field(None, description="Number of reviews.")
    stock: Optional[str] = Field(None, description="Stock availability.")
These models act as a contract for what data we expect to extract. They also provide automatic validation and clear documentation.
The Scraping Process
Our scraper works in two stages:
1. Crawling Product Listings
First, we fetch search results pages:
async def crawl_tokopedia_listings(query: str = "mouse-wireless", max_pages: int = 1):
    # Tell the LLM which fields to pull out of each search results page
    listing_strategy = LLMExtractionStrategy(
        provider="gemini/gemini-1.5-pro",
        api_token=os.getenv("GEMINI_API_KEY"),
        schema=TokopediaListingItem.model_json_schema(),
        instruction=(
            "Extract structured data for each product in the listing. "
            "Each product should have: product_name, product_url, price, "
            "store_name, rating (scale 1-5), image_url."
        ),
        verbose=True,
    )
    all_results = []
    async with AsyncWebCrawler(verbose=True) as crawler:
        for page in range(1, max_pages + 1):
            url = f"https://www.tokopedia.com/find/{query}?page={page}"
            result = await crawler.arun(
                url=url,
                extraction_strategy=listing_strategy,
                word_count_threshold=1,
                cache_mode=CacheMode.DISABLED,
            )
            # extracted_content is a JSON string holding the extracted items
            data = json.loads(result.extracted_content)
            all_results.extend(data)
    return all_results
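The search URL above is built with a plain f-string, which works for a query that is already hyphenated. Here is a minimal sketch of a helper that builds one URL per page from free-form input. The name `build_listing_urls` is my own, and the assumption that multi-word queries are joined with hyphens comes only from the "mouse-wireless" example, so check it against the site before relying on it.

```python
from typing import List
from urllib.parse import quote, urlencode

BASE_URL = "https://www.tokopedia.com/find"  # same base URL as the crawler above


def build_listing_urls(query: str, max_pages: int = 1) -> List[str]:
    """Return one search URL per page, starting at page 1."""
    # Assumption: multi-word queries are hyphenated, as in "mouse-wireless".
    slug = quote("-".join(query.lower().split()), safe="-")
    return [
        f"{BASE_URL}/{slug}?{urlencode({'page': page})}"
        for page in range(1, max_pages + 1)
    ]


listing_urls = build_listing_urls("Mouse Wireless", max_pages=2)
print(listing_urls)
```

Percent-encoding the slug keeps unusual characters in a query from producing a malformed URL.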
2. Fetching Product Details
Then, for each product URL we found, we fetch its detailed information:
async def crawl_tokopedia_detail(product_url: str):
    # Tell the LLM which fields to pull out of the product detail page
    detail_strategy = LLMExtractionStrategy(
        provider="gemini/gemini-1.5-pro",
        api_token=os.getenv("GEMINI_API_KEY"),
        schema=TokopediaProductDetail.model_json_schema(),
        instruction=(
            "Extract fields like product_name, all_images (list), specs, "
            "description, variants (list), satisfaction_percentage, "
            "total_ratings, total_reviews, stock availability."
        ),
        verbose=False,
    )
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=product_url,
            extraction_strategy=detail_strategy,
            word_count_threshold=1,
            cache_mode=CacheMode.DISABLED,
        )
    parsed_data = json.loads(result.extracted_content)
    # The strategy may return a list of extracted blocks; use the first one
    if isinstance(parsed_data, list):
        parsed_data = parsed_data[0]
    # Validate the raw dict against our Pydantic model
    return TokopediaProductDetail(**parsed_data)
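Both crawl functions pass `result.extracted_content` straight into `json.loads`, so an empty or malformed response raises an exception. Below is a small sketch of a defensive parser that could sit in between. `parse_extracted` is a hypothetical helper of mine, not part of crawl4ai, and it assumes the content may be missing, a single JSON object, or a JSON list.

```python
import json
from typing import Any, Dict, List, Optional


def parse_extracted(raw: Optional[str]) -> List[Dict[str, Any]]:
    """Normalize extracted JSON text into a list of dicts."""
    if not raw:
        return []  # nothing was extracted
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # the model returned something that is not valid JSON
    if isinstance(data, dict):
        data = [data]  # wrap a single object so callers always get a list
    if not isinstance(data, list):
        return []
    # Keep only dict items; drop stray strings or numbers
    return [item for item in data if isinstance(item, dict)]


print(parse_extracted('{"product_name": "Mouse"}'))
```

With this in place, a failed page produces an empty list that you can log and skip, and one bad page no longer stops the whole run.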
Putting It All Together
Finally, we combine both stages into a single function:
async def run_full_scrape(query="mouse-wireless", max_pages=2, limit=15):
    # Stage 1: collect listing items from the search results pages
    listings = await crawl_tokopedia_listings(query=query, max_pages=max_pages)
    listings_subset = listings[:limit]
    all_data = []
    # Stage 2: fetch the detail page for each listing, one at a time
    for i, item in enumerate(listings_subset, start=1):
        detail_data = await crawl_tokopedia_detail(item["product_url"])
        combined_data = {
            "listing_data": item,
            # model_dump() is the Pydantic v2 replacement for .dict()
            "detail_data": detail_data.model_dump(),
        }
        all_data.append(combined_data)
        print(f"[Detail] Scraped {i}/{len(listings_subset)}")
    return all_data
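The loop above awaits each detail page in turn, so the requests run sequentially. If you want some overlap without hammering the site, one option is a semaphore-bounded `asyncio.gather`. This is only a sketch: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `crawl_tokopedia_detail` so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[Dict]],
    max_concurrency: int = 3,
) -> List[Dict]:
    """Run fetch(url) for every URL, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> Dict:
        async with semaphore:  # wait here while too many fetches are in flight
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(url) for url in urls))


async def fake_fetch(url: str) -> Dict:
    """Stand-in for crawl_tokopedia_detail so the sketch runs offline."""
    await asyncio.sleep(0.01)
    return {"product_url": url}


demo_urls = [f"https://example.com/p/{i}" for i in range(5)]
demo_results = asyncio.run(fetch_all(demo_urls, fake_fetch))
print(len(demo_results))
```

In a notebook you would write `await fetch_all(...)` in place of `asyncio.run(...)`. Keep `max_concurrency` low, since every detail fetch also costs an LLM call.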
Running the Scraper
Here’s how to use it:
# Scrape first 5 products from page 1
results = await run_full_scrape("mouse-wireless", max_pages=1, limit=5)
# Print results nicely formatted
for result in results:
    print(json.dumps(result, indent=4))
Pro Tips
- Rate Limiting: Be respectful of Tokopedia’s servers. Add delays between requests if scraping many pages.
- Caching: Enable crawl4ai’s cache during development:
    cache_mode=CacheMode.ENABLED
- Error Handling: The snippets above do very little error handling (a failed crawl or malformed JSON will raise an exception), so add checks and retries for production use.
- API Keys: Store your Gemini API key in environment variables, not in the code.
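To make the rate limiting and retry tips concrete, here is a minimal sketch of a retry wrapper with exponential backoff. `with_retries` is a hypothetical helper of mine, not a crawl4ai feature, and the `flaky` coroutine only simulates a request that fails twice before succeeding.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    task: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Await task(), retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await task()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... plus a little random jitter
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay / 10)
            await asyncio.sleep(delay)
    raise RuntimeError("attempts must be at least 1")


calls = {"count": 0}


async def flaky() -> str:
    """Simulated request: fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


outcome = asyncio.run(with_retries(flaky, attempts=3, base_delay=0.01))
print(outcome, calls["count"])
```

In the scraper you could call it as `await with_retries(lambda: crawl_tokopedia_detail(url))`, and add a plain `await asyncio.sleep(...)` between pages as a simple form of rate limiting.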
What’s Next?
You could extend this scraper to:
- Save data to a database
- Track price changes over time
- Analyze product trends
- Compare prices across stores
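As a starting point for the first two ideas, here is a rough sketch that stores price observations in SQLite using only the standard library. The table layout and the helper names are my own, and `parse_price` assumes prices look like "Rp1.250.000" (dots as thousands separators, no decimals), so adjust it to whatever the LLM actually returns.

```python
import re
import sqlite3
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[int]:
    """Convert a price string such as 'Rp1.250.000' to an integer."""
    # Assumption: no decimal part, so stripping every non-digit is safe.
    digits = re.sub(r"\D", "", text or "")
    return int(digits) if digits else None


conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
conn.execute(
    """
    CREATE TABLE price_history (
        product_url TEXT NOT NULL,
        price INTEGER,
        scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
    """
)


def record_price(product_url: str, price_text: Optional[str]) -> None:
    """Append one price observation for a product."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO price_history (product_url, price) VALUES (?, ?)",
            (product_url, parse_price(price_text)),
        )


record_price("https://example.com/p/1", "Rp150.000")
record_price("https://example.com/p/1", "Rp135.000")

rows = conn.execute(
    "SELECT price FROM price_history WHERE product_url = ? ORDER BY rowid",
    ("https://example.com/p/1",),
).fetchall()
print([price for (price,) in rows])
```

Each scrape run appends rows and never overwrites them, so the price history for a product is just a query ordered by insertion.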
Wrapping up
Using crawl4ai with LLM-based extraction makes web scraping much more maintainable than traditional methods. The combination with Pydantic ensures your data is well-structured and validated.
Remember to always check a website’s robots.txt and terms of service before scraping. Happy coding!
Important links:
Crawl4AI
- Official Website: https://crawl4ai.com
- GitHub Repository: https://github.com/unclecode/crawl4ai
- Documentation: https://crawl4ai.com/mkdocs/core/installation/
Pydantic
- Official Documentation: https://docs.pydantic.dev/latest/
- PyPI Page: https://pypi.org/project/pydantic/
- GitHub Repository: https://github.com/pydantic/pydantic
Note: The complete code is available in the Colab notebook. Feel free to try it out and adapt it for your needs.
Original article: Building an Async E-Commerce Web Scraper with Pydantic, Crawl4ai & Gemini