Introduction
This article explores the architecture and implementation of a high-performance web scraper built to extract product data from e-commerce platforms. The scraper uses multiple Python libraries and techniques to efficiently process thousands of products while maintaining resilience against common scraping challenges.
Technical Architecture
The scraper is built on a fully asynchronous foundation using Python's asyncio ecosystem, with these key components:
- Network Layer: aiohttp for async HTTP requests with connection pooling
- DOM Processing: BeautifulSoup4 for HTML parsing
- Dynamic Content: Playwright for JavaScript-rendered content extraction
- Data Processing: pandas for data manipulation and export
Implementation Highlights
Concurrency Management
The scraper implements a worker pool pattern with configurable concurrency limits:
```python
# Concurrency settings
self.max_workers = int(os.getenv('MAX_WORKERS'))
self.max_connections = int(os.getenv('MAX_CONNECTIONS'))

# TCP connection pooling
connector = aiohttp.TCPConnector(
    limit=self.max_connections,
    resolver=resolver  # Custom DNS resolver
)
```
This prevents overwhelming the target server while maximizing throughput.
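One common way to realize the worker pool on top of these settings is an asyncio.Semaphore that caps the number of in-flight requests at max_workers. The sketch below shows that approach; it is an assumption about the implementation rather than the scraper's actual code:

```python
# Minimal worker-pool sketch: a semaphore bounds concurrent fetches (assumed approach)
self.semaphore = asyncio.Semaphore(self.max_workers)

async def bounded_fetch(self, session, url):
    async with self.semaphore:          # at most max_workers requests in flight
        return await self.fetch_url(session, url)

# Workers can then be launched in bulk without exceeding the limit:
# results = await asyncio.gather(*(self.bounded_fetch(session, u) for u in urls))
```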
Resilient Network Requests
The network layer implements sophisticated retry logic with exponential backoff:
```python
async def fetch_url(self, session, url):
    retries = 0
    while retries < self.max_retries:
        try:
            headers = {
                'User-Agent': self.user_agent.random,
                # Additional headers omitted for brevity
            }
            async with session.get(url, headers=headers, timeout=self.request_timeout) as response:
                if response.status == 200:
                    return await response.read()
                elif response.status == 429:
                    # Rate limit handling
                    retry_after = int(response.headers.get('Retry-After',
                                                           self.retry_backoff ** (retries + 2)))
                    await asyncio.sleep(retry_after)

            # Retry with exponential backoff
            retries += 1
            wait_time = self.retry_backoff ** (retries + 1)
            await asyncio.sleep(wait_time)
        except (asyncio.TimeoutError, aiohttp.ClientError) as e:
            logger.warning(f"Network error: {e}")
            retries += 1
```
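The max_retries, retry_backoff, and request_timeout attributes used above are not shown in the snippet. A plausible, environment-driven configuration might look like the following; the variable names and default values here are assumptions:

```python
# Assumed retry/backoff configuration; names mirror the snippet above, defaults are illustrative
self.max_retries = int(os.getenv('MAX_RETRIES', '3'))
self.retry_backoff = float(os.getenv('RETRY_BACKOFF', '2'))   # base of the exponential backoff
self.request_timeout = aiohttp.ClientTimeout(total=30)        # per-request timeout in seconds
```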
Hybrid Content Extraction
The scraper employs a two-phase extraction approach:
- Static HTML Parsing: Uses BeautifulSoup to extract readily available content
- Dynamic Content Extraction: Uses Playwright to handle JavaScript-rendered elements
```python
async def fetch_product(self, session, url, page):
    # Static content extraction (content is the raw HTML fetched earlier, not shown here)
    with concurrent.futures.ThreadPoolExecutor() as executor:
        loop = asyncio.get_event_loop()
        product = await loop.run_in_executor(
            executor,
            partial(self.scrape_product_html, content, url)
        )

    # Dynamic content extraction
    image_url, description = await self.scrape_dynamic_content_playwright(page, url)
    product.image_url = image_url
    product.description = description
    return product
```
This approach optimizes for both speed and completeness.
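The scrape_dynamic_content_playwright helper is referenced but not shown. A minimal sketch of what such a helper could look like follows; the selectors are placeholders for illustration, not the ones used against any real site:

```python
# Hypothetical sketch of the dynamic-extraction helper; the selectors are placeholders
async def scrape_dynamic_content_playwright(self, page, url):
    await page.goto(url, wait_until="networkidle")
    image_url = await page.get_attribute("img.product-image", "src")
    description = await page.inner_text("div.product-description")
    return image_url, description
```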
DNS Resilience
The scraper implements DNS fallbacks to handle potential DNS resolution issues:
```python
try:
    import aiodns
    resolver = aiohttp.AsyncResolver(nameservers=["8.8.8.8", "1.1.1.1"])
except ImportError:
    logger.warning("aiodns library not found. Falling back to default resolver.")
    resolver = None
```
Data Processing Pipeline
The scraper implements a thread-safe queue for handling scraped data:
```python
# Thread-safe queue for results
self.results_queue = queue.Queue()

# Data processing
def save_results_from_queue(self):
    products = []
    while not self.results_queue.empty():
        try:
            products.append(self.results_queue.get_nowait())
        except queue.Empty:
            break

    if products:
        df = pd.DataFrame(products)
        # Save to CSV with proper encoding and escaping
        df.to_csv(
            filename,
            index=False,
            encoding='utf-8-sig',
            escapechar='\\',
            quoting=csv.QUOTE_ALL
        )
```
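Because queue.Queue is thread-safe, worker coroutines and thread-pool helpers can hand results to the pipeline without extra locking. Below is a sketch of that glue code, assuming fetch_product returns the populated product object; the worker itself is an assumption, not code from the article:

```python
# Illustrative worker that feeds the results queue (assumed glue code)
async def worker(self, session, url, page):
    product = await self.fetch_product(session, url, page)
    if product is not None:
        self.results_queue.put(product)    # queue.Queue.put is thread-safe
```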
Performance Optimizations
Several techniques are employed to maximize throughput (a sketch of the batching and delay logic follows the list):
- Batch Processing: Products are processed in configurable batches
- Random Delays: Randomized delays between requests reduce the risk of triggering anti-bot detection
- Connection Pooling: TCP connection reuse reduces overhead
- ThreadPoolExecutor: CPU-bound tasks are offloaded to prevent blocking the event loop
- Sampling: For large datasets, statistical sampling is used to estimate total counts
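As a concrete illustration of the first two items, the sketch below combines batching with randomized inter-batch delays. The batch size, delay bounds, and process_url helper are assumptions made for the example:

```python
import asyncio
import random

# Illustrative batching with randomized inter-batch delays (values are assumptions)
async def process_in_batches(self, urls, batch_size=50):
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        await asyncio.gather(*(self.process_url(u) for u in batch))
        # Randomized pause so the request pattern has no fixed rhythm
        await asyncio.sleep(random.uniform(1.0, 3.0))
```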
Error Handling and Reliability
The scraper implements comprehensive error handling:
```python
try:
    ...  # Scraping logic
except Exception as e:
    logger.error(f"Error in scrape_all_products: {e}")
    # Save any results in queue before exiting
    self.save_results_from_queue()
    raise
```
This ensures that even if the scraper crashes, partial results are saved.
Conclusion
The architecture outlined here demonstrates how to build a high-performance web scraper that balances speed, reliability, and target server courtesy. By leveraging asynchronous programming, connection pooling, and hybrid content extraction techniques, the scraper can efficiently process thousands of products while maintaining resilience against common scraping challenges.
Key takeaways:
- Asynchronous programming is essential for high-performance web scraping
- Hybrid static/dynamic extraction maximizes data completeness
- Proper error handling and resilience mechanisms are crucial for production use
- Configurable parameters allow for fine-tuning based on target site characteristics