5 Advanced Python Web Crawling Techniques for Efficient Data Collection

As a best-selling author, I invite you to explore my books on Amazon. Don’t forget to follow me on Medium and show your support. Thank you! Your support means the world!

Web crawling is a crucial technique for gathering data from the internet. As a developer, I’ve found that Python offers powerful tools for building efficient and scalable web crawlers. In this article, I’ll share five advanced techniques that have significantly improved my web crawling projects.

Asynchronous Crawling with asyncio and aiohttp

One of the most effective ways to boost a web crawler’s performance is by implementing asynchronous programming. Python’s asyncio library, combined with aiohttp, allows for concurrent HTTP requests, dramatically increasing the speed of data collection.

Here’s a basic example of how to implement asynchronous crawling:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def parse(html):
    soup = BeautifulSoup(html, 'lxml')
    # Extract and process data here; as a placeholder, return the page title
    data = soup.title.string if soup.title else None
    return data

async def crawl(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        pages = await asyncio.gather(*tasks)
        results = [await parse(page) for page in pages]
    return results

urls = ['http://example.com', 'http://example.org', 'http://example.net']
results = asyncio.run(crawl(urls))


This code demonstrates how to fetch multiple URLs concurrently and parse the HTML content asynchronously. The asyncio.gather() function allows us to run multiple coroutines concurrently, significantly reducing the overall crawling time.
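The example above launches every request at once, which can overwhelm both your machine and the target servers. A common refinement is to cap concurrency with an asyncio.Semaphore; here is a minimal sketch (the function names and the limit of 10 are illustrative choices, not part of the original example):

import asyncio
import aiohttp

async def fetch_limited(session, semaphore, url):
    # The semaphore caps how many requests are in flight at once
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()

async def crawl_limited(urls, max_concurrency=10):
    semaphore = asyncio.Semaphore(max_concurrency)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_limited(session, semaphore, url) for url in urls]
        return await asyncio.gather(*tasks)

urls = ['http://example.com', 'http://example.org', 'http://example.net']
pages = asyncio.run(crawl_limited(urls))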

Distributed Crawling with Scrapy and ScrapyRT

For large-scale crawling projects, a distributed approach can be highly beneficial. Scrapy, a powerful web scraping framework, combined with ScrapyRT (Scrapy Real-Time), enables real-time, distributed web crawling.

Here’s a simple Scrapy spider example:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        for item in response.css('div.item'):
            yield {
                'title': item.css('h2::text').get(),
                'link': item.css('a::attr(href)').get(),
                'description': item.css('p::text').get()
            }

        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)


To use ScrapyRT for real-time extraction, you can set up a ScrapyRT server and make HTTP requests to it:

import requests

url = 'http://localhost:9080/crawl.json'
params = {
    'spider_name': 'example',
    'url': 'http://example.com'
}
response = requests.get(url, params=params)
data = response.json()


This approach allows for on-demand crawling and easy integration with other systems.
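Assuming a ScrapyRT service is running locally in front of the Scrapy project, the JSON response wraps the spider's output; in ScrapyRT's documented response format the scraped items appear under an 'items' key, but verify this against the version you deploy. A hedged sketch of pulling the items out:

import requests

response = requests.get(
    'http://localhost:9080/crawl.json',
    params={'spider_name': 'example', 'url': 'http://example.com'},
    timeout=30,
)
response.raise_for_status()
payload = response.json()

# The scraped items are expected under the 'items' key (check your ScrapyRT version)
for item in payload.get('items', []):
    print(item.get('title'), item.get('link'))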

Handling JavaScript-Rendered Content with Selenium

Many modern websites use JavaScript to render content dynamically. To handle such cases, Selenium WebDriver is an excellent tool. It allows us to automate web browsers and interact with JavaScript-rendered elements.

Here’s an example of using Selenium with Python:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://example.com")

# Wait for a specific element to load
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dynamic-content"))
)

# Extract data
data = element.text

driver.quit()


This code demonstrates how to wait for dynamic content to load before extracting it. Selenium is particularly useful for crawling single-page applications or websites with complex user interactions.
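In a crawling pipeline you will usually want to run the browser headless rather than opening a visible window. A minimal sketch using Chrome options (the '--headless=new' flag applies to recent Chrome releases; older versions use '--headless'):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')       # run without a visible window
options.add_argument('--window-size=1920,1080')

driver = webdriver.Chrome(options=options)
driver.get("http://example.com")
print(driver.title)
driver.quit()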

Using Proxies and IP Rotation

To avoid rate limiting and IP bans, it’s crucial to implement proxy rotation in your web crawler. This technique involves cycling through different IP addresses for each request.

Here’s an example of how to use proxies with the requests library:

import requests
from itertools import cycle

proxies = [
    {'http': 'http://proxy1.com:8080'},
    {'http': 'http://proxy2.com:8080'},
    {'http': 'http://proxy3.com:8080'}
]
proxy_pool = cycle(proxies)

urls = ['http://example.com', 'http://example.org', 'http://example.net']

for url in urls:
    proxy = next(proxy_pool)
    try:
        response = requests.get(url, proxies=proxy)
        # Process the response
    except requests.RequestException:
        # Handle the error and possibly remove the faulty proxy
        pass


This code cycles through a list of proxies for each request, helping to distribute the load and reduce the risk of being blocked.
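A natural extension is to retry a failed request with the next proxy instead of silently skipping the URL. Here is a minimal sketch; the helper name, timeout, and attempt count are illustrative assumptions:

import requests
from itertools import cycle

proxy_pool = cycle([
    {'http': 'http://proxy1.com:8080'},
    {'http': 'http://proxy2.com:8080'},
    {'http': 'http://proxy3.com:8080'},
])

def fetch_with_retries(url, max_attempts=3):
    # Try up to max_attempts different proxies before giving up
    for _ in range(max_attempts):
        proxy = next(proxy_pool)
        try:
            return requests.get(url, proxies=proxy, timeout=10)
        except requests.RequestException:
            continue
    return None

response = fetch_with_retries('http://example.com')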

Efficient HTML Parsing with lxml and CSS Selectors

For parsing HTML content, the lxml library combined with CSS selectors offers excellent performance and ease of use. Here’s an example:

from lxml import html
import requests

response = requests.get('http://example.com')
tree = html.fromstring(response.content)

# Extract data using CSS selectors
titles = tree.cssselect('h2.title')
links = tree.cssselect('a.link')

for title, link in zip(titles, links):
    print(title.text_content(), link.get('href'))


This approach is generally faster than BeautifulSoup, especially for large HTML documents. Note that lxml's cssselect() method relies on the separate cssselect package, so install it alongside lxml.
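If you prefer to avoid that extra dependency, lxml supports XPath natively, and the same extraction can be written as follows (the selectors assume the same hypothetical h2.title and a.link markup as above):

from lxml import html
import requests

response = requests.get('http://example.com')
tree = html.fromstring(response.content)

# XPath equivalents of the CSS selectors above; no cssselect package needed
titles = tree.xpath('//h2[@class="title"]/text()')
links = tree.xpath('//a[@class="link"]/@href')

for title, link in zip(titles, links):
    print(title, link)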

Best Practices for Scalable Web Crawling

When building scalable web crawlers, it’s important to follow best practices:

  1. Respect robots.txt: Always check and adhere to the rules set in the website’s robots.txt file.

  2. Implement polite crawling: Add delays between requests to avoid overwhelming the target server.

  3. Use proper user agents: Identify your crawler with an appropriate user agent string.

  4. Handle errors gracefully: Implement robust error handling and retry mechanisms.

  5. Store data efficiently: Use appropriate databases or file formats for storing large amounts of crawled data.

Here’s an example incorporating some of these practices:

import requests
import time
from urllib.robotparser import RobotFileParser

class PoliteCrawler:
    def __init__(self, delay=1):
        self.delay = delay
        self.user_agent = 'PoliteCrawler/1.0'
        self.headers = {'User-Agent': self.user_agent}
        self.rp = RobotFileParser()

    def can_fetch(self, url):
        parts = url.split('/')
        root = f"{parts[0]}//{parts[2]}"
        self.rp.set_url(f"{root}/robots.txt")
        self.rp.read()
        return self.rp.can_fetch(self.user_agent, url)

    def crawl(self, url):
        if not self.can_fetch(url):
            print(f"Crawling disallowed for {url}")
            return

        time.sleep(self.delay)
        try:
            response = requests.get(url, headers=self.headers)
            # Process the response
            print(f"Successfully crawled {url}")
        except requests.RequestException as e:
            print(f"Error crawling {url}: {e}")

crawler = PoliteCrawler()
crawler.crawl('http://example.com')


This crawler checks the robots.txt file, implements a delay between requests, and uses a custom user agent.
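It stops at a single attempt, though, so it doesn't yet cover the retry mechanisms from point 4 above. A minimal sketch of a retry helper with exponential backoff that could be dropped into its crawl method (the helper name, timeout, and retry counts are illustrative assumptions):

import time
import requests

def get_with_retries(url, headers=None, max_retries=3, backoff=2):
    # Retry transient failures, waiting 1s, 2s, 4s, ... between attempts
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff ** attempt)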

Managing Large-Scale Crawling Operations

For large-scale crawling operations, consider the following strategies:

  1. Use a message queue: Implement a distributed task queue like Celery to manage crawling jobs across multiple machines.

  2. Implement a crawl frontier: Use a dedicated crawl frontier to manage the list of URLs to be crawled, ensuring efficient URL prioritization and deduplication (see the sketch after this list).

  3. Monitor performance: Set up monitoring and logging to track the performance of your crawlers and quickly identify issues.

  4. Scale horizontally: Design your system to easily add more crawling nodes as needed.
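For the crawl frontier in point 2, even a simple in-memory structure illustrates the idea: queue URLs in order and drop duplicates before they are ever scheduled. This is a minimal sketch (the class name is an illustrative choice; production frontiers are typically backed by a database or a dedicated framework such as Frontera):

from collections import deque

class CrawlFrontier:
    # Minimal in-memory frontier: FIFO ordering plus URL deduplication
    def __init__(self, seed_urls=None):
        self.queue = deque()
        self.seen = set()
        for url in seed_urls or []:
            self.add(url)

    def add(self, url):
        # Skip URLs that have already been queued or crawled
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None

frontier = CrawlFrontier(['http://example.com'])
frontier.add('http://example.org')
frontier.add('http://example.com')  # duplicate, ignored
print(frontier.next_url())          # -> http://example.com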

Here’s a basic example of using Celery for distributed crawling:

from celery import Celery
import requests

app = Celery('crawler', broker='redis://localhost:6379')

@app.task
def crawl_url(url):
    response = requests.get(url)
    # Process the response
    return f"Crawled {url}"

# In your main application
urls = ['http://example.com', 'http://example.org', 'http://example.net']
results = [crawl_url.delay(url) for url in urls]


This setup allows you to distribute crawling tasks across multiple worker processes or machines.
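Note that the snippet above only configures a broker, so the returned AsyncResult objects can't be read back. To collect the tasks' return values, a result backend has to be configured as well; here is a hedged sketch assuming Redis also serves as the backend (the database numbers are arbitrary choices):

from celery import Celery
import requests

# Broker for dispatching tasks plus a result backend (assumed Redis here)
app = Celery(
    'crawler',
    broker='redis://localhost:6379/0',
    backend='redis://localhost:6379/1',
)

@app.task
def crawl_url(url):
    response = requests.get(url)
    return f"Crawled {url} ({response.status_code})"

urls = ['http://example.com', 'http://example.org', 'http://example.net']
async_results = [crawl_url.delay(url) for url in urls]
print([r.get(timeout=30) for r in async_results])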

Building scalable web crawlers in Python requires a combination of efficient coding practices, the right tools, and a good understanding of web technologies. By implementing these five techniques – asynchronous crawling, distributed crawling, handling JavaScript content, using proxies, and efficient HTML parsing – you can create powerful and efficient web crawlers capable of handling large-scale data collection tasks.

Remember to always respect website terms of service and legal requirements when crawling. Ethical web scraping practices are crucial for maintaining a healthy internet ecosystem.

As you develop your web crawling projects, you’ll likely encounter unique challenges specific to your use case. Don’t hesitate to adapt these techniques and explore additional libraries and tools to meet your specific needs. With Python’s rich ecosystem and versatile libraries, you’re well-equipped to tackle even the most complex web crawling tasks.


101 Books

101 Books is an AI-driven publishing company co-founded by author Aarav Joshi. By leveraging advanced AI technology, we keep our publishing costs incredibly low—some books are priced as low as $4—making quality knowledge accessible to everyone.

Check out our book Golang Clean Code available on Amazon.

Stay tuned for updates and exciting news. When shopping for books, search for Aarav Joshi to find more of our titles. Use the provided link to enjoy special discounts!

Our Creations

Be sure to check out our creations:

Investor Central | Investor Central Spanish | Investor Central German | Smart Living | Epochs & Echoes | Puzzling Mysteries | Hindutva | Elite Dev | JS Schools


We are on Medium

Tech Koala Insights | Epochs & Echoes World | Investor Central Medium | Puzzling Mysteries Medium | Science & Epochs Medium | Modern Hindutva
