List crawling is a specialized form of web scraping that focuses on extracting collections of similar items from websites. Whether you’re gathering product catalogs, monitoring pricing across e-commerce platforms, or building a database of ranked content, list crawling provides the foundation for efficient and organized data collection.
In this article, we will explore practical techniques for crawling different types of web lists, from product catalogs and infinite scrolling pages to articles, tables, and search results.
## What is List Crawling?
List crawling refers to the automated process of extracting collections of similar items from web pages.
Unlike general web scraping that might target diverse information from a page, list crawling specifically focuses on groups of structured data that follow consistent patterns such as product listings, search results, rankings, or tabular data.
## Crawler Setup
Setting up a basic list crawler requires a few essential components. Python, with its rich ecosystem of libraries, offers an excellent foundation for building effective crawlers.
For our list crawling examples we’ll use Python with the following libraries:
- requests – as our HTTP client for retrieving pages.
- BeautifulSoup – for parsing HTML data using CSS selectors.
- Playwright – for automating a real web browser for crawling tasks.
All of these can be installed using this pip command:
```
$ pip install beautifulsoup4 requests playwright
```
Once you have these libraries installed, here's a simple example list crawler that scrapes a single item listing page:
```python
import requests
from bs4 import BeautifulSoup

def crawl_static_list(url):
    # Send HTTP request to the target URL
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    # Parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")
    # Find all product items
    items = soup.select("div.row.product")
    # Extract data from each item
    results = []
    for item in items:
        title = item.select_one("h3.mb-0 a").text.strip()
        price = item.select_one("div.price").text.strip()
        results.append({"title": title, "price": price})
    return results

url = "https://web-scraping.dev/products"
data = crawl_static_list(url)
print(f"Found {len(data)} items")
for item in data[:3]:  # Print first 3 items as example
    print(f"Title: {item['title']}, Price: {item['price']}")
```
Example Output
```
Found 5 items
Title: Box of Chocolate Candy, Price: 24.99
Title: Dark Red Energy Potion, Price: 4.99
Title: Teal Energy Potion, Price: 4.99
```
In the above code, we’re making an HTTP request to a target URL, parsing the HTML content using BeautifulSoup, and then extracting specific data points from each list item.
This approach works well for simple, static lists where all content is loaded immediately. For more complex scenarios like paginated or dynamically loaded lists, you’ll need to extend this foundation with additional techniques we’ll cover in subsequent sections.
Your crawler’s effectiveness largely depends on how well you understand the structure of the target website. Taking time to inspect the HTML using browser developer tools will help you craft precise selectors that accurately target the desired elements.
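Since not every item exposes every field, it also helps to wrap selector calls in a small helper that tolerates missing elements instead of raising an `AttributeError`. Below is a minimal sketch building on the crawler above; the `safe_text` helper is our own illustrative addition, not part of BeautifulSoup:

```python
import requests
from bs4 import BeautifulSoup

def safe_text(parent, selector, default=None):
    """Return stripped text for the first CSS selector match, or a default if missing."""
    element = parent.select_one(selector)
    return element.text.strip() if element else default

def crawl_static_list_safe(url):
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.text, "html.parser")
    results = []
    for item in soup.select("div.row.product"):
        results.append({
            "title": safe_text(item, "h3.mb-0 a"),
            "price": safe_text(item, "div.price"),
        })
    return results

print(crawl_static_list_safe("https://web-scraping.dev/products")[:3])
```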
Let's now see how we can enhance our basic crawler with more advanced capabilities and different list crawling scenarios.
## Power-Up with Scrapfly
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
- Anti-bot protection bypass – scrape web pages without blocking!
- Rotating residential proxies – prevent IP address and geographic blocks.
- JavaScript rendering – scrape dynamic web pages through cloud browsers.
- Full browser automation – control browsers to scroll, input and click on objects.
- Format conversion – scrape as HTML, JSON, Text, or Markdown.
- Python and TypeScript SDKs, as well as Scrapy and no-code tool integrations.
Here’s an example of how to scrape a product with the Scrapfly web scraping API:
```python
from scrapfly import ScrapflyClient, ScrapeConfig

# Create a ScrapflyClient instance
client = ScrapflyClient(key='YOUR-SCRAPFLY-KEY')

# Create scrape requests
api_result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/product/1",
    # optional: set country to get localized results
    country="us",
    # optional: use cloud browsers
    render_js=True,
    # optional: scroll to the bottom of the page
    auto_scroll=True,
))
print(api_result.result["context"])  # metadata
print(api_result.result["config"])  # request data
print(api_result.scrape_result["content"])  # result html content

# parse data yourself
product = {
    "title": api_result.selector.css("h3.product-title::text").get(),
    "price": api_result.selector.css(".product-price::text").get(),
    "description": api_result.selector.css(".product-description::text").get(),
}
print(product)

# or let AI parser extract it for you!
api_result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/product/1",
    # use AI models to find ALL product data available on the page
    extraction_model="product"
))
```
Example Output
{"title": "Box of Chocolate Candy","price": "$9.99 ","description": "Indulge your sweet tooth with our Box of Chocolate Candy. Each box contains an assortment of rich, flavorful chocolates with a smooth, creamy filling. Choose from a variety of flavors including zesty orange and sweet cherry. Whether you're looking for the perfect gift or just want to treat yourself, our Box of Chocolate Candy is sure to satisfy.",}{ "title": "Box of Chocolate Candy", "price": "$9.99 ", "description": "Indulge your sweet tooth with our Box of Chocolate Candy. Each box contains an assortment of rich, flavorful chocolates with a smooth, creamy filling. Choose from a variety of flavors including zesty orange and sweet cherry. Whether you're looking for the perfect gift or just want to treat yourself, our Box of Chocolate Candy is sure to satisfy.", }{ "title": "Box of Chocolate Candy", "price": "$9.99 ", "description": "Indulge your sweet tooth with our Box of Chocolate Candy. Each box contains an assortment of rich, flavorful chocolates with a smooth, creamy filling. Choose from a variety of flavors including zesty orange and sweet cherry. Whether you're looking for the perfect gift or just want to treat yourself, our Box of Chocolate Candy is sure to satisfy.", }
## Paginated List Crawling
Paginated lists split the data across multiple pages with numbered navigation. This technique is common in e-commerce, search results, and data directories.
One example of paginated pages is web-scraping.dev/products which splits products through several pages.
paginated list on web-scraping.dev/products
### Example Crawler
Here’s how to build a product list crawler that handles traditional pagination:
```python
import requests
from bs4 import BeautifulSoup

# Get first page and extract pagination URLs
url = "https://web-scraping.dev/products"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
other_page_urls = set(a.attrs["href"] for a in soup.select(".paging>a") if a.attrs.get("href"))

# Extract product titles from first page
all_product_titles = [a.text.strip() for a in soup.select(".product h3 a")]

# Extract product titles from other pages
for url in other_page_urls:
    page_soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    all_product_titles.extend(a.text.strip() for a in page_soup.select(".product h3 a"))

# Print results
print(f"Total products found: {len(all_product_titles)}")
print("\nProduct Titles:")
for i, title in enumerate(all_product_titles, 1):
    print(f"{i}. {title}")
```
Example Output
```
Total products found: 30

Product Titles:
Box of Chocolate Candy
Dark Red Energy Potion
Teal Energy Potion
Red Energy Potion
Blue Energy Potion
Box of Chocolate Candy
Dark Red Energy Potion
Teal Energy Potion
Red Energy Potion
Blue Energy Potion
Dragon Energy Potion
Hiking Boots for Outdoor Adventures
Women's High Heel Sandals
Running Shoes for Men
Kids' Light-Up Sneakers
Classic Leather Sneakers
Cat-Ear Beanie
Box of Chocolate Candy
Dark Red Energy Potion
Teal Energy Potion
Red Energy Potion
Blue Energy Potion
Dragon Energy Potion
Hiking Boots for Outdoor Adventures
Women's High Heel Sandals
Running Shoes for Men
Kids' Light-Up Sneakers
Classic Leather Sneakers
Cat-Ear Beanie
Box of Chocolate Candy
```
In the above code, we first get the first page and extract pagination URLs. Then, we extract product titles from the first page and other pages. Finally, we print the total number of products found and the product titles.
### Crawling Challenges
While crawling product lists, you’ll encounter several challenges:
- **Pagination Variations**: Some sites use parameters like `?page=2` while others might use path segments like `/page/2/` or even completely different URL structures.
- **Paging Limiting**: Many sites restrict the maximum number of viewable pages (typically 20-50), even with thousands of products. Overcome this by using filters like price ranges to access the complete dataset, as demonstrated in our paging limit bypass tutorial and sketched in the example below.
- **Changing Layouts**: Product list layouts may vary across different categories or during site updates.
- **Missing Data**: Not all products will have complete information, requiring robust error handling.
Effective product list crawling requires adapting to these challenges with techniques like request throttling, robust selectors, and comprehensive error handling.
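As a concrete illustration of the paging-limit workaround mentioned above, the sketch below walks a set of price-range filters so that each filtered slice stays under the page cap. The `min_price`/`max_price` query parameters and the price bands are assumptions for illustration only; adapt them to whatever filters your target site actually exposes.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://web-scraping.dev/products"  # example target
# Hypothetical price bands; tune these to the target site's data distribution
PRICE_BANDS = [(0, 10), (10, 25), (25, 50), (50, 1000)]

def crawl_band(min_price, max_price, max_pages=20):
    """Crawl every page of a single price-filtered slice of the catalog."""
    titles = []
    for page in range(1, max_pages + 1):
        # 'min_price', 'max_price' and 'page' are assumed query parameters
        params = {"min_price": min_price, "max_price": max_price, "page": page}
        response = requests.get(BASE_URL, params=params, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(response.text, "html.parser")
        page_titles = [a.text.strip() for a in soup.select(".product h3 a")]
        if not page_titles:  # an empty page means we ran past the last page
            break
        titles.extend(page_titles)
    return titles

all_titles = []
for low, high in PRICE_BANDS:
    all_titles.extend(crawl_band(low, high))
print(f"Collected {len(all_titles)} products across {len(PRICE_BANDS)} price bands")
```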
Let’s now explore how to handle more dynamic lists that load content as you scroll.
## Endless List Crawling
Modern websites often implement infinite scrolling—a technique that continuously loads new content as the user scrolls down the page.
These “endless” lists present unique challenges for crawlers since the content isn’t divided into distinct pages but is loaded dynamically via JavaScript.
One example of infinite data lists is the web-scraping.dev/testimonials page:
endless list on web-scraping.dev/testimonials
Let’s see how we can crawl it next.
### Example Crawler
To tackle endless lists, the easiest method is to use a headless browser that can execute JavaScript and simulate scrolling. Here's an example using Playwright and Python:
```python
# This example is using Playwright but it's also possible to use Selenium with similar approach
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://web-scraping.dev/testimonials/")

    # scroll to the bottom:
    _prev_height = -1
    _max_scrolls = 100
    _scroll_count = 0
    while _scroll_count < _max_scrolls:
        # Execute JavaScript to scroll to the bottom of the page
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Wait for new content to load (change this value as needed)
        page.wait_for_timeout(1000)
        # Check whether the scroll height changed - means more pages are there
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == _prev_height:
            break
        _prev_height = new_height
        _scroll_count += 1

    # now we can collect all loaded data:
    results = []
    for element in page.locator(".testimonial").element_handles():
        text = element.query_selector(".text").inner_html()
        results.append(text)
    print(f"scraped {len(results)} results")
```
Example Output
```
scraped 60 results
```
In the above code, we are using Playwright to control a browser and scroll to the bottom of the page to load all the testimonials. We are then collecting the text of each testimonial and printing the number of testimonials scraped. This approach effectively handles endless lists that load content dynamically.
### Crawling Challenges
Endless list crawling comes with its own set of challenges:
- **Speed**: Browser crawling is much slower than API-based approaches. When possible, reverse engineer the site's API endpoints for direct data fetching, which is often thousands of times faster, as shown in our reverse engineering of endless paging guide (a minimal API-based sketch follows this list).
- **Resource Intensity**: Running a headless browser consumes significantly more resources than simple HTTP requests.
- **Element Staleness**: As the page updates, previously found elements may become “stale” and unusable, requiring refetching.
- **Scroll Triggers**: Some sites use scroll-percentage triggers rather than scrolling to the bottom, requiring more nuanced scroll simulation.
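To illustrate the speed point, here is a minimal sketch of collecting an endless list through a paging JSON endpoint instead of driving a browser. The `/api/testimonials` endpoint and its response shape are assumptions for illustration; in practice you would discover the real endpoint and parameters in your browser's network tab.

```python
import requests

# Hypothetical JSON endpoint, normally discovered via the browser's network tab
API_URL = "https://web-scraping.dev/api/testimonials"

def crawl_endless_via_api(max_pages=100):
    results = []
    for page in range(1, max_pages + 1):
        response = requests.get(API_URL, params={"page": page}, headers={"User-Agent": "Mozilla/5.0"})
        if response.status_code != 200:
            break
        data = response.json()  # assumed to return a list of testimonial objects
        if not data:
            break  # an empty page signals the end of the list
        results.extend(data)
    return results

testimonials = crawl_endless_via_api()
print(f"fetched {len(testimonials)} testimonials without launching a browser")
```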
Now that we’ve covered dynamic content loading, let’s explore how to extract structured data from article-based lists, which present their own unique challenges.
## List Article Crawling
Articles featuring lists (like “Top 10 Programming Languages” or “5 Best Travel Destinations”) represent another valuable source of structured data. These lists are typically embedded within article content, organized under headings or with numbered sections.
### Example Crawler
For this example, let's scrape Scrapfly's own top-10 listicle article using requests and BeautifulSoup:
```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://scrapfly.io/blog/top-10-web-scraping-libraries-in-python/")

# Check if the request was successful
if response.status_code != 200:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
    libraries = []
else:
    # Parse the HTML content with BeautifulSoup
    # Using 'lxml' parser for better performance and more robust parsing
    soup = BeautifulSoup(response.text, 'lxml')

    # Find all h2 headings which represent the list items
    headings = soup.find_all('h2')

    libraries = []
    for heading in headings:
        # Get the heading text (library name)
        title = heading.text.strip()

        # Skip the "Summary" section
        if title.lower() == "summary":
            continue

        # Get the next paragraph for a brief description
        # In BeautifulSoup, we use .find_next() to get the next element
        next_paragraph = heading.find_next('p')
        description = next_paragraph.text.strip() if next_paragraph else ''

        libraries.append({
            "name": title,
            "description": description
        })

# Print the results
print("Top Web Scraping Libraries in Python:")
for i, lib in enumerate(libraries, 1):
    print(f"{i}. {lib['name']}")
    print(f"   {lib['description'][:100]}...")  # Print first 100 chars of description
```
Example Output
```
Top Web Scraping Libraries in Python:
1. HTTPX
   HTTPX is by far the most complete and modern HTTP client package for Python. It is inspired by the p...
2. Parsel and LXML
   LXML is a fast and feature-rich HTML/XML parser for Python. It is a wrapper around the C library lib...
3. BeautifulSoup
   Beautifulsoup (aka bs4) is another HTML parser library in Python. Though it's much more than that...
4. JMESPath and JSONPath
   JMESPath and JSONPath are two libraries that allow you to query JSON data using a query language sim...
5. Playwright and Selenium
   Headless browsers are becoming very popular in web scraping as a way to deal with dynamic javascript...
6. Cerberus and Pydantic
   An often overlooked process of web scraping is the data quality assurance step. Web scraping is a un...
7. Scrapfly Python SDK
   ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale....
8. Related Posts
   Learn how to efficiently find all URLs on a domain using Python and web crawling. Guide on how to cr...
```
In this example, we used the requests library to make an HTTP GET request to a blog post about the top web scraping libraries in Python. We then used BeautifulSoup to parse the HTML content of the page and extract the list of libraries and their descriptions. Finally, we printed the results to the console.
### Crawling Challenges
Extracting data from list articles requires understanding the content structure and accounting for variations in formatting. Some articles may use numbering in headings, while others rely solely on heading hierarchy. A robust crawler should handle these variations and clean the extracted text to remove extraneous content.
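For example, a small normalization step can strip leading numbering such as "1." or "3)" from headings so that numbered and unnumbered listicles produce the same output. A minimal sketch of that cleanup:

```python
import re

def clean_heading(text):
    """Strip leading list numbering like '1.', '2)' or '10 -' from a heading."""
    return re.sub(r"^\s*\d+\s*[\.\)\-:]?\s*", "", text).strip()

print(clean_heading("3. BeautifulSoup"))  # -> "BeautifulSoup"
print(clean_heading("BeautifulSoup"))     # -> unchanged
```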
There are some tools that can assist you with listicle scraping:
- newspaper4k (previously newspaper3k) implements article parsing from HTML and various helper functions that can help identify lists.
- goose3 is another library that can extract structured data from articles, including lists.
- trafilatura is another powerful HTML parser with a lot of prebuilt functions for extracting structured data from articles.
- parsel extracts data using powerful XPath selectors, allowing for very flexible and reliable extraction (see the short example after this list).
- LLMs with RAG can be an easy way to extract data from list articles.
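As a quick taste of the parsel option mentioned above, the sketch below extracts the same heading/description pairs with XPath. It assumes the same h2-followed-by-paragraph layout as the BeautifulSoup example:

```python
import requests
from parsel import Selector

html = requests.get("https://scrapfly.io/blog/top-10-web-scraping-libraries-in-python/").text
selector = Selector(text=html)

entries = []
for heading in selector.xpath("//h2"):
    title = heading.xpath("normalize-space(.)").get()
    if not title or title.lower() == "summary":
        continue
    # the first paragraph following the heading serves as its description
    description = heading.xpath("normalize-space(following::p[1])").get(default="")
    entries.append({"name": title, "description": description})

for entry in entries[:3]:
    print(entry["name"], "-", entry["description"][:80])
```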
Let’s see tabular data next, which presents yet another structure for list information.
## Table List Crawling
Tables represent another common format for presenting list data on the web. Whether implemented as HTML `<table>` elements or styled as tables using CSS grids or other layout techniques, they provide a structured way to display related data in rows and columns.
### Example Crawler
For this example, let's look at the table data section on the web-scraping.dev/product/1 page:
table list on web-scraping.dev/product/1
Here's how to extract data from HTML tables using the BeautifulSoup HTML parsing library:
```python
from bs4 import BeautifulSoup
import requests

response = requests.get("https://web-scraping.dev/product/1")
html = response.text
soup = BeautifulSoup(html, "lxml")

# First, select the desired table element (the 2nd one on the page)
table = soup.find_all('table', {'class': 'table-product'})[1]

headers = []
rows = []
for i, row in enumerate(table.find_all('tr')):
    if i == 0:
        headers = [el.text.strip() for el in row.find_all('th')]
    else:
        rows.append([el.text.strip() for el in row.find_all('td')])

print(headers)
print(rows)
```
Example Output
```
['Version', 'Package Weight', 'Package Dimension', 'Variants', 'Delivery Type']
[['Pack 1', '1,00 kg', '100x230 cm', '6 available', '1 Day shipping'], ['Pack 2', '2,11 kg', '200x460 cm', '6 available', '1 Day shipping'], ['Pack 3', '3,22 kg', '300x690 cm', '6 available', '1 Day shipping'], ['Pack 4', '4,33 kg', '400x920 cm', '6 available', '1 Day shipping'], ['Pack 5', '5,44 kg', '500x1150 cm', '6 available', '1 Day shipping']]
```
In the above code, we're selecting the target table and parsing it, extracting both the header cells and the data rows. This approach gives you structured data that preserves the relationships between columns and rows.
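When the data sits in plain `<table>` markup, pandas offers a convenient shortcut: `pandas.read_html` parses every table on the page into DataFrames. A minimal sketch, assuming pandas and lxml are installed:

```python
from io import StringIO

import pandas as pd
import requests

html = requests.get("https://web-scraping.dev/product/1").text
# read_html returns a list of DataFrames, one per <table> found in the HTML
tables = pd.read_html(StringIO(html))
print(f"found {len(tables)} tables")
print(tables[-1])  # inspect the last one; pick whichever index matches your target table
```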
### Crawling Challenges
When crawling tables, it's important to look beyond the obvious `<table>` elements. Many modern websites implement table-like layouts using CSS grid, flexbox, or other techniques. Identifying these structures requires careful inspection of the DOM and adapting your selectors accordingly.
Standard table structures are easy to handle with BeautifulSoup, CSS selectors, or XPath-powered extraction, though for more generic solutions you can turn to LLMs and AI. One commonly used technique is to have an LLM convert the HTML to Markdown, which can often produce accurate tables even from flexible HTML table structures.
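If you want a deterministic pre-processing step before (or instead of) an LLM, the markdownify library can perform the HTML-to-Markdown conversion directly. A minimal sketch, assuming `pip install markdownify`; note this is a plain library conversion, not the LLM-based approach described above:

```python
import requests
from markdownify import markdownify as md

html = requests.get("https://web-scraping.dev/product/1").text
# convert the page to Markdown; <table> elements become pipe-delimited Markdown tables
markdown = md(html)
# feed the Markdown (or just the relevant slice) to an LLM for extraction,
# or parse the pipe-delimited rows yourself
print(markdown[:500])
```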
Now, let’s explore how to crawl search engine results pages for list-type content.
## SERP List Crawling
Search Engine Results Pages (SERPs) offer a treasure trove of list-based content, presenting curated links to pages relevant to specific keywords. Crawling SERPs can help you discover list articles and other structured content across the web.
### Example Crawler
Here’s a basic approach to crawling Google search results:
**Python**

```python
import requests
from bs4 import BeautifulSoup
import urllib.parse

def crawl_google_serp(query, num_results=10):
    # Format the query for URL
    encoded_query = urllib.parse.quote(query)
    # Create Google search URL
    url = f"https://www.google.com/search?q={encoded_query}&num={num_results}"
    # Add headers to mimic a browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9"
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract search results
    results = []
    # Target the organic search results
    for result in soup.select("div.g"):
        title_element = result.select_one("h3")
        if title_element:
            title = title_element.text
            # Extract URL
            link_element = result.select_one("a")
            link = link_element.get("href") if link_element else None
            # Extract snippet
            snippet_element = result.select_one("div.VwiC3b")
            snippet = snippet_element.text if snippet_element else None
            results.append({
                "title": title,
                "url": link,
                "snippet": snippet
            })
    return results
```

**ScrapFly AI**

```python
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR-SCRAPFLY-KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="https://www.google.com/search?q=python",
    # select country to get localized results
    country="us",
    # enable cloud browsers
    render_js=True,
    # scroll to the bottom of the page
    auto_scroll=True,
    # use AI to extract data
    extraction_model="search_engine_results",
))
print(result.content)
```
In the above code, we’re constructing a Google search query URL, sending an HTTP request with browser-like headers, and then parsing the HTML to extract organic search results. Each result includes the title, URL, and snippet text, which can help you identify list-type content for further crawling.
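A quick usage sketch for the Python function above. Keep in mind that Google's markup changes frequently, so the `div.g` and `div.VwiC3b` selectors may need adjusting, and heavy use will quickly trigger blocking:

```python
results = crawl_google_serp("best python web scraping libraries", num_results=10)
for item in results[:5]:
    print(item["title"], "->", item["url"])
```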
### Crawling Challenges
It's worth noting that directly crawling search engines can be challenging due to very strong anti-bot measures. For production applications, you may need more sophisticated techniques to avoid blocks; see our blocking bypass introduction tutorial for an overview.
Scrapfly can easily bypass all SERP blocking measures and return AI extracted data for any SERP page using AI Web Scraping API.
To wrap up – let’s move on to some frequently asked questions about list crawling.
## FAQ
Below are quick answers to common questions about list crawling techniques and best practices:
**What is the difference between list crawling and general web scraping?**
List crawling focuses on extracting structured data from lists, such as paginated content, infinite scrolls, and tables. General web scraping targets various elements across different pages, while list crawling requires specific techniques for handling pagination, scroll events, and nested structures.
**How do I handle rate limiting when crawling large lists?**
Use adaptive delays (1-3 seconds) and increase them if you get 429 errors. Implement exponential backoff for failed requests and rotate proxies to distribute traffic. A request queuing system helps maintain a steady and sustainable request rate.
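A minimal sketch of that backoff strategy (the delay values are illustrative and should be tuned per site):

```python
import random
import time

import requests

def fetch_with_backoff(url, max_retries=5):
    """GET a URL, backing off exponentially on 429 or transient server errors."""
    delay = 1.0  # starting delay in seconds
    for attempt in range(max_retries):
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        if response.status_code == 429 or response.status_code >= 500:
            time.sleep(delay + random.uniform(0, 1))  # jitter avoids synchronized retries
            delay *= 2
            continue
        return response
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts")
```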
**How can I extract structured data from deeply nested lists?**
Identify nesting patterns using developer tools. Use a recursive function to process items and their children while preserving relationships. CSS selectors, XPath, and depth-first traversal help extract data while maintaining hierarchy.
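A minimal recursive sketch with BeautifulSoup, assuming a conventional nested `<ul>`/`<li>` structure:

```python
from bs4 import BeautifulSoup

def parse_nested_list(ul):
    """Recursively convert a <ul> element into a list of {text, children} dicts."""
    items = []
    for li in ul.find_all("li", recursive=False):
        child_ul = li.find("ul")
        children = parse_nested_list(child_ul) if child_ul else []
        if child_ul:
            child_ul.extract()  # remove the sub-list so it doesn't leak into this item's text
        items.append({"text": li.get_text(" ", strip=True), "children": children})
    return items

html = """
<ul>
  <li>Fruit
    <ul><li>Apple</li><li>Pear</li></ul>
  </li>
  <li>Vegetables</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
print(parse_nested_list(soup.find("ul")))
```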
## Summary
List crawling is essential for extracting structured data from the web’s many list formats. From product catalogs and social feeds to nested articles and data tables, each list type requires a tailored approach.
This guide has covered:
- Setting up basic crawlers with Python libraries like BeautifulSoup and requests
- Handling paginated lists that split content across multiple pages
- Tackling endless scroll lists with headless browsers
- Extracting structured data from article-based lists
- Processing tabular data for row-column relationships
- Crawling search engine results to discover more list content
The techniques demonstrated here, from HTTP requests for static content to browser automation for dynamic pages, provide powerful tools for transforming unstructured web data into valuable, actionable insights.