List crawling is a specialized form of web scraping that focuses on extracting collections of similar items from websites. Whether you’re gathering product catalogs, monitoring pricing across e-commerce platforms, or building a database of ranked content, list crawling provides the foundation for efficient and organized data collection.
In this article, we will explore practical techniques for crawling different types of web lists, from product catalogs and infinite scrolling pages to articles, tables, and search results.
## What is List Crawling?
List crawling refers to the automated process of extracting collections of similar items from web pages.
Unlike general web scraping that might target diverse information from a page, list crawling specifically focuses on groups of structured data that follow consistent patterns such as product listings, search results, rankings, or tabular data.
## Crawler Setup
Setting up a basic list crawler requires a few essential components. Python, with its rich ecosystem of libraries, offers an excellent foundation for building effective crawlers.
For our list crawling examples we’ll use Python with the following libraries:
- requests – as our HTTP client for retrieving pages.
- BeautifulSoup – for parsing HTML data using CSS selectors.
- Playwright – for automating a real web browser for crawling tasks.
All of these can be installed using this pip command:
```
$ pip install beautifulsoup4 requests playwright
```
Once you have these libraries installed, here's a simple example list crawler that scrapes a single item listing page:
```python
import requests
from bs4 import BeautifulSoup

def crawl_static_list(url):
    # Send HTTP request to the target URL
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    # Parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")
    # Find all product items
    items = soup.select("div.row.product")
    # Extract data from each item
    results = []
    for item in items:
        title = item.select_one("h3.mb-0 a").text.strip()
        price = item.select_one("div.price").text.strip()
        results.append({"title": title, "price": price})
    return results

url = "https://web-scraping.dev/products"
data = crawl_static_list(url)
print(f"Found {len(data)} items")
for item in data[:3]:  # Print first 3 items as example
    print(f"Title: {item['title']}, Price: {item['price']}")
```
Example Output
```
Found 5 items
Title: Box of Chocolate Candy, Price: 24.99
Title: Dark Red Energy Potion, Price: 4.99
Title: Teal Energy Potion, Price: 4.99
```
In the above code, we’re making an HTTP request to a target URL, parsing the HTML content using BeautifulSoup, and then extracting specific data points from each list item.
This approach works well for simple, static lists where all content is loaded immediately. For more complex scenarios like paginated or dynamically loaded lists, you’ll need to extend this foundation with additional techniques we’ll cover in subsequent sections.
Your crawler’s effectiveness largely depends on how well you understand the structure of the target website. Taking time to inspect the HTML using browser developer tools will help you craft precise selectors that accurately target the desired elements.
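Since not every item exposes every field, it also helps to wrap selector calls in a small helper that tolerates missing elements instead of raising an `AttributeError`. Below is a minimal sketch building on the crawler above; the `safe_text` helper is our own illustrative addition, not part of BeautifulSoup:

```python
import requests
from bs4 import BeautifulSoup

def safe_text(parent, selector, default=None):
    """Return stripped text for the first CSS selector match, or a default if missing."""
    element = parent.select_one(selector)
    return element.text.strip() if element else default

def crawl_static_list_safe(url):
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.text, "html.parser")
    results = []
    for item in soup.select("div.row.product"):
        results.append({
            "title": safe_text(item, "h3.mb-0 a"),
            "price": safe_text(item, "div.price"),
        })
    return results

print(crawl_static_list_safe("https://web-scraping.dev/products")[:3])
```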
Let's now see how we can enhance our basic crawler with more advanced capabilities and different list crawling scenarios.
## Power-Up with Scrapfly
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
- Anti-bot protection bypass – scrape web pages without blocking!
- Rotating residential proxies – prevent IP address and geographic blocks.
- JavaScript rendering – scrape dynamic web pages through cloud browsers.
- Full browser automation – control browsers to scroll, input and click on objects.
- Format conversion – scrape as HTML, JSON, Text, or Markdown.
- Python and TypeScript SDKs, as well as Scrapy and no-code tool integrations.
Here’s an example of how to scrape a product with the Scrapfly web scraping API:
```python
from scrapfly import ScrapflyClient, ScrapeConfig

# Create a ScrapflyClient instance
client = ScrapflyClient(key='YOUR-SCRAPFLY-KEY')

# Create scrape requests
api_result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/product/1",
    # optional: set country to get localized results
    country="us",
    # optional: use cloud browsers
    render_js=True,
    # optional: scroll to the bottom of the page
    auto_scroll=True,
))
print(api_result.result["context"])  # metadata
print(api_result.result["config"])  # request data
print(api_result.scrape_result["content"])  # result html content

# parse data yourself
product = {
    "title": api_result.selector.css("h3.product-title::text").get(),
    "price": api_result.selector.css(".product-price::text").get(),
    "description": api_result.selector.css(".product-description::text").get(),
}
print(product)

# or let AI parser extract it for you!
api_result = client.scrape(ScrapeConfig(
    url="https://web-scraping.dev/product/1",
    # use AI models to find ALL product data available on the page
    extraction_model="product"
))
```
Example Output
{"title": "Box of Chocolate Candy","price": "$9.99 ","description": "Indulge your sweet tooth with our Box of Chocolate Candy. Each box contains an assortment of rich, flavorful chocolates with a smooth, creamy filling. Choose from a variety of flavors including zesty orange and sweet cherry. Whether you're looking for the perfect gift or just want to treat yourself, our Box of Chocolate Candy is sure to satisfy.",}{ "title": "Box of Chocolate Candy", "price": "$9.99 ", "description": "Indulge your sweet tooth with our Box of Chocolate Candy. Each box contains an assortment of rich, flavorful chocolates with a smooth, creamy filling. Choose from a variety of flavors including zesty orange and sweet cherry. Whether you're looking for the perfect gift or just want to treat yourself, our Box of Chocolate Candy is sure to satisfy.", }{ "title": "Box of Chocolate Candy", "price": "$9.99 ", "description": "Indulge your sweet tooth with our Box of Chocolate Candy. Each box contains an assortment of rich, flavorful chocolates with a smooth, creamy filling. Choose from a variety of flavors including zesty orange and sweet cherry. Whether you're looking for the perfect gift or just want to treat yourself, our Box of Chocolate Candy is sure to satisfy.", }
## Paginated List Crawling
Paginated lists split the data across multiple pages with numbered navigation. This technique is common in e-commerce, search results, and data directories.
One example of paginated pages is web-scraping.dev/products which splits products through several pages.
paginated list on web-scraping.dev/products
### Example Crawler
Here’s how to build a product list crawler that handles traditional pagination:
```python
import requests
from bs4 import BeautifulSoup

# Get first page and extract pagination URLs
url = "https://web-scraping.dev/products"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
other_page_urls = set(a.attrs["href"] for a in soup.select(".paging>a") if a.attrs.get("href"))

# Extract product titles from first page
all_product_titles = [a.text.strip() for a in soup.select(".product h3 a")]

# Extract product titles from other pages
for url in other_page_urls:
    page_soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    all_product_titles.extend(a.text.strip() for a in page_soup.select(".product h3 a"))

# Print results
print(f"Total products found: {len(all_product_titles)}")
print("\nProduct Titles:")
for i, title in enumerate(all_product_titles, 1):
    print(f"{i}. {title}")
```
Example Output
```
Total products found: 30

Product Titles:
Box of Chocolate Candy
Dark Red Energy Potion
Teal Energy Potion
Red Energy Potion
Blue Energy Potion
Box of Chocolate Candy
Dark Red Energy Potion
Teal Energy Potion
Red Energy Potion
Blue Energy Potion
Dragon Energy Potion
Hiking Boots for Outdoor Adventures
Women's High Heel Sandals
Running Shoes for Men
Kids' Light-Up Sneakers
Classic Leather Sneakers
Cat-Ear Beanie
Box of Chocolate Candy
Dark Red Energy Potion
Teal Energy Potion
Red Energy Potion
Blue Energy Potion
Dragon Energy Potion
Hiking Boots for Outdoor Adventures
Women's High Heel Sandals
Running Shoes for Men
Kids' Light-Up Sneakers
Classic Leather Sneakers
Cat-Ear Beanie
Box of Chocolate Candy
```
In the above code, we first get the first page and extract pagination URLs. Then, we extract product titles from the first page and other pages. Finally, we print the total number of products found and the product titles.
### Crawling Challenges
While crawling product lists, you’ll encounter several challenges:
- **Pagination Variations**: Some sites use parameters like `?page=2` while others might use path segments like `/page/2/` or even completely different URL structures.
- **Paging Limiting**: Many sites restrict the maximum number of viewable pages (typically 20-50), even with thousands of products. Overcome this by using filters like price ranges to access the complete dataset, as demonstrated in our paging limit bypass tutorial and sketched in the example below.
- **Changing Layouts**: Product list layouts may vary across different categories or during site updates.
- **Missing Data**: Not all products will have complete information, requiring robust error handling.
Effective product list crawling requires adapting to these challenges with techniques like request throttling, robust selectors, and comprehensive error handling.
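As a concrete illustration of the paging-limit workaround mentioned above, the sketch below walks a set of price-range filters so that each filtered slice stays under the page cap. The `min_price`/`max_price` query parameters and the price bands are assumptions for illustration only; adapt them to whatever filters your target site actually exposes.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://web-scraping.dev/products"  # example target
# Hypothetical price bands; tune these to the target site's data distribution
PRICE_BANDS = [(0, 10), (10, 25), (25, 50), (50, 1000)]

def crawl_band(min_price, max_price, max_pages=20):
    """Crawl every page of a single price-filtered slice of the catalog."""
    titles = []
    for page in range(1, max_pages + 1):
        # 'min_price', 'max_price' and 'page' are assumed query parameters
        params = {"min_price": min_price, "max_price": max_price, "page": page}
        response = requests.get(BASE_URL, params=params, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(response.text, "html.parser")
        page_titles = [a.text.strip() for a in soup.select(".product h3 a")]
        if not page_titles:  # an empty page means we ran past the last page
            break
        titles.extend(page_titles)
    return titles

all_titles = []
for low, high in PRICE_BANDS:
    all_titles.extend(crawl_band(low, high))
print(f"Collected {len(all_titles)} products across {len(PRICE_BANDS)} price bands")
```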
Let’s now explore how to handle more dynamic lists that load content as you scroll.
## Endless List Crawling
Modern websites often implement infinite scrolling—a technique that continuously loads new content as the user scrolls down the page.
These “endless” lists present unique challenges for crawlers since the content isn’t divided into distinct pages but is loaded dynamically via JavaScript.
One example of infinite data lists is the web-scraping.dev/testimonials page:
endless list on web-scraping.dev/testimonials
Let’s see how we can crawl it next.
### Example Crawler
To tackle endless lists, the easiest method is to use a headless browser that can execute JavaScript and simulate scrolling. Here's an example using Playwright and Python:
```python
# This example is using Playwright but it's also possible to use Selenium with similar approach
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://web-scraping.dev/testimonials/")

    # scroll to the bottom:
    _prev_height = -1
    _max_scrolls = 100
    _scroll_count = 0
    while _scroll_count < _max_scrolls:
        # Execute JavaScript to scroll to the bottom of the page
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Wait for new content to load (change this value as needed)
        page.wait_for_timeout(1000)
        # Check whether the scroll height changed - means more pages are there
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == _prev_height:
            break
        _prev_height = new_height
        _scroll_count += 1

    # now we can collect all loaded data:
    results = []
    for element in page.locator(".testimonial").element_handles():
        text = element.query_selector(".text").inner_html()
        results.append(text)
    print(f"scraped {len(results)} results")
```
Example Output
```
scraped 60 results
```
In the above code, we are using Playwright to control a browser and scroll to the bottom of the page to load all the testimonials. We are then collecting the text of each testimonial and printing the number of testimonials scraped. This approach effectively handles endless lists that load content dynamically.
### Crawling Challenges
Endless list crawling comes with its own set of challenges:
- **Speed**: Browser crawling is much slower than API-based approaches. When possible, reverse engineer the site's API endpoints for direct data fetching, which is often thousands of times faster, as shown in our reverse engineering of endless paging guide (a minimal API-based sketch follows this list).
- **Resource Intensity**: Running a headless browser consumes significantly more resources than simple HTTP requests.
- **Element Staleness**: As the page updates, previously found elements may become “stale” and unusable, requiring refetching.
- **Scroll Triggers**: Some sites use scroll-percentage triggers rather than scrolling to the bottom, requiring more nuanced scroll simulation.
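To illustrate the speed point, here is a minimal sketch of collecting an endless list through a paging JSON endpoint instead of driving a browser. The `/api/testimonials` endpoint and its response shape are assumptions for illustration; in practice you would discover the real endpoint and parameters in your browser's network tab.

```python
import requests

# Hypothetical JSON endpoint, normally discovered via the browser's network tab
API_URL = "https://web-scraping.dev/api/testimonials"

def crawl_endless_via_api(max_pages=100):
    results = []
    for page in range(1, max_pages + 1):
        response = requests.get(API_URL, params={"page": page}, headers={"User-Agent": "Mozilla/5.0"})
        if response.status_code != 200:
            break
        data = response.json()  # assumed to return a list of testimonial objects
        if not data:
            break  # an empty page signals the end of the list
        results.extend(data)
    return results

testimonials = crawl_endless_via_api()
print(f"fetched {len(testimonials)} testimonials without launching a browser")
```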
Now that we’ve covered dynamic content loading, let’s explore how to extract structured data from article-based lists, which present their own unique challenges.
## List Article Crawling
Articles featuring lists (like “Top 10 Programming Languages” or “5 Best Travel Destinations”) represent another valuable source of structured data. These lists are typically embedded within article content, organized under headings or with numbered sections.
### Example Crawler
For this example, let's scrape Scrapfly's own top-10 listicle article using requests and BeautifulSoup:
```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://scrapfly.io/blog/top-10-web-scraping-libraries-in-python/")

# Check if the request was successful
if response.status_code != 200:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
    libraries = []
else:
    # Parse the HTML content with BeautifulSoup
    # Using 'lxml' parser for better performance and more robust parsing
    soup = BeautifulSoup(response.text, 'lxml')

    # Find all h2 headings which represent the list items
    headings = soup.find_all('h2')

    libraries = []
    for heading in headings:
        # Get the heading text (library name)
        title = heading.text.strip()

        # Skip the "Summary" section
        if title.lower() == "summary":
            continue

        # Get the next paragraph for a brief description
        # In BeautifulSoup, we use .find_next() to get the next element
        next_paragraph = heading.find_next('p')
        description = next_paragraph.text.strip() if next_paragraph else ''

        libraries.append({
            "name": title,
            "description": description
        })

# Print the results
print("Top Web Scraping Libraries in Python:")
for i, lib in enumerate(libraries, 1):
    print(f"{i}. {lib['name']}")
    print(f"   {lib['description'][:100]}...")  # Print first 100 chars of description
```
Example Output
```
Top Web Scraping Libraries in Python:
1. HTTPX
   HTTPX is by far the most complete and modern HTTP client package for Python. It is inspired by the p...
2. Parsel and LXML
   LXML is a fast and feature-rich HTML/XML parser for Python. It is a wrapper around the C library lib...
3. BeautifulSoup
   Beautifulsoup (aka bs4) is another HTML parser library in Python. Though it's much more than that...
4. JMESPath and JSONPath
   JMESPath and JSONPath are two libraries that allow you to query JSON data using a query language sim...
5. Playwright and Selenium
   Headless browsers are becoming very popular in web scraping as a way to deal with dynamic javascript...
6. Cerberus and Pydantic
   An often overlooked process of web scraping is the data quality assurance step. Web scraping is a un...
7. Scrapfly Python SDK
   ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale....
8. Related Posts
   Learn how to efficiently find all URLs on a domain using Python and web crawling. Guide on how to cr...
```
In this example, we used the requests library to make an HTTP GET request to a blog post about the top web scraping libraries in Python. We then used BeautifulSoup to parse the HTML content of the page and extract the list of libraries and their descriptions. Finally, we printed the results to the console.
### Crawling Challenges
Extracting data from list articles requires understanding the content structure and accounting for variations in formatting. Some articles may use numbering in headings, while others rely solely on heading hierarchy. A robust crawler should handle these variations and clean the extracted text to remove extraneous content.
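For example, a small normalization step can strip leading numbering such as "1." or "3)" from headings so that numbered and unnumbered listicles produce the same output. A minimal sketch of that cleanup:

```python
import re

def clean_heading(text):
    """Strip leading list numbering like '1.', '2)' or '10 -' from a heading."""
    return re.sub(r"^\s*\d+\s*[\.\)\-:]?\s*", "", text).strip()

print(clean_heading("3. BeautifulSoup"))  # -> "BeautifulSoup"
print(clean_heading("BeautifulSoup"))     # -> unchanged
```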
There are some tools that can assist you with listicle scraping:
- newspaper4k (previously newspaper3k) implements article parsing from HTML and various helper functions that can help identify lists.
- goose3 is another library that can extract structured data from articles, including lists.
- trafilatura is another powerful HTML parser with a lot of prebuilt functions for extracting structured data from articles.
- parsel extracts data using powerful XPath selectors, allowing for very flexible and reliable extraction (see the short example after this list).
- LLMs with RAG can be an easy way to extract data from list articles.
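As a quick taste of the parsel option mentioned above, the sketch below extracts the same heading/description pairs with XPath. It assumes the same h2-followed-by-paragraph layout as the BeautifulSoup example:

```python
import requests
from parsel import Selector

html = requests.get("https://scrapfly.io/blog/top-10-web-scraping-libraries-in-python/").text
selector = Selector(text=html)

entries = []
for heading in selector.xpath("//h2"):
    title = heading.xpath("normalize-space(.)").get()
    if not title or title.lower() == "summary":
        continue
    # the first paragraph following the heading serves as its description
    description = heading.xpath("normalize-space(following::p[1])").get(default="")
    entries.append({"name": title, "description": description})

for entry in entries[:3]:
    print(entry["name"], "-", entry["description"][:80])
```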
Let’s see tabular data next, which presents yet another structure for list information.
## Table List Crawling
Tables represent another common format for presenting list data on the web. Whether implemented as HTML `<table>` elements or styled as tables using CSS grids or other layout techniques, they provide a structured way to display related data in rows and columns.
### Example Crawler
For this example, let's look at the table data section on the web-scraping.dev/product/1 page:
table list on web-scraping.dev/product/1
Here's how to extract data from HTML tables using the BeautifulSoup HTML parsing library:
```python
from bs4 import BeautifulSoup
import requests

response = requests.get("https://web-scraping.dev/product/1")
html = response.text
soup = BeautifulSoup(html, "lxml")

# First, select the desired table element (the 2nd one on the page)
table = soup.find_all('table', {'class': 'table-product'})[1]

headers = []
rows = []
for i, row in enumerate(table.find_all('tr')):
    if i == 0:
        headers = [el.text.strip() for el in row.find_all('th')]
    else:
        rows.append([el.text.strip() for el in row.find_all('td')])

print(headers)
print(rows)
```
Example Output
```
['Version', 'Package Weight', 'Package Dimension', 'Variants', 'Delivery Type']
[['Pack 1', '1,00 kg', '100x230 cm', '6 available', '1 Day shipping'], ['Pack 2', '2,11 kg', '200x460 cm', '6 available', '1 Day shipping'], ['Pack 3', '3,22 kg', '300x690 cm', '6 available', '1 Day shipping'], ['Pack 4', '4,33 kg', '400x920 cm', '6 available', '1 Day shipping'], ['Pack 5', '5,44 kg', '500x1150 cm', '6 available', '1 Day shipping']]
```
In the above code, we're selecting the target table and parsing it, extracting both the header cells and the data rows. This approach gives you structured data that preserves the relationships between columns and rows.
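When the data sits in plain `<table>` markup, pandas offers a convenient shortcut: `pandas.read_html` parses every table on the page into DataFrames. A minimal sketch, assuming pandas and lxml are installed:

```python
from io import StringIO

import pandas as pd
import requests

html = requests.get("https://web-scraping.dev/product/1").text
# read_html returns a list of DataFrames, one per <table> found in the HTML
tables = pd.read_html(StringIO(html))
print(f"found {len(tables)} tables")
print(tables[-1])  # inspect the last one; pick whichever index matches your target table
```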
### Crawling Challenges
When crawling tables, it's important to look beyond the obvious `<table>` elements. Many modern websites implement table-like layouts using CSS grid, flexbox, or other techniques. Identifying these structures requires careful inspection of the DOM and adapting your selectors accordingly.
Standard table structures are easy to handle with BeautifulSoup, CSS selectors, or XPath-powered extraction, though for more generic solutions you can turn to LLMs and AI. One commonly used technique is to have an LLM convert the HTML to Markdown, which can often produce accurate tables even from flexible HTML table structures.
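If you want a deterministic pre-processing step before (or instead of) an LLM, the markdownify library can perform the HTML-to-Markdown conversion directly. A minimal sketch, assuming `pip install markdownify`; note this is a plain library conversion, not the LLM-based approach described above:

```python
import requests
from markdownify import markdownify as md

html = requests.get("https://web-scraping.dev/product/1").text
# convert the page to Markdown; <table> elements become pipe-delimited Markdown tables
markdown = md(html)
# feed the Markdown (or just the relevant slice) to an LLM for extraction,
# or parse the pipe-delimited rows yourself
print(markdown[:500])
```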
Now, let’s explore how to crawl search engine results pages for list-type content.
## SERP List Crawling
Search Engine Results Pages (SERPs) offer a treasure trove of list-based content, presenting curated links to pages relevant to specific keywords. Crawling SERPs can help you discover list articles and other structured content across the web.
### Example Crawler
Here’s a basic approach to crawling Google search results:
**Python**

```python
import requests
from bs4 import BeautifulSoup
import urllib.parse

def crawl_google_serp(query, num_results=10):
    # Format the query for URL
    encoded_query = urllib.parse.quote(query)
    # Create Google search URL
    url = f"https://www.google.com/search?q={encoded_query}&num={num_results}"
    # Add headers to mimic a browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9"
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract search results
    results = []
    # Target the organic search results
    for result in soup.select("div.g"):
        title_element = result.select_one("h3")
        if title_element:
            title = title_element.text
            # Extract URL
            link_element = result.select_one("a")
            link = link_element.get("href") if link_element else None
            # Extract snippet
            snippet_element = result.select_one("div.VwiC3b")
            snippet = snippet_element.text if snippet_element else None
            results.append({
                "title": title,
                "url": link,
                "snippet": snippet
            })
    return results
```

**ScrapFly AI**

```python
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR-SCRAPFLY-KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="https://www.google.com/search?q=python",
    # select country to get localized results
    country="us",
    # enable cloud browsers
    render_js=True,
    # scroll to the bottom of the page
    auto_scroll=True,
    # use AI to extract data
    extraction_model="search_engine_results",
))
print(result.content)
```
In the above code, we’re constructing a Google search query URL, sending an HTTP request with browser-like headers, and then parsing the HTML to extract organic search results. Each result includes the title, URL, and snippet text, which can help you identify list-type content for further crawling.
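A quick usage sketch for the Python function above. Keep in mind that Google's markup changes frequently, so the `div.g` and `div.VwiC3b` selectors may need adjusting, and heavy use will quickly trigger blocking:

```python
results = crawl_google_serp("best python web scraping libraries", num_results=10)
for item in results[:5]:
    print(item["title"], "->", item["url"])
```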
### Crawling Challenges
It's worth noting that directly crawling search engines can be challenging due to very strong anti-bot measures. For production applications, you may need more sophisticated techniques to avoid blocks; see our blocking bypass introduction tutorial for an overview.
Scrapfly can easily bypass all SERP blocking measures and return AI extracted data for any SERP page using AI Web Scraping API.
To wrap up – let’s move on to some frequently asked questions about list crawling.
## FAQ
Below are quick answers to common questions about list crawling techniques and best practices:
**What is the difference between list crawling and general web scraping?**
List crawling focuses on extracting structured data from lists, such as paginated content, infinite scrolls, and tables. General web scraping targets various elements across different pages, while list crawling requires specific techniques for handling pagination, scroll events, and nested structures.
**How do I handle rate limiting when crawling large lists?**
Use adaptive delays (1-3 seconds) and increase them if you get 429 errors. Implement exponential backoff for failed requests and rotate proxies to distribute traffic. A request queuing system helps maintain a steady and sustainable request rate.
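A minimal sketch of that backoff strategy (the delay values are illustrative and should be tuned per site):

```python
import random
import time

import requests

def fetch_with_backoff(url, max_retries=5):
    """GET a URL, backing off exponentially on 429 or transient server errors."""
    delay = 1.0  # starting delay in seconds
    for attempt in range(max_retries):
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        if response.status_code == 429 or response.status_code >= 500:
            time.sleep(delay + random.uniform(0, 1))  # jitter avoids synchronized retries
            delay *= 2
            continue
        return response
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts")
```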
**How can I extract structured data from deeply nested lists?**
Identify nesting patterns using developer tools. Use a recursive function to process items and their children while preserving relationships. CSS selectors, XPath, and depth-first traversal help extract data while maintaining hierarchy.
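A minimal recursive sketch with BeautifulSoup, assuming a conventional nested `<ul>`/`<li>` structure:

```python
from bs4 import BeautifulSoup

def parse_nested_list(ul):
    """Recursively convert a <ul> element into a list of {text, children} dicts."""
    items = []
    for li in ul.find_all("li", recursive=False):
        child_ul = li.find("ul")
        children = parse_nested_list(child_ul) if child_ul else []
        if child_ul:
            child_ul.extract()  # remove the sub-list so it doesn't leak into this item's text
        items.append({"text": li.get_text(" ", strip=True), "children": children})
    return items

html = """
<ul>
  <li>Fruit
    <ul><li>Apple</li><li>Pear</li></ul>
  </li>
  <li>Vegetables</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
print(parse_nested_list(soup.find("ul")))
```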
## Summary
List crawling is essential for extracting structured data from the web’s many list formats. From product catalogs and social feeds to nested articles and data tables, each list type requires a tailored approach.
This guide has covered:
- Setting up basic crawlers with Python libraries like BeautifulSoup and requests
- Handling paginated lists that split content across multiple pages
- Tackling endless scroll lists with headless browsers
- Extracting structured data from article-based lists
- Processing tabular data for row-column relationships
- Crawling search engine results to discover more list content
The techniques demonstrated here, from HTTP requests for static content to browser automation for dynamic pages, provide powerful tools for transforming unstructured web data into valuable, actionable insights.