How to Scrape Forms

Data on web pages don’t always pre-exist on the HTML. Instead, they often require steps before they are rendered using browser actions. One of such action is forms, which render data behind clicks or entering credentials.

In this article, we’ll explain how to scrape forms using both HTTP clients and headless browsers. Let’s get started!

Scrape Forms With HTTP Clients

Let’s start our guide by scraping forms with HTTP clients. For this, we’ll use two Python community packages:

Since asyncio is pre-installed in Python, we only need to install the other two packages using the following pip command:

pip install httpx parsel

Enter fullscreen mode Exit fullscreen mode

Note that httpx can be replaced with any other HTTP client, such as requests. As for parsel, another alternative is BeautifulSoup.

In this guide, we’ll use the login page on web-scraping.dev as our tutorial’s target web page:

图片[1]-How to Scrape Forms - 拾光赋-拾光赋
Login page on web-scraping.dev

The data we are interested in are accessible by scraping login form:

图片[2]-How to Scrape Forms - 拾光赋-拾光赋
Scraping target page

To scrape data from the above login form example using HTTP requests, we can follow either of two methods:

  • Provide the request with the necessary headers and cookies.
  • Simulate the form POST request responsible for auth values.

Both of the above form scraping methods are similar, except for the way they handle the required request configuration. The first approach is ideal for requests that have static request configuration. However, in most cases, these values expire over time , making the second approach more suitable.

We have previously covered the first method through cookies. In this one, we’ll replicate the form POST request.

To start, let’s observe the background requests responsible for retrieving the target page data:

  • Open the browser developer tools by pressing the F12 key.
  • Add the login form credential credentials and click the login button.

After following the above steps, you will find the responsible background requests captured:

图片[3]-How to Scrape Forms - 拾光赋-拾光赋
Login background requests

From the above image, we can see that there are two requests sent to the same URL:

  1. Form POST request with the login credentials in the payload, which returns the authentication token with the response.
  2. GET request with the retrieved token in the payload to retrieve the HTML data behind the login form.

To scrape the form of our target web page, we’ll replicate the above two requests within our script. The easiest way to replicate them is by copying as cURL or importing them to Postman:

import asyncio
from parsel import Selector
from httpx import AsyncClient

# initializing a async httpx client client = AsyncClient(
    headers = {
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "cookiesAccepted=true",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    }
)

async def retrieve_auth_token(url: str):
    """parse the auth token value from the response headers"""
    # login credentials     body = "username=user123&password=password"
    response = await client.post(url, data=body)
    return response.headers["set-cookie"].split(";")[0]

async def scrape_forms(url: str):
    """scrape login form and parse the data behind it"""
    token = await retrieve_auth_token(url="https://web-scraping.dev/api/login")
    client.headers["cookie"] = f"{token};cookiesAccepted=true"
    response = await client.get(url)
    selector = Selector(response.text)
    # parse the retrieved HTML data     print(f"the secret message is {selector.css('div#secret-message::text').get()}")
    "the secret message is 🤫"

if __name__ == " __main__":
    asyncio.run(scrape_forms(
        url="https://web-scraping.dev/login"
    ))

Enter fullscreen mode Exit fullscreen mode

In the above code, we start by initiating an httpx client with the request headers and use it to send the form POST request for the auth token cookie retrieval. Then, we utilize the obtained auth token while requesting the target web page itself.

Note that the first request for retrieving the auth token can follow redirects, which dispenses the need for manual cookie configuration:

import asyncio
from parsel import Selector
from httpx import AsyncClient

client = AsyncClient(
    headers = {
        # same request headers     },
    follow_redirects=True # enable rediretions )

async def retrieve_auth_token(url: str):
    """parse the auth token value from the response headers"""
    # login credentials     body = "username=user123&password=password"
    response = await client.post(url, data=body)
    selector = Selector(response.text)
    print(f"the secret message is {selector.css('div#secret-message::text').get()}")
    "the secret message is 🤫"

Enter fullscreen mode Exit fullscreen mode

For further details on web scraping hidden APIs, refer to our dedicated guide.

Scrape Forms With Headless Browsers

Another alternative for scraping forms is using headless browsers. We’ll explore automating the regular browser actions using Selenium, Playwright, and Puppeteer.

Selenium

Let’s start with Selenium. Since it requires the web driver binaries to be installed, we’ll install Selenium alongside webdriver-manager using the following pip command:

pip install selenium webdriver-manager

Enter fullscreen mode Exit fullscreen mode

After the webdriver-manager is installed, use it to download the driver binaries automatically for the desired browser:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

Enter fullscreen mode Exit fullscreen mode

To scrape forms with Selenium, we’ll use Selenium’s expected conditions and find elements to navigate the form:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

driver.get("https://web-scraping.dev/login?cookies=")

# define a timeout wait = WebDriverWait(driver, timeout=5)

# accept the cookie policy wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#cookie-ok")))
driver.find_element(By.CSS_SELECTOR, "button#cookie-ok").click()

# wait for the login form wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[type='submit']")))

# fill in the login credentails username_button = driver.find_element(By.CSS_SELECTOR, "input[name='username']")
username_button.clear()
username_button.send_keys("user123")

password_button = driver.find_element(By.CSS_SELECTOR, "input[name='password']")
password_button.clear()
password_button.send_keys("password")

# click the login submit button driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

# wait for an element on the login redirected page wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div#secret-message")))

secret_message = driver.find_element(By.CSS_SELECTOR, "div#secret-message").text
print(f"The secret message is: {secret_message}")
"The secret message is: 🤫"

# close the browser driver.quit()

Enter fullscreen mode Exit fullscreen mode

The workflow of the above form scraping code is fairly straightforward: the driver waits for the elements to load, fills, and clicks them. Finally, it extracts form data.

For further details on web scraping with Selenium, refer to our dedicated guide.

Playwright

Playwright is a straightforward Python library for browser automation. First, install it using the following pip command:

pip install playwright

Enter fullscreen mode Exit fullscreen mode

After Playwright is successfully installed, use the playwright install command to download the desired browser driver binaries:

playwright install chromium

Enter fullscreen mode Exit fullscreen mode

Similar to the previous code snippet, we’ll scrape forms with Playwright using timeouts and click methods:

from playwright.sync_api import sync_playwright

with sync_playwright() as playwright:
    browser = playwright.chromium.launch()
    context = browser.new_context()
    page = context.new_page()

    # request the target web page     page.goto("https://web-scraping.dev/login?cookies=")

    # accept the cookie policy     page.click("button#cookie-ok")

    # wait for the login form     page.wait_for_selector("button[type='submit']")

    # wait for the page to fully load     page.wait_for_load_state("networkidle")

    # fill in the login credentials     page.fill("input[name='username']", "user123")
    page.fill("input[name='password']", "password")

    # click the login submit button     page.click("button[type='submit']")

    # wait for an element on the login redirected page     page.wait_for_selector("div#secret-message")

    secret_message = page.inner_text("div#secret-message")
    print(f"The secret message is {secret_message}")
    "The secret message is 🤫"

Enter fullscreen mode Exit fullscreen mode

Here, we use the Playwright page methods to wait for selectors or load states. Then, we navigate the form by clicking and filling elements. Finally, the script parses the final web page data using CSS selectors.

Have a look at our dedicated guide on Playwright for the full details on using it for web scraping.

Puppeteer

Finally, let’s explore using Puppeteer to extract form data. First, install it using npm:

npm install puppeteer

Enter fullscreen mode Exit fullscreen mode

Since Puppeteer automatically downloads the related Chrome driver binaries, all we have to do is declare the browser actions directly:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false
    })

    // create a browser page
    const page = await browser.newPage();

    // go to the target web page
    await page.goto(
        'https://web-scraping.dev/login?cookies=',
        { waitUntil: 'domcontentloaded' }
    );

    // wait for 500 milliseconds
    await new Promise(resolve => setTimeout(resolve, 500));

    // accept the cookie policy
    await page.click('button#cookie-ok');

    // wait for the login form
    await page.waitForSelector('button[type="submit"]');

    // fill in the login credentials
    await page.$eval('input[name="username"]', (el, value) => el.value = value, 'user123');
    await page.$eval('input[name="password"]', (el, value) => el.value = value, 'password');
    await new Promise(resolve => setTimeout(resolve, 500));

    // click the login button and wait for navigation
    await page.click('button[type="submit"]');
    await page.waitForSelector('div#secret-message');

    secretMessage = await page.$eval('div#secret-message', node => node.innerHTML);
    console.log(`The secret message is ${secretMessage}`);

    // close the browser
    await browser.close();
})();

Enter fullscreen mode Exit fullscreen mode

Let’s break down the above form scraping code:

  • Initiate a Puppeteer instance in the headful mode.
  • Define timeouts to wait for the elements.
  • Fill and click the elements using their equivalent selectors.
  • Parse the HTML behind the login page.

For further details on web scraping with Puppeteer, refer to our dedicated guide.

Powering Up With ScrapFly

We have explored using both HTTP clients and headless browsers for scraping forms. While HTTP clients are more efficient and use less resources , they are more complicated and can be challanging with complex use cases.

ScrapFly is a web scraping API that allows JavaScript rendering through cloud headless browsers for efficient web scraping. Moreover, it allows for scraping at scale by providing:

图片[4]-How to Scrape Forms - 拾光赋-拾光赋
ScrapFly service does the heavy lifting for you

Here’s how we can approach our form scraping example using ScrapFly. All we have to do is enable the asp parameter to bypass web scraping blocking, enable the render_js feature, and automate the headless browser by declaring the js_scenario steps:

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        # target website URL         url="https://web-scraping.dev/login",
        # bypass anti scraping protection         asp=True,        
        # set the proxy location to a specific country         country="US",
        # # enable the cookies policy         headers={"cookie": "cookiesAccepted=true"},
        # enable JavaScript rendering         render_js=True,
        # scroll down the page automatically         auto_scroll=True,
        # add JavaScript scenarios         js_scenario=[
            {"click": {"selector": "button#cookie-ok"}},
            {"fill": {"selector": "input[name='username']","clear": True,"value": "user123"}},
            {"fill": {"selector": "input[name='password']","clear": True,"value": "password"}},
            {"click": {"selector": "form > button[type='submit']"}},
            {"wait_for_navigation": {"timeout": 5000}}
        ],
        # take a screenshot         screenshots={"logged_in_screen": "fullpage"},
        debug=True
    )
)

# get the HTML from the response html = api_response.scrape_result['content']

# use the built-in Parsel selector selector = api_response.selector
print(f"The secret message is {selector.css('div#secret-message::text').get()}")
"The secret message is 🤫"

Enter fullscreen mode Exit fullscreen mode

Try for FREE!

More on Scrapfly

FAQ

To wrap up this guide, let’s have a look at some frequently questions about web scraping forms.

How to scrape login forms?

Login forms require authentication tokens to present in the request payload. These auth tokens can be added either as a request header or cookie to authoritze the data behind the login form.

How to send form POST requests using web scraping?

To send a form POST request while web scraping, change the request HTTP method and add the form body to the request payload:

import httpx # or requests 
# send the form POST request with the form body login_response = httpx.post(
    "https://web-scraping.dev/api/login",
    data={
        "username": "user123",
        "password": "password",
    },
)
print(dict(login_response.cookies))
{"auth": "user123-secret-token"}
# use the cookie with further requests 

Enter fullscreen mode Exit fullscreen mode

Summary

In this guide, we explained how to scrape forms using Python. We went through a step-by-step guide on extracting data behind forms using httpx, Selenium, Playwright, and Puppeteer through either of two approaches:

  • Replicating form POST requests using HTTP requests.
  • Using headless browsers to automate the form submission.

原文链接:How to Scrape Forms

© 版权声明
THE END
喜欢就支持一下吧
点赞8 分享
评论 抢沙发

请登录后发表评论

    暂无评论内容