Scrape Qwant Organic and Ad Results using Python

A tutorial on extracting the position, title, link, displayed link, snippet, and favicon of each result from qwant.com search results using Python.

In short: this post shows how to scrape the website position (useful for SEO rank tracking), title, link, displayed link, snippet, and favicon from qwant.com search results using Python.

What is required: an understanding of loops, data structures, and exception handling, basic knowledge of CSS selectors, and the bs4, requests, and lxml libraries.

⏱️ How long will it take: ~15-20 minutes to read and implement.



What is Qwant Search

Qwant is a Paris-based European search engine that does not track users for advertising. It runs its own independent indexing engine, is available in 26 languages, and has more than 30 million individual monthly users worldwide.


What will be scraped

Prerequisites

Basic knowledge of scraping with CSS selectors

If you haven’t scraped with CSS selectors, there’s a dedicated blog post of mine about how to use CSS selectors when web-scraping that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.

CSS selectors declare which part of the markup a style applies to, which also makes it possible to extract data from the matching tags and attributes.
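As a small illustration of the idea, here is a sketch that runs selectors against a hand-written HTML fragment. The markup and the `title` and `snippet` class names are made up for this example (they are not Qwant's real classes), and it uses Python's built-in `html.parser` so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a search results page.
html = """
<div data-testid="webResult">
  <a class="title" href="https://example.com/">Example Domain</a>
  <p class="snippet">An example snippet.</p>
</div>
<div data-testid="webResult">
  <a class="title" href="https://example.org/">Another Example</a>
</div>
"""

# "html.parser" ships with Python; the tutorial itself uses "lxml".
soup = BeautifulSoup(html, "html.parser")

# select() returns a list with every element that matches the selector.
results = soup.select("[data-testid=webResult]")

# select_one() returns the first match, or None when nothing matches.
first_title = results[0].select_one(".title")
missing_snippet = results[1].select_one(".snippet")

print(len(results))         # number of matched containers
print(first_title.text)     # text inside the matched tag
print(first_title["href"])  # value of the href attribute
print(missing_snippet)      # None, the second result has no snippet
```

Attribute selectors such as `[data-testid=webResult]` match on an attribute value, while `.title` matches on a class name; both kinds are used throughout this post.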

Separate virtual environment

If you haven’t worked with a virtual environment before, have a look at my dedicated Python virtual environments tutorial using Virtualenv and Poetry to get familiar.

In short, a virtual environment creates an independent set of installed libraries, including different Python versions, that can coexist with each other on the same system, thus preventing library or Python version conflicts.

Install libraries:

pip install requests
pip install lxml 
pip install beautifulsoup4


Reduce the chance of being blocked

There’s a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web-scraping; there are eleven methods to bypass blocks from most websites.
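As one possible starting point, and not necessarily one of the methods from the linked post, the sketch below builds a `requests.Session` that sends a browser-like User-Agent and retries a few times on status codes that often indicate throttling. The `build_session` helper, the placeholder User-Agent string, and the retry settings are assumptions made for illustration; no request is sent by this snippet.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(user_agent):
    """Hypothetical helper: a Session with a custom User-Agent and retries."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})

    # Retry up to 3 times, waiting longer between attempts, when the
    # server answers with one of these status codes.
    retry = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session


# Placeholder value; in practice use a real browser User-Agent string.
session = build_session("Mozilla/5.0 (placeholder user agent)")
adapter = session.get_adapter("https://www.qwant.com/")
print(session.headers["User-Agent"])
print(adapter.max_retries.total)
```

The session can then be used in place of `requests.get`, for example `session.get(url, params=params, timeout=20)`.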


Process

If you don’t need an explanation:

Starting code for both organic and ad results:

from bs4 import BeautifulSoup
import requests, lxml, json

headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 10; HD1913) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.105 Mobile Safari/537.36 EdgA/46.1.2.5140"
}

params = {
    "q": "minecraft",
    "t": "web"
}


html = requests.get("https://www.qwant.com/", params=params, headers=headers, timeout=20)
soup = BeautifulSoup(html.text, "lxml")
# further code... 


Import libraries:

from bs4 import BeautifulSoup
import requests, lxml, json


Add user-agent and query parameters to request:

headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 10; HD1913) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.105 Mobile Safari/537.36 EdgA/46.1.2.5140"
}

params = {
    "q": "minecraft",  # search query
    "t": "web"         # qwant query argument for displaying web results
}

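To see what the params dictionary turns into, here is a sketch that builds the query string by hand with the standard library. requests does this encoding internally when you pass `params=`, so this is only an illustration of the resulting URL, not a step you need in the scraper.

```python
from urllib.parse import urlencode

params = {
    "q": "minecraft",  # search query
    "t": "web",        # qwant query argument for displaying web results
}

# Encode the dictionary as key=value pairs joined with "&".
query_string = urlencode(params)
url = "https://www.qwant.com/?" + query_string

print(url)  # https://www.qwant.com/?q=minecraft&t=web
```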

Make a request, add timeout argument, create BeautifulSoup() object:

html = requests.get("https://www.qwant.com/", params=params, headers=headers, timeout=20)
soup = BeautifulSoup(html.text, "lxml")


  • The timeout parameter tells requests to stop waiting for a response after the given number of seconds.
  • BeautifulSoup() parses the returned HTML into a tree that can be searched. lxml is the HTML parser it uses here.

Extract Organic Results

def scrape_organic_results():

    organic_results_data = []

    for index, result in enumerate(soup.select("[data-testid=webResult]"), start=1):
        title = result.select_one(".WebResult-module__title___MOBFg").text
        link = result.select_one(".Stack-module__VerticalStack___2NDle.Stack-module__Spacexxs___3wU9G a")["href"]
        snippet = result.select_one(".Box-module__marginTopxxs___RMB_d").text

        try:
            displayed_link = result.select_one(".WebResult-module__permalink___MJGeh").text
            favicon = result.select_one(".WebResult-module__iconBox___3DAv5 img")["src"]
        except (AttributeError, TypeError, KeyError):  # element or attribute is missing
            displayed_link = None
            favicon = None

        organic_results_data.append({
            "position": index,
            "title": title,
            "link": link,
            "displayed_link": displayed_link,
            "snippet": snippet,
            "favicon": favicon
        })

    print(json.dumps(organic_results_data, indent=2))


scrape_organic_results()


Create temporary list() to store extracted data:

organic_results_data = []


Iterate and extract the data:

for index, result in enumerate(soup.select("[data-testid=webResult]"), start=1):
    title = result.select_one(".WebResult-module__title___MOBFg").text
    link = result.select_one(".Stack-module__VerticalStack___2NDle.Stack-module__Spacexxs___3wU9G a")["href"]
    snippet = result.select_one(".Box-module__marginTopxxs___RMB_d").text

    try:
        displayed_link = result.select_one(".WebResult-module__permalink___MJGeh").text
        favicon = result.select_one(".WebResult-module__iconBox___3DAv5 img")["src"]
    except (AttributeError, TypeError, KeyError):  # element or attribute is missing
        displayed_link = None
        favicon = None


To get the position index, we can use the enumerate() function, which adds a counter to an iterable and returns it. Setting start to 1 makes the count start from 1 instead of 0.
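A quick sketch of the difference that start makes, using a made-up list of titles in place of the scraped results:

```python
# Hypothetical stand-in for the list returned by soup.select(...).
titles = ["first result", "second result", "third result"]

# Without start, the counter begins at 0.
default_positions = [index for index, _ in enumerate(titles)]

# With start=1, the counter matches the visible ranking on the page.
positions = [index for index, _ in enumerate(titles, start=1)]

print(default_positions)  # [0, 1, 2]
print(positions)          # [1, 2, 3]
```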

To handle missing values, we can use a try/except block: if Qwant returns no displayed link or favicon for a result, we set both to None. Otherwise the code would throw an error, because select_one() returns None when no element matches, and None has neither a .text attribute nor a ["src"] key.
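An alternative to try/except is to check for None explicitly. The sketch below uses two hypothetical helpers, `text_or_none` and `attr_or_none` (not part of the original code), on made-up markup with placeholder class names. One side effect of this design is that each field is handled on its own, so a missing favicon does not also blank out the displayed link.

```python
from bs4 import BeautifulSoup


def text_or_none(tag, selector):
    """Return the text of the first match, or None if nothing matches."""
    element = tag.select_one(selector)
    return element.text if element is not None else None


def attr_or_none(tag, selector, attribute):
    """Return an attribute of the first match, or None if it is missing."""
    element = tag.select_one(selector)
    return element.get(attribute) if element is not None else None


# Hypothetical markup: the result has a permalink but no favicon image.
html = """
<div data-testid="webResult">
  <span class="permalink">example.com</span>
</div>
"""
result = BeautifulSoup(html, "html.parser").select_one("[data-testid=webResult]")

displayed_link = text_or_none(result, ".permalink")
favicon = attr_or_none(result, ".iconBox img", "src")

print(displayed_link)  # example.com
print(favicon)         # None
```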

Append extracted data to temporary list() as a dictionary:

organic_results_data.append({
    "position": index,
    "title": title,
    "link": link,
    "displayed_link": displayed_link,
    "snippet": snippet,
    "favicon": favicon
})


Print the data:

print(json.dumps(organic_results_data, indent=2))


# part of the output:
'''
[
  {
    "position": 1,
    "title": "Minecraft Official Site | Minecraft",
    "link": "https://www.minecraft.net/",
    "displayed_link": "minecraft.net",
    "snippet": "Get all-new items in the Minecraft Master Chief Mash-Up DLC on 12/10, and the Superintendent shirt in Character Creator, free for a limited time! Learn more. Climb high and dig deep. Explore bigger mountains, caves, and biomes along with an increased world height and updated terrain generation in the Caves & Cliffs Update: Part II! Learn more . Play Minecraft games with Game Pass. Get your ...",
    "favicon": "https://s.qwant.com/fav/m/i/www_minecraft_net.ico"
  },
  ... other results
  {
    "position": 10,
    "title": "Minecraft - download free full version game for PC ...",
    "link": "http://freegamepick.net/en/minecraft/",
    "displayed_link": "freegamepick.net",
    "snippet": "Minecraft Download Game Overview. Minecraft is a game about breaking and placing blocks. It's developed by Mojang. At first, people built structures to protect against nocturnal monsters, but as the game grew players worked together to create wonderful, imaginative things. It can also be about adventuring with friends or watching the sun rise over a blocky ocean.",
    "favicon": "https://s.qwant.com/fav/f/r/freegamepick_net.ico"
  }
]
'''

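Since Qwant serves results in many languages, titles and snippets may contain non-ASCII characters. By default json.dumps() escapes them; the sketch below, using a made-up record, shows how the ensure_ascii argument changes the printed output. Whether you need it depends on where the JSON ends up.

```python
import json

# Hypothetical record with a non-ASCII character in the title.
data = [{"position": 1, "title": "Café Minecraft"}]

escaped = json.dumps(data, indent=2)                       # default: ASCII-only output
readable = json.dumps(data, indent=2, ensure_ascii=False)  # keeps characters as-is

print(escaped)
print(readable)
```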


Extract Ad Results

def scrape_ad_results():

    ad_results_data = []

    for index, ad_result in enumerate(soup.select("[data-testid=adResult]"), start=1):
        ad_title = ad_result.select_one(".WebResult-module__title___MOBFg").text
        ad_link = ad_result.select_one(".Stack-module__VerticalStack___2NDle a")["href"]
        ad_displayed_link = ad_result.select_one(".WebResult-module__domain___1LJmo").text
        ad_snippet = ad_result.select_one(".Box-module__marginTopxxs___RMB_d").text
        ad_favicon = ad_result.select_one(".WebResult-module__iconBox___3DAv5 img")["src"]

        ad_results_data.append({
            "ad_position": index,
            "ad_title": ad_title,
            "ad_link": ad_link,
            "ad_displayed_link": ad_displayed_link,
            "ad_snippet": ad_snippet,
            "ad_favicon": ad_favicon
        })

    print(json.dumps(ad_results_data, indent=2))


scrape_ad_results()


Create temporary list() to store extracted data:

ad_results_data = []


Iterate and extract:

for index, ad_result in enumerate(soup.select("[data-testid=adResult]"), start=1):
    ad_title = ad_result.select_one(".WebResult-module__title___MOBFg").text
    ad_link = ad_result.select_one(".Stack-module__VerticalStack___2NDle a")["href"]
    ad_displayed_link = ad_result.select_one(".WebResult-module__domain___1LJmo").text
    ad_snippet = ad_result.select_one(".Box-module__marginTopxxs___RMB_d").text
    ad_favicon = ad_result.select_one(".WebResult-module__iconBox___3DAv5 img")["src"]


The same approach was used to get the position index. The main difference is the CSS “container” selector: [data-testid=adResult] for ads versus [data-testid=webResult] for organic results. The displayed link is also read from a different class, .WebResult-module__domain___1LJmo.
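Because the two loops differ mostly in the container selector, the shared part could be pulled into one function. This is a sketch on made-up markup: `collect_results` and the `title` class are hypothetical, and the real scraper would still need the per-type selectors shown in this post.

```python
from bs4 import BeautifulSoup

# Hypothetical page with one ad and two organic results.
html = """
<div data-testid="adResult"><a class="title" href="https://ads.example/">Sponsored</a></div>
<div data-testid="webResult"><a class="title" href="https://example.com/">First</a></div>
<div data-testid="webResult"><a class="title" href="https://example.org/">Second</a></div>
"""


def collect_results(soup, container_selector):
    """Collect position, title and link for every matching container."""
    results = []
    for index, result in enumerate(soup.select(container_selector), start=1):
        anchor = result.select_one(".title")
        results.append({
            "position": index,
            "title": anchor.text,
            "link": anchor["href"],
        })
    return results


soup = BeautifulSoup(html, "html.parser")
organic = collect_results(soup, "[data-testid=webResult]")
ads = collect_results(soup, "[data-testid=adResult]")

print(organic)
print(ads)
```

Positions are counted separately per container type, so the first ad and the first organic result both get position 1, just as in the two functions of this post.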

Append extracted data to temporary list() as a dictionary:

ad_results_data.append({
    "ad_position": index,
    "ad_title": ad_title,
    "ad_link": ad_link,
    "ad_displayed_link": ad_displayed_link,
    "ad_snippet": ad_snippet,
    "ad_favicon": ad_favicon
})


Print the data:

print(json.dumps(ad_results_data, indent=2))

# output:
'''
[
  {
    "ad_position": 1,
    "ad_title": "Watch Movies & TV on Amazon - Download in HD on Amazon Video",
    "ad_link": "https://www.bing.com/aclick?ld=e8pyYjhclU87kOyQ4ap78CRzVUCUxgK0MGMfKx1YlQe_w7Nbzamra9cSRmPFAtSOVF4MliAqbJNdotR3G-aqHSaMOI0tqV9K0EAFRTemYDKhbqLyjFW93Lsh0mnyySb8oIj6GXADnoePUk-etFDgSvPdZI0xObBo4hesqbOHypYhSGeJ-ZbG1eY0kijv95k0XJ9WKPPA&u=aHR0cHMlM2ElMmYlMmZ3d3cuYW1hem9uLmNvLnVrJTJmcyUyZiUzZmllJTNkVVRGOCUyNmtleXdvcmRzJTNkbWluZWNyYWZ0JTJidGhlJTI2aW5kZXglM2RhcHMlMjZ0YWclM2RoeWRydWtzcG0tMjElMjZyZWYlM2RwZF9zbF8ydmdscmFubWxwX2UlMjZhZGdycGlkJTNkMTE0NDU5MjQzNjk0ODQzOSUyNmh2YWRpZCUzZDcxNTM3MTUwMzgzNDA4JTI2aHZuZXR3JTNkcyUyNmh2cW10JTNkZSUyNmh2Ym10JTNkYmUlMjZodmRldiUzZG0lMjZodmxvY2ludCUzZCUyNmh2bG9jcGh5JTNkMTQxMTcxJTI2aHZ0YXJnaWQlM2Rrd2QtNzE1Mzc2Njc4MjI5NzklM2Fsb2MtMjM1JTI2aHlkYWRjciUzZDU5MTJfMTg4MTc4NQ&rlid=c61aa73b62e916116cbdc687c021190a",
    "ad_displayed_link": "amazon.co.uk",
    "ad_snippet": "Download now. Watch anytime on Amazon Video.",
    "ad_favicon": "https://s.qwant.com/fav/a/m/www_amazon_co_uk.ico"
  }
]
'''



Full Code

from bs4 import BeautifulSoup
import requests, lxml, json

headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 10; HD1913) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.105 Mobile Safari/537.36 EdgA/46.1.2.5140"
}

params = {
    "q": "minecraft",
    "t": "web"
}

html = requests.get("https://www.qwant.com/", params=params, headers=headers, timeout=20)
soup = BeautifulSoup(html.text, "lxml")


def scrape_organic_results():

    organic_results_data = []

    for index, result in enumerate(soup.select("[data-testid=webResult]"), start=1):
        title = result.select_one(".WebResult-module__title___MOBFg").text
        link = result.select_one(".Stack-module__VerticalStack___2NDle.Stack-module__Spacexxs___3wU9G a")["href"]
        snippet = result.select_one(".Box-module__marginTopxxs___RMB_d").text

        try:
            displayed_link = result.select_one(".WebResult-module__permalink___MJGeh").text
            favicon = result.select_one(".WebResult-module__iconBox___3DAv5 img")["src"]
        except (AttributeError, TypeError, KeyError):  # element or attribute is missing
            displayed_link = None
            favicon = None

        organic_results_data.append({
            "position": index,
            "title": title,
            "link": link,
            "displayed_link": displayed_link,
            "snippet": snippet,
            "favicon": favicon
        })

    print(json.dumps(organic_results_data, indent=2))


def scrape_ad_results():

    ad_results_data = []

    for index, ad_result in enumerate(soup.select("[data-testid=adResult]"), start=1):
        ad_title = ad_result.select_one(".WebResult-module__title___MOBFg").text
        ad_link = ad_result.select_one(".Stack-module__VerticalStack___2NDle a")["href"]
        ad_displayed_link = ad_result.select_one(".WebResult-module__domain___1LJmo").text
        ad_snippet = ad_result.select_one(".Box-module__marginTopxxs___RMB_d").text
        ad_favicon = ad_result.select_one(".WebResult-module__iconBox___3DAv5 img")["src"]

        ad_results_data.append({
            "ad_position": index,
            "ad_title": ad_title,
            "ad_link": ad_link,
            "ad_displayed_link": ad_displayed_link,
            "ad_snippet": ad_snippet,
            "ad_favicon": ad_favicon
        })

    print(json.dumps(ad_results_data, indent=2))


scrape_organic_results()
scrape_ad_results()


Links


Outro

If you have anything to share, or any questions or suggestions about this blog post, feel free to reach out via the comments section or via Twitter at @dimitryzub or @serp_api.

Yours,
Dimitry, and the rest of the SerpApi Team.


Join us on Reddit | Twitter | YouTube

Add a Feature Request or a Bug

Original article: Scrape Qwant Organic and Ad Results using Python
