Scrape Naver Organic Results with Python

Naver.com Web Scraping (5 Part Series)

1 Scrape Naver News Results with Python
2 Scrape Naver Organic Results with Python
3 Scrape all Naver Video Results using pagination in Python
4 Scrape Naver Video Results in Python
5 Scrape Naver Related Search Results with Python

What is Naver Search

I covered what Naver Search is in the first post of this series, about scraping Naver News results.

Intro

This tutorial blog post continues the Naver web scraping series. Here you’ll see how to scrape the website ranking, title, link, displayed link, and snippet from Naver Organic Results with Python, using the beautifulsoup, requests, and lxml libraries.

Note: This blog post shows how to extract only the data shown in the “What will be scraped” section.

Prerequisites and Imports

pip install requests
pip install lxml 
pip install beautifulsoup4


  • Basic knowledge of Python.
  • Basic familiarity with the packages mentioned above.
  • Basic understanding of CSS selectors, because you’ll mostly see the select()/select_one() beautifulsoup methods, which accept CSS selectors.

I wrote a dedicated blog post about web scraping with CSS selectors that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.

Imports

import requests, lxml
from bs4 import BeautifulSoup


What will be scraped


Process

If you don’t need an explanation, jump to the code section.

The process takes three steps:

  1. Save HTML locally to test everything before making a lot of direct requests.
  2. Pick CSS selectors for all the needed data.
  3. Extract the data.

Save HTML to test the parser locally

Saving the HTML locally reduces the risk of your IP address being blocked or banned, especially when many requests need to be made to the same website just to test the code.

A normal user won’t make 100+ requests in a very short period of time, and won’t repeat the same action over and over (a pattern) the way scripts do. Websites might flag this behavior as unusual and block the IP address for some period (the block notice might appear in the response: requests.get("URL").text) or ban it permanently.

import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "query": "bruce lee",
    "where": "web"        # there's also a "nexearch" param that will produce different results
}

def save_naver_organic_results():
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

    # replace every space with an underscore (_) so bruce lee becomes bruce_lee
    query = params['query'].replace(" ", "_")

    # explicit UTF-8 so Korean text is written the same way on every platform
    with open(f"{query}_naver_organic_results.html", mode="w", encoding="utf-8") as file:
        file.write(html)

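The save-once-then-parse idea can also be sketched as a small cache helper: fetch the page only when no local copy exists yet, and read the saved file on later runs. This is a minimal sketch and not part of the original code; html_filename(), load_or_fetch(), and the fetch callable are hypothetical names, and fetch stands in for whatever performs the real request.

```python
from pathlib import Path
from typing import Callable


def html_filename(query: str) -> str:
    # same naming scheme as the tutorial: "bruce lee" -> "bruce_lee_naver_organic_results.html"
    return f"{query.replace(' ', '_')}_naver_organic_results.html"


def load_or_fetch(query: str, fetch: Callable[[str], str], cache_dir: str = ".") -> str:
    """Return the saved HTML for `query`, calling `fetch` only if no file exists yet."""
    path = Path(cache_dir) / html_filename(query)

    if path.exists():
        # reuse the local copy, so no request is made
        return path.read_text(encoding="utf-8")

    html = fetch(query)  # hypothetical callable that performs the real request
    path.write_text(html, encoding="utf-8")
    return html
```

In practice, fetch could be a small function that wraps the requests.get(...).text call and returns the HTML string.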

Now, what’s happening here

Import requests library
import requests


Add user-agent and query parameters

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# query parameters
params = {
    "query": "bruce lee",
    "where": "web"
}


I tend to pass query parameters to requests.get(params=params) instead of leaving them in the URL because I find it more readable. For example, compare two ways of requesting the exact same URL:

params = {
    "where": "web",
    "sm": "top_hty",
    "fbm": "1",
    "ie": "utf8",
    "query": "bruce lee"  # plain text; requests URL-encodes the space for you
}
requests.get("https://search.naver.com/search.naver", params=params)

# VS 
requests.get("https://search.naver.com/search.naver?where=web&sm=top_hty&fbm=1&ie=utf8&query=bruce+lee")  # Press F. 

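One detail worth checking when moving parameters out of the URL is how values get encoded. The sketch below uses the standard library's urllib.parse.urlencode, which, to my knowledge, is the same encoding requests applies to params. It shows that a space becomes "+", while a literal "+" in a value is escaped as "%2B".

```python
from urllib.parse import urlencode

base_url = "https://search.naver.com/search.naver"

# value written as plain text: the space is encoded as "+"
plain = urlencode({"where": "web", "query": "bruce lee"})

# value that already contains "+": the plus sign itself gets escaped
pre_encoded = urlencode({"where": "web", "query": "bruce+lee"})

print(f"{base_url}?{plain}")        # ...?where=web&query=bruce+lee
print(f"{base_url}?{pre_encoded}")  # ...?where=web&query=bruce%2Blee
```

So values in the params dictionary are best written as plain text, leaving the encoding to the library.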

As for the user-agent, it makes the request look like a “real” user visit; without it the request might be denied. You can read more about it in my other blog post about how to reduce the chance of being blocked while web scraping search engines.


Pick and test CSS selectors

Select the container (the CSS selector that wraps all the needed data), title, link, displayed link, and snippet.

These selectors translate to this code snippet:

for result in soup.select(".total_wrap"):
    title = result.select_one(".total_tit").text.strip()
    link = result.select_one(".total_tit .link_tit")["href"]
    displayed_link = result.select_one(".total_source").text.strip()
    snippet = result.select_one(".dsc_txt").text

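Keep in mind that select_one() returns None when nothing matches, so chaining .text on it raises AttributeError if a result lacks, say, a snippet. Below is a minimal sketch of a defensive variant. The HTML fragment is a made-up stand-in that only mimics the class names used in this post (it is not real Naver markup), text_or_none() is a hypothetical helper, and html.parser is used only to keep the sketch light on dependencies.

```python
from bs4 import BeautifulSoup

# made-up fragment that only mimics the class names used in this post
html = """
<div class="total_wrap">
  <div class="total_tit"><a class="link_tit" href="https://example.com/a">First</a></div>
  <div class="total_source">example.com</div>
  <div class="dsc_txt">Has a snippet</div>
</div>
<div class="total_wrap">
  <div class="total_tit"><a class="link_tit" href="https://example.com/b">Second</a></div>
  <div class="total_source">example.com</div>
</div>
"""


def text_or_none(parent, selector):
    # hypothetical helper: stripped text of the first match, or None if nothing matches
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(html, "html.parser")

rows = []
for result in soup.select(".total_wrap"):
    link_node = result.select_one(".total_tit .link_tit")
    rows.append({
        "title": text_or_none(result, ".total_tit"),
        "link": link_node["href"] if link_node is not None else None,
        "displayed_link": text_or_none(result, ".total_source"),
        "snippet": text_or_none(result, ".dsc_txt"),  # None for the second result
    })

print(rows)
```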


Extract data

import lxml, json
from bs4 import BeautifulSoup


def extract_local_html_naver_organic_results():
    with open("bruce_lee_naver_organic_results.html", mode="r", encoding="utf-8") as html_file:
        html = html_file.read()
        soup = BeautifulSoup(html, "lxml")

        data = []

        for index, result in enumerate(soup.select(".total_wrap")):
            title = result.select_one(".total_tit").text.strip()
            link = result.select_one(".total_tit .link_tit")["href"]
            displayed_link = result.select_one(".total_source").text.strip()
            snippet = result.select_one(".dsc_txt").text

            data.append({
                "position": index + 1, # starts from 1, not from 0
                "title": title,
                "link": link,
                "displayed_link": displayed_link,
                "snippet": snippet
            })

        print(json.dumps(data, indent=2, ensure_ascii=False))

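The ensure_ascii=False argument matters for Korean results: by default json.dumps() escapes every non-ASCII character as a \uXXXX sequence, which makes the printed output hard to read. A minimal sketch with a made-up one-item list (the title is sample data, not a scraped result):

```python
import json

# made-up sample data, only to show the effect of ensure_ascii
data = [{"position": 1, "title": "이소룡"}]

escaped = json.dumps(data, indent=2)                       # default: ensure_ascii=True
readable = json.dumps(data, indent=2, ensure_ascii=False)  # keeps Hangul as-is

print(escaped)   # the title is printed as \uXXXX escape sequences
print(readable)  # the title is printed as 이소룡
```

Both strings decode back to the same data; only the readability of the printed text differs.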

Now let’s break down the extraction part

Import bs4, lxml, json libraries
import lxml, json
from bs4 import BeautifulSoup


Open the saved HTML file, read it, pass it to the BeautifulSoup() object, and assign lxml as the HTML parser
with open("bruce_lee_naver_organic_results.html", mode="r", encoding="utf-8") as html_file:
    html = html_file.read()
    soup = BeautifulSoup(html, "lxml")


Create a temporary list() to store the extracted data
data = []


Iterate and append each result as a dictionary to the temporary list()

Since we also need an index (rank position), we can use the enumerate() function, which adds a counter to an iterable and returns it.

Example:

grocery = ["bread", "milk", "butter"]  # iterable

for index, item in enumerate(grocery):
  print(f"{index} {item}")

'''
0 bread
1 milk
2 butter
'''


Actual code:

# in our case the iterable is soup.select(), since it returns an iterable as well
for index, result in enumerate(soup.select(".total_wrap")):
    title = result.select_one(".total_tit").text.strip()
    link = result.select_one(".total_tit .link_tit")["href"]
    displayed_link = result.select_one(".total_source").text.strip()
    snippet = result.select_one(".dsc_txt").text

    data.append({
        "position": index + 1,  # starts from 1, not from 0
        "title": title,
        "link": link,
        "displayed_link": displayed_link,
        "snippet": snippet
    })

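As a side note, enumerate() also accepts a start argument, so the index + 1 offset could be dropped. A small sketch with a grocery list, offered as an alternative rather than a change to the tutorial's code:

```python
grocery = ["bread", "milk", "butter"]

# start=1 makes the counter begin at 1 instead of 0
ranked = [
    {"position": position, "item": item}
    for position, item in enumerate(grocery, start=1)
]

print(ranked[0])  # {'position': 1, 'item': 'bread'}
```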


Full Code

Combining everything gives four functions:

  • The first function saves HTML locally.
  • The second function opens local HTML and calls a parser function.
  • The third function makes an actual request and calls a parser function.
  • The fourth function is a parser that’s being called by the second and third functions.

Note: the first and second functions can be skipped if you don’t want to save the HTML locally, but keep in mind the possible consequences mentioned above.

import requests
import lxml, json
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "query": "bruce lee",  # search query
    "where": "web"         # nexearch will produce different results
}


# function that saves HTML locally
def save_naver_organic_results():
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

    # replace every space so bruce lee becomes bruce_lee
    query = params['query'].replace(" ", "_")

    with open(f"{query}_naver_organic_results.html", mode="w", encoding="utf-8") as file:
        file.write(html)


# function that opens local HTML and calls a parser function
def extract_naver_organic_results_from_html():
    with open("bruce_lee_naver_organic_results.html", mode="r", encoding="utf-8") as html_file:
        html = html_file.read()

        # calls naver_organic_results_parser() function to parse the page
        data = naver_organic_results_parser(html)

        print(json.dumps(data, indent=2, ensure_ascii=False))


# function that makes an actual request and calls a parser function
def extract_naver_organic_results_from_url():
    # .text so the parser receives an HTML string, same as the local-file function
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

    # calls naver_organic_results_parser() function to parse the page
    data = naver_organic_results_parser(html)

    print(json.dumps(data, indent=2, ensure_ascii=False))


# parser that's being called by the second and third functions; expects an HTML string
def naver_organic_results_parser(html):
    soup = BeautifulSoup(html, "lxml")

    data = []

    for index, result in enumerate(soup.select(".total_wrap")):
        title = result.select_one(".total_tit").text.strip()
        link = result.select_one(".total_tit .link_tit")["href"]
        displayed_link = result.select_one(".total_source").text.strip()
        snippet = result.select_one(".dsc_txt").text

        data.append({
            "position": index + 1, # starts from 1, not from 0
            "title": title,
            "link": link,
            "displayed_link": displayed_link,
            "snippet": snippet
        })

    return data



Using Naver Web Organic Results API

Alternatively, you can achieve the same results by using SerpApi. SerpApi is a paid API with a free plan.

The difference is that there’s no need to create the parser from scratch, pick the correct CSS selectors, and get frustrated when certain selectors don’t work as you expected. There’s also no need to maintain the parser over time: if something in the HTML changes, a hand-written script will blow up with an error on the next run.

Additionally, there’s no need to bypass blocks from Google (or other search engines) or to figure out how to scale request volume, because that’s already handled under the hood for end users on appropriate plans. Have a try in the playground.

Install SerpApi library

pip install google-search-results


Example code to integrate:

from serpapi import GoogleSearch
import os, json


def serpapi_get_naver_organic_results():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "naver",     # search engine (Google, Bing, DuckDuckGo..)
        "query": "Bruce Lee",  # search query
        "where": "web"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    data = []

    for result in results["organic_results"]:
        data.append({
            "position": result["position"],
            "title": result["title"],
            "link": result["link"],
            "displayed_link": result["displayed_link"],
            "snippet": result["snippet"]
        })

    print(json.dumps(data, indent=2, ensure_ascii=False))


Let’s see what is happening here

Import serpapi, os, json libraries
from serpapi import GoogleSearch
import os, json


Pass search parameters as a dictionary ({})
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "naver",                # search engine (Google, Bing, DuckDuckGo..)
    "query": "Bruce Lee",             # search query
    "where": "web"                    # filter to extract data from organic results
}

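Note that os.getenv("API_KEY") returns None when the variable is not set, so a missing key would presumably only surface later, when the search is made. A minimal sketch of failing early instead; require_api_key() is a hypothetical helper, and its env parameter exists only so the function can be exercised without touching the real environment.

```python
import os
from typing import Mapping


def require_api_key(name: str = "API_KEY", env: Mapping[str, str] = os.environ) -> str:
    # hypothetical helper: return the key, or stop with a clear message if it is missing
    key = env.get(name)
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key
```

With it, the dictionary entry could read "api_key": require_api_key().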

Data extraction

The extraction itself happens under the hood in these two lines of code, so you don’t have to think about it.

search = GoogleSearch(params) # data extraction
results = search.get_dict()   # structured JSON which is being called later


Create a list() to temporarily store the data
data = []


Iterate and append() extracted data to a list() as a dictionary ({})
for result in results["organic_results"]:
    data.append({
        "position": result["position"],
        "title": result["title"],
        "link": result["link"],
        "displayed_link": result["displayed_link"],
        "snippet": result["snippet"]
    })

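Indexing with results["organic_results"] raises KeyError if the response has no such key, which can presumably happen when a search yields no organic results, and the same applies to any field missing from a single result. A minimal sketch using dict.get() instead; the results dictionary below is made-up sample data shaped like the output in this post, not a real API response.

```python
# made-up sample shaped like the structure used in this post; the second item has no snippet
results = {
    "organic_results": [
        {"position": 1, "title": "Bruce Lee", "link": "https://brucelee.com/",
         "displayed_link": "brucelee.com", "snippet": "New Podcast Episode"},
        {"position": 2, "title": "Example", "link": "https://example.com/",
         "displayed_link": "example.com"},
    ]
}

fields = ("position", "title", "link", "displayed_link", "snippet")

data = [
    {field: result.get(field) for field in fields}      # missing fields become None
    for result in results.get("organic_results", [])    # missing key gives an empty list
]

print(data[1]["snippet"])  # None
```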

Print added data
print(json.dumps(data, indent=2, ensure_ascii=False))


# ----------------
# part of the output
'''
[
  {
    "position": 1,
    "title": "Bruce Lee",
    "link": "https://brucelee.com/",
    "displayed_link": "brucelee.com",
    "snippet": "New Podcast Episode: #402 Flowing with Dustin Nguyen Watch + Listen to Episode “Your inspiration continues to guide us toward our personal liberation.” - Bruce Lee - More Podcast Episodes HBO Announces Order For Season 3 of Warrior! WARRIOR Seasons 1 & 2 Streaming Now on HBO & HBO Max “Warrior is still the best show you’re"
  }
  # other results..
]
'''


If you need more information about the plans, SerpApi team member Justin O’Hara explained them earlier in his breakdown of SerpApi’s subscriptions blog post (the information is the same, except you don’t have to log in to the SerpApi website).



Outro

If you have anything to share, any questions, suggestions, or something that isn’t working correctly, feel free to drop a comment in the comment section or reach out via Twitter at @dimitryzub or @serp_api.

Yours,
Dimitry, and the rest of SerpApi Team.


