Introduction to Web Scraping with Selenium And Python

Web scraping is a fast, affordable and reliable way to get data when you need it. Better still, the data is usually up to date. Bear in mind, though, that scraping a website may violate its terms of use and can get you blocked. While scraping is mostly legal, there are exceptions depending on how you intend to use the data, so do your research before starting. For a simple personal or open-source project, however, you should be OK.

There are many ways to scrape data, but my preferred one is Selenium. It is primarily a testing tool, because what it does is automate a browser. In simple language, it creates a robot browser that does things for you: it can get HTML data, scroll, click buttons, etc. The great advantage is that we can specify exactly which HTML data we want, so we can organize and store it appropriately.

Selenium is compatible with many programming languages, but this tutorial focuses on Python. See the Selenium with Python documentation for the full API.

First Steps

To install Selenium, run this command in your command line:

pip install selenium

If you are working in a Jupyter Notebook, you can do it right there instead of the command line. Just add an exclamation mark at the beginning:

!pip install selenium

After that all you need to do is import the necessary modules:

from selenium.webdriver import Chrome, Firefox

Other browsers are also supported but these two are the most commonly used.

Two simple commands are needed to get started:

browser = Firefox()
(or browser = Chrome() depending on your preference)

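This creates an instance of a Firefox WebDriver that gives us access to all its useful methods and attributes. We assigned it to the variable browser, but you are free to choose your own name. A new blank Firefox window will open automatically. Selenium also needs the matching driver executable (geckodriver for Firefox, chromedriver for Chrome); with older Selenium releases it has to be installed and available on your PATH.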

Next get the URL that you want to scrape:

browser.get('https://en.wikipedia.org/wiki/Main_Page')

The get() method opens the URL in the browser and waits until the page has finished loading. Content that JavaScript adds after the initial load may not be there yet.

Now you can get all the HTML information you want from this URL.
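
As a rough sketch of what "getting the HTML" can look like once a page is loaded, the helper below loads a URL and returns the page title, the final URL and the raw HTML. The name page_summary is made up for this example and is not part of Selenium; the code relies only on the title, current_url and page_source attributes of the WebDriver.

```python
def page_summary(browser, url):
    """Load url in an already-created WebDriver and return basic page data.

    page_summary is a hypothetical helper written for this example;
    browser is any Selenium WebDriver instance, e.g. Firefox() or Chrome().
    """
    browser.get(url)  # returns once the page load has finished
    return {
        "title": browser.title,       # text of the page's <title> tag
        "url": browser.current_url,   # URL the browser ended up on
        "html": browser.page_source,  # HTML of the current page as a string
    }
```

With the browser created above, page_summary(browser, 'https://en.wikipedia.org/wiki/Main_Page') would return a dictionary whose "html" entry holds the whole page as text.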

Locating Elements

There are different ways to locate elements with Selenium. Which one is best depends on the HTML structure of the page you are scraping. It can be tricky to figure out the most efficient way to access the element you want, so take your time and inspect the HTML carefully.

You can either access a single element with a chosen search parameter (you will get the first element that corresponds to your search parameter) or all the elements that match the search parameter. To get a single one use these methods:

find_element_by_id()
find_element_by_name()
find_element_by_xpath()
find_element_by_link_text()
find_element_by_partial_link_text()
find_element_by_tag_name()
find_element_by_class_name()
find_element_by_css_selector()

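To locate multiple elements, just substitute element with elements in the above methods. You will get a list of WebElement objects that match the search. Note that these helper methods belong to older Selenium releases: they were deprecated in Selenium 4 and removed in version 4.3, where the equivalent calls are find_element(By.ID, ...) and find_elements(By.CLASS_NAME, ...), with By imported from selenium.webdriver.common.by.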

Scraping Wikipedia

So let’s see how it works with the already mentioned Wikipedia page https://en.wikipedia.org/wiki/Main_Page

We have already created the browser variable containing an instance of the WebDriver and loaded the main Wikipedia page.

Let’s say we want to access the list of languages that this page is available in and store all the links to them.

After some inspection we can see that all elements have a similar structure: they are <li> elements of class 'interlanguage-link' that contain <a> with a URL and text:

<li class="interlanguage-link interwiki-bg">

   <a href="https://bg.wikipedia.org/wiki/" title="Bulgarian"
   lang="bg" hreflang="bg" class="interlanguage-link-target">

       Български

   </a>

</li>

So let’s first access all <li> elements. We can isolate them using class name:

languages = browser.find_elements_by_class_name('interlanguage-link')

languages is a list of WebElement objects. If we print the first element of it with:

print(languages[0])

It will print something like this:

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="73e70f48-851a-764d-8533-66f738d2bcf6", element="2a579b98-1a03-b04f-afe3-5d3da8aa9ec1")>

So to actually see what’s inside, we will need to write a for loop to access each element from the list, then access its <a> child element and get the <a>’s text and 'href' attribute.

To get the text we can use the text attribute. To get the 'href', use the get_attribute('attribute_name') method. So the code will look like this:

language_names = [language.find_element_by_css_selector('a').text 
                 for language in languages]

links = [language.find_element_by_css_selector('a').get_attribute('href') 
        for language in languages]

You can print out language_names and links to see that it worked.
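
To store the result rather than just print it, one option is to write the two lists to a CSV file. The sketch below uses only the standard csv module; the function name save_links and the file name used in the usage note are assumptions made for this example.

```python
import csv


def save_links(names, urls, path):
    """Write (language, url) pairs to a CSV file with a header row.

    save_links is a hypothetical helper; names and urls are expected to be
    two lists of equal length, like language_names and links above.
    """
    # newline="" is what the csv module recommends for files it writes
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["language", "url"])
        writer.writerows(zip(names, urls))
```

Calling save_links(language_names, links, 'languages.csv') would then produce one row per language.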

Scrolling

Sometimes the whole page is not loaded from the start. In this case we can make the browser scroll down so that the rest of the page loads and its HTML becomes available. This is quite easy with the execute_script() method, which takes JavaScript code as a parameter:

scroll_down = "window.scrollTo(0, document.body.scrollHeight);"
browser.execute_script(scroll_down)

scrollTo(x-coord, y-coord) is a JavaScript method that scrolls to the given coordinates. In our case we pass document.body.scrollHeight, which returns the full height of the element's content (here, the body), so the page scrolls all the way to the bottom.

As you might have guessed, you can make the browser execute all kinds of scripts with the execute_script() method. So if you have experience with JavaScript, you have a lot of room to experiment.
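
On pages that keep loading content as you scroll, a single scroll is often not enough. A possible approach, sketched below, is to keep scrolling until the page height stops growing. The function scroll_to_bottom, its parameters and the default values are assumptions made for this example; only execute_script() comes from Selenium.

```python
import time


def scroll_to_bottom(browser, pause=1.0, max_rounds=20, sleep=time.sleep):
    """Scroll down repeatedly until the page height stops growing.

    Hypothetical helper for illustration. Returns the number of scrolls
    performed. max_rounds guards against pages that never stop growing.
    """
    get_height = "return document.body.scrollHeight;"
    scroll_down = "window.scrollTo(0, document.body.scrollHeight);"

    last_height = browser.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        browser.execute_script(scroll_down)
        rounds += 1
        sleep(pause)  # give newly requested content time to load
        new_height = browser.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new appeared, so we are at the bottom
        last_height = new_height
    return rounds
```

The pause length is a guess that depends on the site and the connection; too short a pause may end the loop before new content has arrived.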

Clicking

Clicking is as easy as selecting an element and calling its click() method. In some cases, if you already know the URLs you need to visit, you can load them directly with get() instead. Again, see which is more efficient.

To give an example of the click() method, let’s click on the ‘Contents’ link from the menu on the left.

The HTML of this link is the following:

<li id="n-contents">
   <a href="/wiki/Portal:Contents" title="Guides to browsing Wikipedia">

        Contents

   </a>
</li>

We have to find the <li> element with the unique id 'n-contents' first and then access its <a> child:

content_element = browser.find_element_by_id('n-contents') \
                         .find_element_by_css_selector('a')

content_element.click()

You can see now that the browser loaded the ‘Contents’ page.
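
The alternative mentioned at the start of this section, loading the link's URL instead of clicking, could look roughly like this. The function open_link_directly is a hypothetical name for this sketch; it reads the 'href' of a link element and resolves it against the current page with urljoin from the standard library, in case the value is relative like the "/wiki/Portal:Contents" shown in the HTML.

```python
from urllib.parse import urljoin


def open_link_directly(browser, link_element):
    """Load the target of an <a> element with get() instead of clicking it.

    Hypothetical helper for illustration; returns the URL that was loaded.
    """
    href = link_element.get_attribute("href")
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the page the browser is currently on
    target = urljoin(browser.current_url, href)
    browser.get(target)
    return target
```

With content_element from above, open_link_directly(browser, content_element) should end up on the same page as content_element.click().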

Downloading Images

Now what if we decide to download images from the page? For this we will use the urllib library and a UUID generator. We will first locate all images with the CSS selector 'img', then access their 'src' attributes, and then create a unique id for each image and download it with the urlretrieve('url', 'folder/name.jpg') function. It takes 2 parameters: the URL of the image and the name we want to give the file, together with the folder we want to download to (if applicable).

import os
from urllib.request import urlretrieve
from uuid import uuid4

# get the main page again
browser.get('https://en.wikipedia.org/wiki/Main_Page')

# locate image elements
images = browser.find_elements_by_css_selector('img')

# access src attribute of the images, skipping images without one
src_list = [img.get_attribute('src') for img in images
            if img.get_attribute('src')]

# make sure the target folder exists before downloading into it
os.makedirs('wiki_images', exist_ok=True)

for src in src_list:
    # create a unique name for each image by using UUID generator
    uuid = uuid4()

    # retrieve images using the URLs
    urlretrieve(src, f"wiki_images/{uuid}.jpg")
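
The loop above saves every file with a .jpg extension, although a page may also serve PNG or SVG images. One way to keep the original extension is sketched below. The function image_filename and the fallback to '.jpg' are assumptions made for this example, not part of the original code.

```python
import os
from urllib.parse import urlparse
from uuid import uuid4


def image_filename(src, folder="wiki_images", default_ext=".jpg"):
    """Build a unique local path that keeps the image's own file extension.

    Hypothetical helper for illustration.
    """
    # take the extension from the URL path, ignoring any query string
    ext = os.path.splitext(urlparse(src).path)[1].lower()
    if not ext:
        ext = default_ext  # assumption: fall back to .jpg when the URL has none
    return os.path.join(folder, f"{uuid4()}{ext}")
```

Inside the loop, urlretrieve(src, image_filename(src)) would then replace the hard-coded f-string.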

Adding Waiting Time Between Actions

And lastly, sometimes it is necessary to introduce some waiting time between actions in the browser, for example when loading a lot of pages one after another. This can be done with the time module.

Let’s load 3 URLs from our links list and make the browser wait for 3 seconds after loading each page, using the time.sleep() function.

import time

urls = links[0:3]

for url in urls:
    browser.get(url)
    # stop for 3 seconds before going for the next page
    time.sleep(3)
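
A fixed pause works, but when visiting many pages it can be gentler on the server to vary the delay a little. The sketch below picks a random pause for each page. The function visit_with_delays and the 2 to 5 second range are assumptions chosen for illustration.

```python
import random
import time


def visit_with_delays(browser, urls, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Load each URL in turn, pausing a random time after every page.

    Hypothetical helper for illustration; returns the list of pauses used.
    """
    delays = []
    for url in urls:
        browser.get(url)
        # pick a pause somewhere between min_delay and max_delay seconds
        delay = random.uniform(min_delay, max_delay)
        sleep(delay)
        delays.append(delay)
    return delays
```

For the example above, visit_with_delays(browser, links[0:3]) would load the same three pages with slightly different pauses between them.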

Closing the WebDriver

And finally, we can close our robot browser’s window with:

browser.close()

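Don’t forget that browser is a variable that contains an instance of the Firefox WebDriver class (see the beginning of the tutorial). Note that close() only closes the current window; to end the whole WebDriver session and shut the browser down, call browser.quit().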
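
If a script fails halfway through, the browser window can be left open. A small context manager, sketched below, makes sure the browser is shut down either way. The name managed_browser is made up for this example, and the sketch assumes you want quit(), which ends the session, rather than close().

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a browser with factory() and always quit it afterwards.

    Hypothetical helper for illustration; factory is a callable such as
    Firefox or Chrome that returns a WebDriver instance.
    """
    browser = factory()
    try:
        yield browser
    finally:
        browser.quit()  # ends the session and closes every window
```

Usage would look like "with managed_browser(Firefox) as browser:" followed by the scraping code in the indented block.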

Code in GitHub

The code from this article is available on GitHub:
https://github.com/AnnaLara/scraping_with_selenium_basics
