Web scraping is a fast, affordable and reliable way to get data when you need it. Even better, the data is usually up to date. Bear in mind, though, that scraping a website may violate its usage policy and get you blocked. While scraping is mostly legal, there can be exceptions depending on how you are going to use the data, so do your research before starting. For a simple personal or open-source project, however, you should be OK.
There are many ways to scrape data, but the one I prefer is Selenium. It is primarily used for testing, because what it does is automate a browser. In simple terms, it creates a robot browser that does things for you: it can get HTML data, scroll, click buttons, etc. The great advantage is that we can specify exactly which HTML data we want, so we can organize and store it appropriately.
Selenium is compatible with many programming languages, but this tutorial focuses on Python. See the Selenium with Python documentation for more details.
First Steps
To install Selenium, run this command in your command line:
pip install selenium
If you are working in a Jupyter Notebook, you can do it right there instead of the command line. Just add an exclamation mark at the beginning:
!pip install selenium
After that all you need to do is import the necessary modules:
from selenium.webdriver import Chrome, Firefox
Other browsers are also supported but these two are the most commonly used.
Two simple commands are needed to get started:
browser = Firefox()
(or browser = Chrome(), depending on your preference)
This creates an instance of a Firefox WebDriver that gives us access to all its useful methods and attributes. We assigned it to the variable browser, but you are free to choose your own name. A new blank Firefox window will open automatically.
Next, load the URL that you want to scrape:
browser.get('https://en.wikipedia.org/wiki/Main_Page')
The get() method opens the URL in the browser and waits until the page has loaded.
Now you can get all the HTML information you want from this URL.
Locating Elements
There are different ways to locate elements with Selenium. Which one is best depends on the HTML structure of the page you are scraping. It can be tricky to figure out the most efficient way to access the element you want, so take your time and inspect the HTML carefully.
You can either access a single element with a chosen search parameter (you will get the first element that corresponds to your search parameter) or all the elements that match the search parameter. To get a single one use these methods:
find_element_by_id()
find_element_by_name()
find_element_by_xpath()
find_element_by_link_text()
find_element_by_partial_link_text()
find_element_by_tag_name()
find_element_by_class_name()
find_element_by_css_selector()
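To locate multiple elements, just substitute element with elements in the above methods. You will get a list of WebElement objects matching the search parameter. Note that these find_element_by_* methods belong to Selenium 3: they were deprecated in Selenium 4 and later removed, so with a current Selenium version use find_element(By.ID, ...) and find_elements(By.CLASS_NAME, ...) instead, with By imported from selenium.webdriver.common.by.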
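Both kinds of lookup are also available through the generic find_element(by, value) and find_elements(by, value) methods. A single-element lookup raises NoSuchElementException when nothing matches, whereas the plural form returns an empty list. Here is a minimal sketch that uses this to make a lookup optional; find_or_none is a hypothetical helper name, not part of Selenium:

```python
def find_or_none(parent, by, value):
    """Return the first element matching the locator, or None if nothing matches.

    `parent` can be a WebDriver or a WebElement; both expose find_elements(by, value).
    """
    matches = parent.find_elements(by, value)  # empty list when nothing matches
    return matches[0] if matches else None


# Usage, assuming `browser` is a live WebDriver as created earlier:
#   from selenium.webdriver.common.by import By
#   sidebar_item = find_or_none(browser, By.ID, "n-contents")
#   if sidebar_item is None:
#       print("element not found")
```

This avoids wrapping every optional lookup in a try/except block.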
Scraping Wikipedia
So let’s see how it works with the already mentioned Wikipedia page https://en.wikipedia.org/wiki/Main_Page
We have already created the browser variable containing an instance of the WebDriver and loaded the main Wikipedia page.
Let’s say we want to access the list of languages that this page can be translated to and store all the links to them.
After some inspection we can see that all elements have a similar structure: they are <li> elements of class 'interlanguage-link' that contain an <a> with a URL and text:
<li class="interlanguage-link interwiki-bg">
<a href="https://bg.wikipedia.org/wiki/" title="Bulgarian"
lang="bg" hreflang="bg" class="interlanguage-link-target">
Български
</a>
</li>
So let’s first access all <li> elements. We can isolate them using the class name:
languages = browser.find_elements_by_class_name('interlanguage-link')
languages is a list of WebElement objects. If we print its first element with:
print(languages[0])
It will print something like this:
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="73e70f48-851a-764d-8533-66f738d2bcf6", element="2a579b98-1a03-b04f-afe3-5d3da8aa9ec1")>
So to actually see what’s inside, we need to loop over the list, access each element’s <a> child element, and get that <a> element’s text and 'href' attribute.
To get the text we can use the text attribute. To get the 'href', use the get_attribute('attribute_name') method. So the code will look like this:
language_names = [language.find_element_by_css_selector('a').text
                  for language in languages]
links = [language.find_element_by_css_selector('a').get_attribute('href')
         for language in languages]
You can print out language_names and links to see that it worked.
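Once the two lists exist, they can be stored for later use. The sketch below writes them to a CSV file with Python's standard csv module; save_links and the file name languages.csv are hypothetical names chosen for illustration, not part of the original code:

```python
import csv


def save_links(names, urls, path):
    """Write one (name, url) row per language to a CSV file and return the rows."""
    rows = list(zip(names, urls))
    # newline="" lets the csv module control line endings; utf-8 keeps
    # non-Latin language names such as "Български" intact
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["language", "url"])  # header row
        writer.writerows(rows)
    return rows


# Usage, assuming language_names and links were built as shown above:
#   save_links(language_names, links, "languages.csv")
```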
Scrolling
Sometimes the whole page is not loaded at once. In this case we can make the browser scroll down to get the HTML from the rest of the page. This is quite easy with the execute_script() method, which takes JavaScript code as a parameter:
scroll_down = "window.scrollTo(0, document.body.scrollHeight);"
browser.execute_script(scroll_down)
scrollTo(x-coord, y-coord) is a JavaScript method that scrolls to the given coordinates. In our case we are using document.body.scrollHeight, which returns the height of the element (in this case body).
As you might have guessed, you can make the browser execute all kinds of scripts with the execute_script() method. So if you have experience with JavaScript, you have a lot of room to experiment.
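On pages that keep loading content as you scroll, a single scroll may not be enough. One possible approach, sketched here under the assumption that the page grows document.body.scrollHeight as new content arrives, is to keep scrolling until the height stops changing. scroll_to_bottom, pause and max_rounds are hypothetical names introduced for this example:

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing; return the rounds used."""
    get_height = "return document.body.scrollHeight;"
    last_height = driver.execute_script(get_height)
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            return round_no  # nothing new was loaded, so we are at the bottom
        last_height = new_height
    return max_rounds  # gave up: the page was still growing


# Usage, assuming `browser` is a live WebDriver:
#   scroll_to_bottom(browser, pause=2.0)
```

The max_rounds cap keeps the loop from running forever on pages with endless feeds.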
Clicking
Clicking is as easy as selecting an element and calling the click() method on it. In some cases, if you already know the URLs you need to visit, you can make the browser load those pages directly with get(). Again, see which is more efficient.
To give an example of the click() method, let’s click on the ‘Contents’ link from the menu on the left.
The HTML of this link is the following:
<li id="n-contents">
<a href="/wiki/Portal:Contents" title="Guides to browsing Wikipedia">
Contents
</a>
</li>
We have to find the <li> element with the unique id 'n-contents' first and then access its <a> child:
content_element = browser.find_element_by_id('n-contents') \
    .find_element_by_css_selector('a')
content_element.click()
You can see now that the browser loaded the ‘Contents’ page.
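As mentioned earlier, a known URL can be loaded directly instead of clicking. The href in the HTML above is relative, so it first has to be resolved against the address of the current page. A small sketch using urljoin from the standard library (the variable names are only illustrative):

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Main_Page"
relative_href = "/wiki/Portal:Contents"  # as written in the HTML above

# resolve the relative link against the page it appears on
contents_url = urljoin(page_url, relative_href)
print(contents_url)  # https://en.wikipedia.org/wiki/Portal:Contents

# With a live browser, this would load the same page the click leads to:
#   browser.get(contents_url)
```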
Downloading Images
Now what if we decide to download images from the page? For this we will use the urllib library and a UUID generator. We will first locate all images with the CSS selector 'img', then access their 'src' attributes, and then, creating a unique id for each image, download the images with the urlretrieve('url', 'folder/name.jpg') method. This method takes 2 parameters: the URL of the image and the file name we want to give it, including the folder we want to download to (if applicable).
import os
from urllib.request import urlretrieve
from uuid import uuid4
# get the main page again
browser.get('https://en.wikipedia.org/wiki/Main_Page')
# locate image elements
images = browser.find_elements_by_css_selector('img')
# access src attribute of the images
src_list = [img.get_attribute('src') for img in images]
# make sure the target folder exists before saving into it
os.makedirs('wiki_images', exist_ok=True)
for src in src_list:
    # create a unique name for each image by using UUID generator
    uuid = uuid4()
    # retrieve images using the URLs
    urlretrieve(src, f"wiki_images/{uuid}.jpg")
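The loop above saves every file with a .jpg extension, even when the image is a PNG or SVG. A possible refinement, sketched here with a hypothetical helper called image_path, is to take the extension from the URL itself and fall back to .jpg only when the URL has none:

```python
import os
from urllib.parse import urlparse
from uuid import uuid4


def image_path(src, folder="wiki_images", default_ext=".jpg"):
    """Build a unique local path that keeps the file extension found in the URL."""
    # urlparse(...).path drops any query string before the extension is read
    ext = os.path.splitext(urlparse(src).path)[1].lower() or default_ext
    return os.path.join(folder, f"{uuid4()}{ext}")


# Usage inside the download loop, assuming src is an image URL:
#   urlretrieve(src, image_path(src))
```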
Adding Waiting Time Between Actions
Lastly, it is sometimes necessary to introduce some waiting time between actions in the browser, for example when loading a lot of pages one after another. This can be done with the time module.
Let’s load 3 URLs from our links list and make the browser wait for 3 seconds before loading each next page, using the time.sleep() function.
import time
urls = links[0:3]
for url in urls:
    browser.get(url)
    # stop for 3 seconds before going for the next page
    time.sleep(3)
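The same idea can be wrapped in a reusable function. This is only a sketch: visit_all and its delay and sleep parameters are hypothetical names, and collecting the page titles is just one example of what could be done on each page. Unlike the loop above, it pauses only between page loads, so there is no wait after the last one:

```python
import time


def visit_all(driver, urls, delay=3.0, sleep=time.sleep):
    """Load each URL in turn, pausing between page loads; return the page titles."""
    titles = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # pause only between loads, not before the first one
        driver.get(url)
        titles.append(driver.title)  # title of the page that was just loaded
    return titles


# Usage, assuming `browser` and `links` exist as above:
#   titles = visit_all(browser, links[0:3], delay=3)
```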
Closing the WebDriver
And finally we can close our robot browser’s window with
browser.close()
Don’t forget that browser is a variable that holds an instance of the Firefox WebDriver (see the beginning of the tutorial).
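close() closes the current window, while the WebDriver's quit() method ends the whole session. If the script raises an error halfway through, neither call is reached and the browser stays open. Here is a minimal sketch of one way to guard against that, using a hypothetical helper open_browser that accepts any WebDriver class:

```python
from contextlib import contextmanager


@contextmanager
def open_browser(factory):
    """Create a browser via `factory` and always quit it when the block ends."""
    browser = factory()
    try:
        yield browser
    finally:
        browser.quit()  # ends the session and closes every window


# Usage, assuming Selenium and a Firefox driver are installed:
#   from selenium.webdriver import Firefox
#   with open_browser(Firefox) as browser:
#       browser.get('https://en.wikipedia.org/wiki/Main_Page')
#       print(browser.title)
```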
Code in GitHub
The code from this article is available on GitHub:
https://github.com/AnnaLara/scraping_with_selenium_basics