What is web scraping?
Web scraping is the process of extracting HTML/XML data from websites. Once extracted, the data can be parsed into a different HTML file or saved locally as text or spreadsheet documents.
Who does it?
A lot of websites aggregate data from other websites on the internet. Some examples are sites that show you the best deal on a product after comparing prices across multiple platforms (Amazon, Flipkart, eBay, etc.), and sites that collect datasets to apply ML algorithms to.
How is it useful to me?
I would recommend that you not limit your thinking to how something could benefit you, especially when you know little to nothing about it. It helps to be a generalist when you’re just starting out. Learn everything; you never know when you’ll need it! You can always settle and specialize in one area eventually, once you’re well aware of the options you have.
What we’ll need
- Python v3.6.8
- VSCode
Installing Python (skip if already installed)
- Go to python.org > Downloads > Windows
- Scroll to version 3.6.8 > x86 (32 bit) / x86-64 (64 bit) > Executable
- Double-click and check “Add Python to PATH”
- Follow the installation instructions.
- Check that it installed correctly:
- Press Windows key+R and type “cmd” to open the command line.
- In the command line, type
python --version
If Python is installed correctly, you should see Python 3.6.8 in the terminal.
Installing VSCode (skip if already installed)
VSCode is a free code editor with lots of features that make writing and debugging code much easier.
- Go to code.visualstudio.com > Download for Windows > x86/x64 > Installer.
- Double-click and follow the instructions.
Let’s begin!
- Create a new folder and call it “Webscraper”
- Inside the folder, create a new file named webscraper.py
- Open VSCode > File > Open Folder > Navigate to “Webscraper”
Now we need to import a few libraries which will help us build our web scraper.
- Go to Terminal > New Terminal
This is basically the command line but within the editor so we don’t have to have two windows and keep switching between them.
- Next we call pip, the package installer that comes with Python.
You could call it the Alfred to Python’s Batman. Hehe.
- In your terminal, type
pip install beautifulsoup4
This installs the beautifulsoup library which will help us scrape webpages.
- Next type
pip install flask
and
pip install requests
Flask is a lightweight framework for building websites. We’ll use it to display our collected data as HTML in a new HTML file.
The requests module allows us to send HTTP requests to the website we want to scrape.
In your file, type the following code:
from flask import Flask, render_template
from bs4 import BeautifulSoup
import requests
The first line imports the Flask class and the render_template function from the flask library. The second line imports the BeautifulSoup class from bs4, and the third line imports the requests module we installed with pip.
Next, we declare a variable which will hold the result of our request
source = requests.get('https://webscraper.droppages.com/').text
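We send a GET request to https://webscraper.droppages.com, take the HTML of the response as plain text, and store that in the source variable.
Next we declare a soup variable and store the value we get after passing source to BeautifulSoup. 'lxml' is the parser BeautifulSoup uses to read the HTML. It is a separate package, so run pip install lxml first, or pass 'html.parser' instead to use the parser built into Python.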
soup = BeautifulSoup(source, 'lxml')
At this point, we have our code working. You can check it by passing soup to print(), like print(soup), after the previous line and running python webscraper.py in the terminal.
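If you'd like to see what BeautifulSoup does without hitting the network, here is a small sketch that parses a hard-coded HTML string. The sample markup and the variable names (sample_html, sample_soup, page_title, heading) are made up for illustration, and it uses Python's built-in 'html.parser' so nothing beyond beautifulsoup4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <header><h1>Hello, scraper</h1></header>
  </body>
</html>
"""

# 'html.parser' ships with Python; 'lxml' also works if it is installed.
sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.text    # text inside <title>
heading = sample_soup.header.h1.text   # text inside <header><h1>

print(page_title)
print(heading)
```

The soup object is a tree, so you can walk down it by tag name just like sample_soup.header.h1 does here.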
Right now, we are pulling the entire web page rather than specific elements on it. To get specific elements, you can try the patterns below by yourself.
But before you do that, you should know exactly what you want to get. You can either run the last command again or open the web page in the browser and inspect it by right-clicking on the page and choosing Inspect. Some knowledge of the HTML DOM and CSS is required here. You can head over to W3Schools or MDN for a quick crash course on both.
<variable> = soup.find('<HTML_element_name>')
<variable> = soup.find('<HTML_element_name>').select_one('child_element')
<variable> = soup.find('<HTML_element_name>').find_all('child_element')
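With select_one (and select, which returns every match), you can pass regular CSS selectors in the brackets to be more specific about the elements you want. find and find_all take a tag name instead.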
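To make the three patterns above concrete, here is a rough sketch run against a tiny hand-written page. The markup and variable names are assumptions for illustration only; the real site's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's structure may differ.
sample_html = """
<main>
  <article><p>First author</p><div>First body</div></article>
  <article><p>Second author</p><div>Second body</div></article>
</main>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

main = sample_soup.find("main")                 # first <main> element
first_paragraph = main.select_one("article p")  # first match of a CSS selector
all_articles = main.find_all("article")         # list of every <article> in <main>

print(first_paragraph)
print(len(all_articles))
```

find and select_one give you a single element (or None when nothing matches), while find_all gives you a list you can loop over.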
So far, we are outputting the HTML tags along with the text inside them. What if we just want the text?
That’s easy.
We simply add .text at the end, just like we did with source. Here’s an example.
head = soup.find('main').select_one('article:nth-of-type(4)').div.text
Here, we tell Python to store the text of the div in the 4th article element which is in the main element, in the head variable.
You can check the output by passing head to print() and running python webscraper.py in the terminal.
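Here is a small sketch of the difference between an element and its text, again on made-up markup. It also shows get_text(strip=True), a BeautifulSoup method that trims surrounding whitespace, which the tutorial itself does not use.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with extra spaces around the second body.
sample_html = """
<main>
  <article><div>Body one</div></article>
  <article><div>  Body two  </div></article>
</main>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

second_div = sample_soup.find("main").select_one("article:nth-of-type(2)").div

as_html = str(second_div)                         # markup and text together
as_text = second_div.text                         # text only, whitespace kept
as_clean_text = second_div.get_text(strip=True)   # text only, whitespace trimmed

print(as_html)
print(as_clean_text)
```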
Try getting the name of one of the authors if you can.
You can get an author like this:
author = soup.find('main').select_one('p').text
Notice how you also get the date along with the name. That’s because both of them share the same element. There is a way to get the author name separately by using Python string methods like split, or by slicing the string. But we won’t cover that here.
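Although the tutorial doesn't go into it, a quick sketch of the idea might look like this. The author_line string is invented for the example; the text on the real page may use a different separator, so adjust the "," accordingly.

```python
# Hypothetical text; the real paragraph may use a different separator.
author_line = "Jane Doe, 12 March 2020"

# split(",", 1) cuts at the first comma only, giving at most two parts.
name_part, date_part = author_line.split(",", 1)
author_name = name_part.strip()
published = date_part.strip()

# Slicing gets the same name by position: everything before the comma.
author_by_slice = author_line[:author_line.index(",")]

print(author_name)
print(published)
```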
Next up, we will use Flask to re-render the data we received, the way we want, on a local server.
In your file, type the following code,
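app = Flask(__name__)

@app.route('/')
def index():
    # Pass every local variable of index() to the template.
    return render_template('index.html', **locals())

# Start the development server (http://localhost:5000 by default).
app.run(debug=True)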
Create a new folder named templates inside your main Webscraper folder, and inside it create a file named index.html
The Flask part is a little complicated to explain but, to put it simply, we created a simple server that will take our index.html from the templates folder and serve it on a local server at http://localhost:5000
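If you want to see how a variable travels from Python into the HTML without creating any files, here is a minimal sketch. It swaps render_template for Flask's render_template_string so the template can live inline, and it uses Flask's test client instead of a running server. The names demo_app, demo_index and DEMO_TEMPLATE are made up for this example.

```python
from flask import Flask, render_template_string

demo_app = Flask(__name__)

# Inline stand-in for templates/index.html so the sketch needs no extra files.
DEMO_TEMPLATE = "<h1>{{ head }}</h1>"

@demo_app.route("/")
def demo_index():
    head = "Scraped heading"  # placeholder for a value pulled out of soup
    # **locals() expands to head=head, making the variable visible in the template.
    return render_template_string(DEMO_TEMPLATE, **locals())

# The test client calls the route in-process, without starting a server.
with demo_app.test_client() as client:
    response = client.get("/")
    body = response.get_data(as_text=True)

print(response.status_code)
print(body)
```

The {{ head }} placeholder in the template is filled in with whatever value the head variable holds when the function returns.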
Now, we can combine multiple variables we declared in all the previous code using soup and pass the text to our HTML and use CSS to style them the way we want!
You can use this code for the index.html file:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Webscraper in Python using Flask</title>
</head>
<body>
<!-- Variables from Flask here -->
</body>
</html>
Now, we can use all the code we learned so far to create custom variables and pull specific data from our site. If we are well versed with the structure of our target site, we can use shortcuts like these.
head = soup.header.h1.text
second_author = soup.main.select_one('article:nth-of-type(2)').p.text
first_article = soup.main.article.div
- Type these inside the index() function that we created, just above the return statement.
- Save the file
- Go to index.html
Now we’ll pass these variables into our HTML while it gets rendered so we can see the data on our webpage.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Webscraper in Python using Flask</title>
</head>
<body>
<h1>{{ head }}</h1>
<p>{{ second_author }}</p>
<article>{{ first_article }}</article>
</body>
</html>
Now open the terminal and run python webscraper.py, then open http://localhost:5000 in your browser.
Aand we did it!
If you’re wondering how it’s so easy, well, it’s not. This was just a single page, and a simple one at that, with no classes or IDs added to the HTML. But this is a good start.
Wonder how you can scrape multiple pages?
The answer is a combination of for and while loops, try/except blocks and if-else statements!
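As a rough idea of what that could look like, here is a sketch that loops over a list of URLs and skips any page that fails to load. The helper names fetch_text and scrape_pages are invented for this example, and the 10 second timeout is an arbitrary choice.

```python
import requests

def fetch_text(url):
    """Download one page and return its HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text

def scrape_pages(urls, fetch=fetch_text):
    """Fetch every URL in the list, skipping the ones that fail."""
    pages = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except requests.RequestException as error:
            # Network errors and bad status codes both end up here.
            print(f"Skipping {url}: {error}")
    return pages
```

You could call it as scrape_pages(['https://webscraper.droppages.com/']) and then pass each value in the returned dictionary to BeautifulSoup. The fetch parameter is only there so the download step can be replaced, for example while testing offline.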
Hello, this was my very first technical article. If you find any errors in the code or the way I approached the tutorial, feel free to correct me. I’m excited to be part of this community as I grow with it and intend to contribute meaningful content. Thank you for reading!