Prologue
Hello! In this post I am going to describe the process of writing a scraper script in Python, with the help of the Beautiful Soup library.
Installing the dependencies
First of all, since Beautiful Soup is a 3rd-party community project, you have to install it via the PyPI registry.
pip install beautifulsoup4
Philosophy of Beautiful Soup
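Beautiful Soup (BS) is a library that sits atop an HTML/XML parser (in our case, the former). The parser turns the raw markup into a tree, and BS gives us a convenient way to navigate and search that tree.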
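To make that idea a bit more concrete, here is a minimal sketch that parses a small HTML string instead of a real website. The sample markup is made up for illustration; 'html.parser' is the parser bundled with Python's standard library.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, used only for illustration
html = "<html><head><title>Demo</title></head><body><p>Hello, <b>soup</b>!</p></body></html>"

# Hand the markup to Beautiful Soup and tell it which parser to use
soup = BeautifulSoup(html, "html.parser")

# Navigate the resulting tree by tag name
title_text = soup.title.string      # text inside <title>
paragraph_text = soup.p.get_text()  # text of the first <p>, nested tags included

print(title_text)
print(paragraph_text)
```

The parser does the low-level work of reading the markup, while Beautiful Soup exposes the result as objects you can walk through.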
Basic Script
Now that we know how it works, let’s write a tiny script:
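from urllib.request import urlopen
from bs4 import BeautifulSoup
WEBSITE = "https://google.com"
# Download the page; urlopen() returns a response object
html = urlopen(WEBSITE)
# Read the raw HTML and parse it with Python's built-in HTML parser
bs = BeautifulSoup(html.read(), 'html.parser')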
In this example, we also make use of urllib.request from the standard library, which just downloads the HTML for us.
Then, we call read() on the html variable, which holds the response from google.com, and pass the resulting document to BeautifulSoup.
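Downloads can fail or hang, so it may be worth wrapping this step in a small helper. The sketch below is one possible way to do it, not the only one: the function name fetch_soup and the 10-second default timeout are assumptions made for this example.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10.0):
    """Download url and return a parsed tree, or None if the download fails.

    fetch_soup is a hypothetical helper name chosen for this sketch.
    """
    try:
        # The with-block closes the connection once the body has been read
        with urlopen(url, timeout=timeout) as response:
            html = response.read()
    except OSError as error:
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        print(f"Could not download {url}: {error}")
        return None
    return BeautifulSoup(html, "html.parser")


# Example usage (needs network access):
# bs = fetch_soup("https://google.com")
```

Returning None on failure keeps the calling code simple: check the result once, then carry on with the parsing.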
Parsing data
Sometimes, we want to get specific parts of a document, such as a paragraph or an image.
You can search for a specific HTML tag in BeautifulSoup with the find() method.
Let’s scrape the Google logo tag from their homepage!
Add the following lines of code to the already existing file:
google_logo = bs.find('img', { 'id': 'hplogo' })
print(google_logo)
These two lines of code will hopefully produce this output:
<img
alt="Google"
height="92"
id="hplogo"
src="/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png"
style="padding:28px 0 14px"
width="272"/>
So, how does this work?
Well, we are calling the find() method and passing it two arguments.
To be exact, we are telling it that we are searching for an <img>
tag whose id attribute is 'hplogo'. find() returns the first matching tag.
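Since a live page can change at any time, here is a small offline sketch of the same lookup. The sample markup below is made up for illustration; it only mimics the shape of the logo tag. It also shows what find() gives back when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates a page with two images
sample_html = """
<div>
  <img id="banner" src="/banner.png" alt="Banner">
  <img id="hplogo" src="/logo.png" alt="Google">
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Same call as in the post: tag name plus a dict of attributes to match
logo = soup.find("img", {"id": "hplogo"})

# Equivalent lookup using a keyword argument instead of a dict
same_logo = soup.find("img", id="hplogo")

# find() returns None when no tag matches, so check before using the result
missing = soup.find("img", {"id": "does-not-exist"})

if logo is not None:
    print(logo["src"])      # attribute access works like a dict
    print(logo.get("alt"))  # .get() returns None if the attribute is absent
```

If the real homepage no longer contains a tag with that id, print(google_logo) will simply show None, which is why the None check is worth having.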
Epilogue
That’s all.
To learn more about Beautiful Soup, read the docs