Basics of Scraping with Python

Prologue

Hello, in this post I'm going to describe the process of writing a scraper script in Python, with the help of the Beautiful Soup library.

Installing the dependencies

First of all, since Beautiful Soup is a third-party community project, you have to install it from the PyPI registry.

pip install beautifulsoup4
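
To confirm the install worked, you can ask Python for the package version (a quick sanity check; the exact version number will differ on your machine):

python -c "import bs4; print(bs4.__version__)"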


Philosophy of Beautiful Soup

Beautiful Soup is a library that sits on top of an HTML/XML parser (in our case, the former). It doesn't fetch or parse documents by itself; it takes the tree produced by the underlying parser and exposes it as Python objects you can search and navigate.
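
To make that layering concrete, here is a minimal sketch showing some markup handed to the built-in parser (the markup string is just a made-up example):

from bs4 import BeautifulSoup

# A tiny, hand-written document to parse.
markup = "<html><body><p>Hello</p></body></html>"

# 'html.parser' ships with the Python standard library.
# Alternatives like 'lxml' or 'html5lib' work too, but need a separate pip install.
soup = BeautifulSoup(markup, 'html.parser')

print(soup.p.text)  # -> Hello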

Basic Script

Now that we know how it works, let’s write a tiny script:

from urllib.request import urlopen
from bs4 import BeautifulSoup


WEBSITE = "https://google.com"


# Download the page, then hand the raw HTML to Beautiful Soup.
html = urlopen(WEBSITE)
bs = BeautifulSoup(html.read(), 'html.parser')


In this example we also make use of urlopen() from the standard library's urllib.request module, which just downloads the HTML for us.
We then read the response stored in the html variable, which contains the google.com document, and feed it to Beautiful Soup for parsing.
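
Real downloads fail all the time, so a slightly more defensive version of the same step might look like this (a minimal sketch; the original script skips error handling for brevity):

from urllib.error import HTTPError, URLError
from urllib.request import urlopen
from bs4 import BeautifulSoup

WEBSITE = "https://google.com"

try:
    html = urlopen(WEBSITE)
except HTTPError as e:
    # The server answered, but with an error status (404, 500, ...).
    print("HTTP error:", e.code)
except URLError as e:
    # The server could not be reached at all (DNS failure, refused connection, ...).
    print("Server unreachable:", e.reason)
else:
    bs = BeautifulSoup(html.read(), 'html.parser')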

Parsing data

Sometimes, we want to get specific parts of a document, such as a paragraph or an image.

You can search for a specific HTML tag in Beautiful Soup with the find() method.

Let’s scrape the Google logo tag from their homepage!
Add the following lines of code to the existing script:

google_logo = bs.find('img', { 'id': 'hplogo' })
print(google_logo)


These two lines of code should produce output similar to this (Google's markup changes over time, so yours may differ):

<img 
alt="Google" 
height="92" 
id="hplogo" 
src="/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png"
style="padding:28px 0 14px" 
width="272"/>


So, how does this work?
Well, we call the find() method and pass it some arguments.
To be exact, we tell it that we are looking for an <img> tag whose id attribute is 'hplogo'.
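
find() only returns the first match; a few related search patterns, reusing the bs object from above, look like this (a minimal sketch; the tag names are assumptions and may not match the live page):

# Every <img> tag on the page, as a list:
images = bs.find_all('img')

# Keyword-argument shorthand for attribute filters:
logo = bs.find('img', id='hplogo')

# find() returns None when nothing matches, so guard before using the result:
if logo is not None:
    print(logo.get('src'))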

Epilogue

That's all!
To learn more about Beautiful Soup, read the official docs.
