Basics of Scraping with Python

Prologue

Hello, in this post I'm going to describe the process of writing a scraper script in Python, with the help of the Beautiful Soup library.

Installing the dependencies

First of all, since Beautiful Soup is a third-party community project, you have to install it from the PyPI registry.

pip install beautifulsoup4
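
To confirm the install worked, you can ask Python for the package version (a quick sanity check; the exact version number will differ on your machine):

python -c "import bs4; print(bs4.__version__)"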


Philosophy of Beautiful Soup

Beautiful Soup is a library that sits on top of an HTML/XML parser (in our case, the former). It doesn't fetch or parse documents by itself; it takes the tree produced by the underlying parser and exposes it as Python objects you can search and navigate.
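
To make that layering concrete, here is a minimal sketch showing some markup handed to the built-in parser (the markup string is just a made-up example):

from bs4 import BeautifulSoup

# A tiny, hand-written document to parse.
markup = "<html><body><p>Hello</p></body></html>"

# 'html.parser' ships with the Python standard library.
# Alternatives like 'lxml' or 'html5lib' work too, but need a separate pip install.
soup = BeautifulSoup(markup, 'html.parser')

print(soup.p.text)  # -> Hello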

Basic Script

Now that we know how it works, let’s write a tiny script:

from urllib.request import urlopen
from bs4 import BeautifulSoup


WEBSITE = "https://google.com"


# Download the page, then hand the raw HTML to Beautiful Soup.
html = urlopen(WEBSITE)
bs = BeautifulSoup(html.read(), 'html.parser')


In this example we also make use of urlopen() from the standard library's urllib.request module, which just downloads the HTML for us.
We then read the response stored in the html variable, which contains the google.com document, and feed it to Beautiful Soup for parsing.
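
Real downloads fail all the time, so a slightly more defensive version of the same step might look like this (a minimal sketch; the original script skips error handling for brevity):

from urllib.error import HTTPError, URLError
from urllib.request import urlopen
from bs4 import BeautifulSoup

WEBSITE = "https://google.com"

try:
    html = urlopen(WEBSITE)
except HTTPError as e:
    # The server answered, but with an error status (404, 500, ...).
    print("HTTP error:", e.code)
except URLError as e:
    # The server could not be reached at all (DNS failure, refused connection, ...).
    print("Server unreachable:", e.reason)
else:
    bs = BeautifulSoup(html.read(), 'html.parser')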

Parsing data

Sometimes, we want to get specific parts of a document, such as a paragraph or an image.

You can search for a specific HTML tag in Beautiful Soup with the find() method.

Let’s scrape the Google logo tag from their homepage!
Add the following lines of code to the existing script:

google_logo = bs.find('img', { 'id': 'hplogo' })
print(google_logo)


These two lines of code should produce output similar to this (Google's markup changes over time, so yours may differ):

<img 
alt="Google" 
height="92" 
id="hplogo" 
src="/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png"
style="padding:28px 0 14px" 
width="272"/>


So, how does this work?
Well, we call the find() method and pass it some arguments.
To be exact, we tell it that we are looking for an <img> tag whose id attribute is 'hplogo'.
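
find() only returns the first match; a few related search patterns, reusing the bs object from above, look like this (a minimal sketch; the tag names are assumptions and may not match the live page):

# Every <img> tag on the page, as a list:
images = bs.find_all('img')

# Keyword-argument shorthand for attribute filters:
logo = bs.find('img', id='hplogo')

# find() returns None when nothing matches, so guard before using the result:
if logo is not None:
    print(logo.get('src'))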

Epilogue

That's all!
To learn more about Beautiful Soup, read the official docs.
