HTML Parser – Extract HTML information with ease

Published 6 years ago

Hello Coder,

This article presents a few practical code snippets for extracting and processing HTML information with an HTML parser dev tool written in Python on top of the BeautifulSoup (BS4) library.

Thanks for reading! – Content provided by App Generator.

The following topics are covered:

Load the HTML
Scan the file for assets: images, JavaScript files, CSS files
Change the path of an existing asset
Update existing elements: change the src attribute of an image
Locate an element based on the id
Remove an element from the DOM tree
Process an existing component: remove hardcoded text
Save the processed HTML to a file

HTML Parsing Concept

According to Wikipedia, parsing, or syntactic analysis, is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. In this article, HTML parsing means loading the HTML, extracting and processing the relevant information (head title, page assets, main sections) and, later on, saving the processed file.

Parser Environment

The code uses BeautifulSoup, the well-known HTML parsing library for Python. To start coding, we need a few packages installed on our system.

$ pip install ipython # the console where we execute the code
$ pip install requests # a library to pull the entire HTML page
$ pip install beautifulsoup4 # the real magic is here (imported as bs4)

Load the HTML content

The file is loaded like any other file, and its content is passed to a BeautifulSoup object:

from bs4 import BeautifulSoup as bs

# Load the HTML content
with open('index.html', 'r') as html_file:
    html_content = html_file.read()

# Initialize the BS object
soup = bs(html_content, 'html.parser')

# At this point, we can interact with the HTML
# elements stored in memory using all helpers offered by the BS library
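
To try the loading step without an index.html file on disk, the same object can be built from a string. This is a minimal sketch: SAMPLE_HTML is a made-up page used only for illustration, and the two lookups show the kind of helpers available once the tree is in memory.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for the contents of index.html
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body><h1 id="top">Hello</h1></body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.title.string      # text of the <title> tag
heading = soup.find("h1")           # first <h1> in the tree

print(page_title)
print(heading["id"], heading.get_text())
```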

Parse the HTML for assets

At this point, we have the DOM tree loaded in the BeautifulSoup object. Let’s scan the DOM tree for JavaScript files, the script nodes:

...
<script type='text/javascript' src='js/bootstrap.js'></script>
<script type='text/javascript' src='js/custom.js'></script>
...

The code snippet that locates the JavaScript files has only a few lines. The BS library returns a list of objects, and we can read or mutate each script node with ease. Note that recursive=False restricts the search to the direct children of body:

for script in soup.body.find_all('script', recursive=False):

   # Print the src attribute
   print(' JS source = ' + script['src'])

   # Print the type attribute
   print(' JS type = ' + script['type'])
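
The loop above assumes that every script node carries both a src and a type attribute, and indexing a missing attribute raises a KeyError. A slightly more defensive sketch, using a made-up SAMPLE_HTML and a default type chosen here only for illustration, could look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two external scripts and one inline script
SAMPLE_HTML = """
<body>
  <script type="text/javascript" src="js/bootstrap.js"></script>
  <script src="js/custom.js"></script>
  <script>console.log("inline");</script>
</body>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

js_files = []
for script in soup.find_all("script"):
    src = script.get("src")          # None for inline scripts
    if src is None:
        continue
    # Fall back to a default when the type attribute is absent
    js_type = script.get("type", "text/javascript")
    js_files.append((src, js_type))

print(js_files)
```

Using get() instead of square brackets lets the scan skip inline scripts rather than stop on the first one.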

In a similar way, we can select and process the CSS nodes:

...
<link rel="stylesheet" href="css/bootstrap.min.css">
<link rel="stylesheet" href="css/app.css">
...

And the code:

for link in soup.find_all('link'):

   # Print the href attribute
   print(' CSS file = ' + link['href'])
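
Not every link node is a stylesheet (favicons use the same tag), so it may help to filter on the rel attribute. The sketch below uses made-up markup; the favicon entry is an assumption added to show the filtering.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two stylesheets and one favicon
SAMPLE_HTML = """
<head>
  <link rel="stylesheet" href="css/bootstrap.min.css">
  <link rel="stylesheet" href="css/app.css">
  <link rel="icon" href="favicon.ico">
</head>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

css_files = []
for link in soup.find_all("link"):
    # BeautifulSoup returns rel as a list of values, e.g. ['stylesheet']
    rel_values = link.get("rel") or []
    if "stylesheet" in rel_values and link.get("href"):
        css_files.append(link["href"])

print(css_files)
```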

Parse the HTML for images

In this code snippet, we mutate the node and change the src attribute of the image node:

...
<img src="images/pic01.jpg" alt="Brad Pitt">
...

for img in soup.body.find_all('img'):

   # Print the path
   print(' IMG src = ' + img['src'])

   img_path = img['src']
   img_file = img_path.split('/')[-1] # extract the last segment, aka image file

   # set the new path
   img['src'] = '/assets/img/' + img_file
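
The path rewrite can also be wrapped in a small helper so the same logic applies to any asset type. This is only a sketch: rebase_asset is a hypothetical helper name, and the second image without a src attribute is made up to show how missing attributes can be skipped.

```python
import posixpath
from bs4 import BeautifulSoup

# Hypothetical markup: one regular image and one without a src attribute
SAMPLE_HTML = '<body><img src="images/pic01.jpg" alt="Sample"><img alt="no source"></body>'

def rebase_asset(path, new_dir):
    """Keep the file name of `path` and place it under `new_dir`."""
    return posixpath.join(new_dir, posixpath.basename(path))

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

for img in soup.find_all("img"):
    if img.has_attr("src"):          # skip images without a src attribute
        img["src"] = rebase_asset(img["src"], "/assets/img")

new_paths = [img.get("src") for img in soup.find_all("img")]
print(new_paths)
```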

Locate an element based on the ID

This can be achieved by a single line of code. Let’s imagine that we have an element (div or span) with the id 1234:

...
<div id="1234" class="handsome">
Some text
</div>

And the code:

mydiv = soup.find("div", {"id": "1234"})

print(mydiv)

# delete the element
mydiv.decompose()
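
When the id might not exist in the page, find() returns None, and calling decompose() on it would fail. A minimal sketch with a guard, using made-up markup, could be:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the target div plus a paragraph that must survive
SAMPLE_HTML = '<body><div id="1234" class="handsome">Some text</div><p>Kept</p></body>'
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

mydiv = soup.find(id="1234")         # matches any tag with this id

if mydiv is not None:                # find() returns None when nothing matches
    mydiv.decompose()                # remove the element and its children

remaining = soup.find(id="1234")
print(remaining)
print(soup)
```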

Remove the hard-coded text

This code snippet is useful for component extraction and translation to different template engines. Let’s imagine that we have this simple component:

<div id="1234" class="cool">
   <span>Html Parsing</span>
   <span>the practical guide</span> 
</div>

If we want to use this component in PHP, the component becomes:

<div id="1234" class="cool">
   <span><?php echo $title ?></span>
   <span><?php echo $info ?></span> 
</div>

Or, for Jinja2 (a Python template engine):

<div id="1234" class="cool">
   <span>{{ title }}</span>
   <span>{{ info }}</span> 
</div>

To avoid the manual work, we can use a code snippet that automatically replaces the hard-coded text and prepares the component for a specific template engine:

from bs4 import NavigableString

# locate the div
mydiv = soup.find("div", {"id": "1234"})

print(mydiv) # print before processing

# iterate on div elements (copied to a list, since the loop edits the tree)
for tag in list(mydiv.descendants):

   # NavigableString is the text inside the tag,
   # not the tag itself
   if not isinstance(tag, NavigableString):

      print('Found tag = ' + tag.name + ' -> ' + tag.text)
      # this will print:
      # Found tag = span -> Html Parsing
      # Found tag = span -> the practical guide

      # replace the text for PHP (tag.text is read-only, so we set tag.string)
      tag.string = '<?php echo $title ?>'

      # or, replace the text for Jinja
      # tag.string = '{{ title }}'
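
The loop above writes the same title variable into every span, while the target component uses a different variable per span. One possible way to get there, sketched under the assumption that variable names are assigned in document order (the PLACEHOLDERS list is hypothetical), is:

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<div id="1234" class="cool">
   <span>Html Parsing</span>
   <span>the practical guide</span>
</div>
"""

# Assumed variable names, one per span, in document order
PLACEHOLDERS = ["title", "info"]

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
mydiv = soup.find("div", {"id": "1234"})

extracted = {}
for name, span in zip(PLACEHOLDERS, mydiv.find_all("span")):
    extracted[name] = span.get_text()        # keep the original text
    span.string = "{{ " + name + " }}"       # swap in a Jinja2 placeholder

print(extracted)
print(mydiv)
```

Keeping the removed text in a dictionary means it can later be passed back to the template as context values.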

To use the component, we can save the component to a file:


# mydiv is the processed component
# php_component is its string representation
# formatter=None keeps '<?php ... ?>' as-is instead of escaping '<' and '>'
php_component = mydiv.prettify(formatter=None)

with open('component.php', 'w') as file:
    file.write(php_component)
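
One thing worth checking before saving a PHP component is how the output formatter treats the '<' and '>' characters inside the placeholder. The short sketch below, with a made-up one-span component, compares the "html" formatter with formatter=None:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><span>Html Parsing</span></div>", "html.parser")
soup.span.string = "<?php echo $title ?>"

escaped = soup.div.prettify(formatter="html")   # entities substituted on output
raw = soup.div.prettify(formatter=None)         # strings written unchanged

print(escaped)
print(raw)
```

With the "html" formatter the placeholder comes out as escaped entities, which PHP would not execute. With formatter=None nothing is escaped at all, so it is best kept for components whose remaining text is known to be safe.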

At this point, the original div is extracted from the DOM, with the hard-coded text removed, and is ready to be used in a PHP or Python project.

Save the new HTML

Now we have the mutated DOM in a BeautifulSoup object, in memory. To save the content, we call prettify() and write the result to a new HTML file.


new_dom_content = soup.prettify(formatter="html")

with open('index_parsed.html', 'w') as file:
    file.write(new_dom_content)
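
A quick way to check the saved file is to read it back and parse it again. The sketch below writes to a temporary directory, which stands in for the project folder, and passes an explicit encoding as a precaution. The sample document is made up for illustration.

```python
import os
import tempfile
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>Sample page</title></head><body></body></html>",
    "html.parser",
)

# A temporary directory stands in for the project folder
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "index_parsed.html")

    # An explicit encoding avoids platform-dependent defaults
    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write(soup.prettify())

    # Read the file back to confirm it still parses
    with open(out_path, "r", encoding="utf-8") as in_file:
        reloaded = BeautifulSoup(in_file.read(), "html.parser")

# prettify() adds whitespace around the text, so strip it before comparing
saved_title = reloaded.title.string.strip()
print(saved_title)
```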

HTML Parser – Use Cases

I use HTML parsing quite a lot, especially for tasks that involve manual work:

process HTML themes to be used in a new project
extract hard-coded texts and extract components
translate flat HTML themes to Jinja, Mustache or PUG templates

From time to time, I’m publishing free samples in this public repository.

Resources

HTML Parser – supported by AppSeed
HTML Parser – How to use Python BS4 to work less
Developer Tools – Open-Source HTML Parser – related article
BeautifulSoup Html Parser documentation
HTML Parser – Convert HTML to Jinja2 and Php components – related blog article

Thanks! For more resources and tools, feel free to access:
