HTML Parser – Extract HTML information with ease

Hello Coder,

This article presents a few practical code snippets for extracting and processing HTML information, using an HTML parser dev tool written in Python on top of the BeautifulSoup (BS4) library.

Thanks for reading! – Content provided by App Generator.


The following topics will be covered:

  • Load the HTML
  • Scan the file for assets: images, JavaScript files, CSS files
  • Change the path of an existing asset
  • Update existing elements: change the src attribute of an image
  • Locate an element based on the id
  • Remove an element from the DOM tree
  • Process an existing component: remove hardcoded text
  • Save the processed HTML to a file

HTML Parsing Concept

According to Wikipedia, parsing (or syntactic analysis) is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. In this article, HTML parsing means loading the HTML, extracting and processing the relevant information (head title, page assets, main sections) and, later on, saving the processed file.


Parser Environment

The code uses BeautifulSoup, a well-known HTML parsing library for Python. To start coding, we need a few packages installed on our system.

$ pip install ipython # the console where we execute the code
$ pip install requests # a library to pull the entire HTML page
$ pip install beautifulsoup4 # the real magic is here (imported as bs4)



Load the HTML content

The file is opened like any other file, and its content is passed to a BeautifulSoup object:

from bs4 import BeautifulSoup as bs

# Load the HTML content
html_file = open('index.html', 'r')
html_content = html_file.read()
html_file.close() # clean up

# Initialize the BS object
soup = bs(html_content, 'html.parser')

# At this point, we can interact with the HTML
# elements stored in memory using all helpers offered by BS library

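The loading step can also be written with a context manager, so the file is closed even if reading fails. This is only a minimal sketch: the `load_soup` helper name, the `SAMPLE_HTML` markup and the UTF-8 default are illustrative assumptions, not part of the original tool.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here so the sketch runs without index.html
SAMPLE_HTML = """<html><head><title>Demo page</title></head>
<body><h1>Hello</h1></body></html>"""


def load_soup(path, encoding="utf-8"):
    """Read an HTML file and return it as a BeautifulSoup object."""
    # The with-block closes the file automatically
    with open(path, "r", encoding=encoding) as html_file:
        return BeautifulSoup(html_file.read(), "html.parser")


# BeautifulSoup accepts any string, so markup held in memory works the same way
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(soup.title.string)  # Demo page
```

With a real file on disk, the call would be `soup = load_soup('index.html')`.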


Parse the HTML for assets

At this point, we have the DOM tree loaded in the BeautifulSoup object. Let’s scan the DOM tree for JavaScript files, the script nodes:

...
<script type='text/javascript' src='js/bootstrap.js'></script>
<script type='text/javascript' src='js/custom.js'></script>
...


The code snippet that locates the JavaScript files has only a few lines. find_all() returns a list of tag objects, and we can inspect or mutate each script node with ease:

for script in soup.body.find_all('script', recursive=False):

   # Print the src attribute (inline scripts have none, so use get())
   print(' JS source = ' + str(script.get('src')))

   # Print the type attribute (it may be missing as well)
   print(' JS type = ' + str(script.get('type')))


In a similar way, we can select and process the CSS nodes:

...
<link rel="stylesheet" href="css/bootstrap.min.css">
<link rel="stylesheet" href="css/app.css">
...


And the code:

for link in soup.find_all('link'):

   # Print the href attribute
   print(' CSS file = ' + link['href'])

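The two scans above can be combined into a single pass that groups every asset by kind. The sketch below is one possible way to do it; the `collect_assets` helper and the sample markup are assumptions made for illustration. It skips inline scripts and `link` tags that are not stylesheets (a favicon, for example).

```python
from bs4 import BeautifulSoup

# Hypothetical sample page
SAMPLE_HTML = """
<html>
<head>
  <link rel="stylesheet" href="css/bootstrap.min.css">
  <link rel="icon" href="favicon.ico">
</head>
<body>
  <img src="images/pic01.jpg" alt="Sample">
  <script type="text/javascript" src="js/bootstrap.js"></script>
  <script>console.log("inline");</script>
</body>
</html>
"""


def collect_assets(soup):
    """Return the asset paths referenced by the page, grouped by kind."""
    return {
        # src=True keeps only the tags that actually carry a src attribute
        "js": [tag["src"] for tag in soup.find_all("script", src=True)],
        # rel can hold several values, so check it as a list
        "css": [
            tag["href"]
            for tag in soup.find_all("link", href=True)
            if "stylesheet" in tag.get_attribute_list("rel")
        ],
        "img": [tag["src"] for tag in soup.find_all("img", src=True)],
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
assets = collect_assets(soup)
for kind, paths in assets.items():
    print(kind, paths)
```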


Parse the HTML for images

In this code snippet, we mutate each image node and change its src attribute:

...
<img src="images/pic01.jpg" alt="Bred Pitt">
...


for img in soup.body.find_all('img'):

   # Print the path
   print(' IMG src = ' + img['src'])

   img_path = img['src']
   img_file = img_path.split('/')[-1] # extract the last segment, aka the image file

   # Set the new path
   img['src'] = '/assets/img/' + img_file

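The same idea extends to the other asset types. The sketch below applies one rule per tag; the `relocate_assets` helper, the `ASSET_RULES` table and the `/assets/...` target directories are hypothetical names chosen for illustration. It assumes every `link` tag with an `href` is a stylesheet and leaves absolute URLs untouched.

```python
import posixpath

from bs4 import BeautifulSoup

# (tag name, attribute holding the path, new directory): a hypothetical layout
ASSET_RULES = [
    ("img", "src", "/assets/img/"),
    ("script", "src", "/assets/js/"),
    ("link", "href", "/assets/css/"),  # assumes all links are stylesheets
]


def relocate_assets(soup, rules=ASSET_RULES):
    """Point every local asset at its new directory, keeping the file name."""
    for tag_name, attr, new_dir in rules:
        for tag in soup.find_all(tag_name, attrs={attr: True}):
            old_path = tag[attr]
            # Leave remote and inline (data:) assets as they are
            if old_path.startswith(("http://", "https://", "//", "data:")):
                continue
            tag[attr] = new_dir + posixpath.basename(old_path)
    return soup


soup = BeautifulSoup(
    '<link rel="stylesheet" href="css/app.css">'
    '<img src="images/pic01.jpg" alt="Sample">'
    '<script src="https://cdn.example.com/lib.js"></script>',
    "html.parser",
)
relocate_assets(soup)
print(soup.img["src"])   # /assets/img/pic01.jpg
print(soup.link["href"]) # /assets/css/app.css
```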


Locate an element based on the ID

This can be achieved with a single line of code. Let’s imagine that we have an element (div or span) with the id 1234:

...
<div id="1234" class="handsome">
Some text
</div>


And the code:

mydiv = soup.find("div", {"id": "1234"})

print(mydiv)

# delete the element
mydiv.decompose()

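Note that find() returns None when nothing matches, so calling decompose() on the result fails for a missing id. A small guard avoids that. This is a sketch only; the `remove_by_id` helper name and the sample markup are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
SAMPLE_HTML = '<body><div id="1234" class="handsome">Some text</div><p>Keep me</p></body>'


def remove_by_id(soup, element_id):
    """Remove the element with the given id; return True if it was found."""
    element = soup.find(id=element_id)  # matches any tag type, div or span
    if element is None:
        return False
    element.decompose()  # detach the node from the tree and destroy it
    return True


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(remove_by_id(soup, "1234"))     # True
print(remove_by_id(soup, "missing"))  # False
print(soup)                           # <body><p>Keep me</p></body>
```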


Remove the hard-coded text

This code snippet is useful for extracting components and translating them to different template engines. Let’s imagine that we have this simple component:

<div id="1234" class="cool">
   <span>Html Parsing</span>
   <span>the practical guide</span> 
</div>


If we want to use this component in PHP, it becomes:

<div id="1234" class="cool">
   <span><?php echo $title ?></span>
   <span><?php echo $info ?></span> 
</div>


Or, for Jinja2 (a Python template engine):

<div id="1234" class="cool">
   <span>{{ title }}</span>
   <span>{{ info }}</span> 
</div>


To avoid the manual work, we can use a code snippet that automatically replaces the hard-coded texts and prepares the component for a specific template engine:

from bs4 import NavigableString

# locate the div
mydiv = soup.find("div", {"id": "1234"})

print(mydiv) # print before processing

# iterate on div elements
for tag in mydiv.descendants:

   # NavigableString is the text inside the tag,
   # not the tag itself
   if not isinstance(tag, NavigableString):

      print('Found tag = ' + tag.name + ' -> ' + tag.text)
      # this will print:
      # Found tag = span -> Html Parsing
      # Found tag = span -> the practical guide

      # replace the text for PHP (tag.text is read-only, so assign tag.string)
      tag.string = '<?php echo $title ?>'

      # or, replace the text for Jinja
      # tag.string = '{{ title }}'

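The target components shown earlier use a different variable for each span (title, then info). One way to get there is to pair the spans with a list of variable names. The sketch below assumes that approach; the `placeholder` and `templatize` helpers are hypothetical, and `formatter=None` is used on output so the PHP markers are not escaped into `&lt;` and `&gt;`.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """<div id="1234" class="cool">
   <span>Html Parsing</span>
   <span>the practical guide</span>
</div>"""


def placeholder(name, engine):
    """Build the template expression for a variable name."""
    if engine == "php":
        return "<?php echo $" + name + " ?>"
    if engine == "jinja":
        return "{{ " + name + " }}"
    raise ValueError("unsupported engine: " + engine)


def templatize(component, names, engine):
    """Replace the text of each span with a placeholder, in document order."""
    for span, name in zip(component.find_all("span"), names):
        span.string = placeholder(name, engine)
    # formatter=None writes the strings as-is, without entity escaping
    return component.decode(formatter=None)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
jinja_component = templatize(soup.find(id="1234"), ["title", "info"], "jinja")
print(jinja_component)
```

Calling `templatize(..., "php")` on a freshly parsed copy produces the PHP variant in the same way.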

To use the component, we can save the component to a file:


# mydiv is the processed component;
# php_component is its string representation.
# formatter=None keeps markers like '<?php ... ?>' unescaped
php_component = mydiv.prettify(formatter=None)

file = open( 'component.php', 'w+')
file.write( php_component )
file.close()


At this point, the original div has been extracted from the DOM, with the hard-coded texts removed, and is ready to be used in a PHP or Python project.


Save the new HTML

Now we have the mutated DOM in a BeautifulSoup object, in memory. To save it, we call prettify() and write the result to a new HTML file.


new_dom_content = soup.prettify(formatter="html") 

file = open( 'index_parsed.html', 'w+') 
file.write( new_dom_content )
file.close()

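The save step can be wrapped in a small helper that sets the encoding explicitly and closes the file automatically. This is a sketch under assumptions: the `save_soup` name, the UTF-8 default and the temporary output directory are illustrative choices.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def save_soup(soup, path, encoding="utf-8"):
    """Write the (possibly mutated) DOM to a file as indented HTML."""
    # formatter="html" writes non-ASCII characters as HTML entities where possible
    with open(path, "w", encoding=encoding) as out_file:
        out_file.write(soup.prettify(formatter="html"))


soup = BeautifulSoup(
    "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>",
    "html.parser",
)

# Write to a temporary directory so the sketch does not touch the project files
out_path = os.path.join(tempfile.mkdtemp(), "index_parsed.html")
save_soup(soup, out_path)

# Reload the file to check that the round trip keeps the content
with open(out_path, "r", encoding="utf-8") as saved_file:
    reloaded = BeautifulSoup(saved_file.read(), "html.parser")
print(reloaded.title.get_text(strip=True))  # Demo
```

Keep in mind that prettify() adds newlines and indentation, so the text inside tags gains surrounding whitespace; that is why the check above strips it.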


HTML Parser – Use Cases

I use HTML parsing quite a lot, especially for tasks that would otherwise involve manual work:

  • process HTML themes to be used in a new project
  • extract hard-coded texts and extract components
  • translate flat HTML themes to Jinja, Mustache or PUG templates

From time to time, I publish free samples in this public repository.

Resources


Thanks! For more resources and tools, feel free to access:

Original article: HTML Parser – Extract HTML information with ease
