Today we are going to look at an awesome Python module. Scraping is fun if you have tried it before. Scraping and crawling are often used interchangeably, but there is a subtle difference. Web crawling is basically what Google, Facebook, etc. do: it looks for any information it can find. Scraping, on the other hand, targets specific websites for specific data, e.g. product information and prices.
Check Whether Your Development Environment Is Ready
Before moving forward, we need to check whether Python is available. Open a terminal or command line and run:
```shell
python --version
# Output: Python 2.7.16
```
Or,
```shell
python3 --version
# Output: Python 3.8.0
```
If everything looks good, you are all set. Your Python version may differ from mine, so don't worry about that. If you see "not found", install Python from here.
Setup Virtual Environment
We create a virtual environment to avoid version conflicts between Python modules, dependencies, and libraries. It ensures isolation, so each project's dependency versions can be maintained easily.
Open a terminal or command line, then create a project:
macOS Users:-
```shell
pip install virtualenv
python3 -m virtualenv venv
source venv/bin/activate
```
Windows Users:-
```shell
pip install virtualenv
virtualenv venv
venv\Scripts\activate
```
We can see that a venv folder has been created. Congratulations, the virtual environment is ready.
Install Required Libraries or Modules
Open a terminal or command line, then run the commands below:
```shell
pip install beautifulsoup4
pip install cfscrape
```
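To make this setup reproducible later (or on another machine), you can pin the installed versions to a file. The filename `requirements.txt` is the usual convention, though any name works:

```shell
# Record the exact versions currently installed in the virtual environment
pip freeze > requirements.txt

# Later, recreate the same environment from that file
pip install -r requirements.txt
```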
Learn How Basic Scraping Works
Create an app.py file containing:
```python
import cfscrape
from bs4 import BeautifulSoup


def basic():
    # Sample HTML as a string
    html_text = '''
    <div>
        <h1 class="product-name">Product Name 1</h1>
        <h1 custom-attr="price" class="product-price">100</h1>
        <p class="product description">This is basic description 1</p>
    </div>
    <div>
        <h1 class="product-name">Product Name 2</h1>
        <h1 custom-attr="price" class="product-price">200</h1>
        <p class="product description">This is basic description 2</p>
    </div>
    '''

    parsed_html = BeautifulSoup(html_text, 'html.parser')  # string to HTML
    # parsed_html = BeautifulSoup(open("from_file.html"), 'html.parser')  # file to HTML
    # Note: BeautifulSoup does not fetch URLs itself; to parse a live page,
    # download it first (e.g. with cfscrape or requests) and pass the HTML text.

    print(parsed_html.select(".product-name")[0].text)
    print(parsed_html.select(".product-name")[1].text)
    print(parsed_html.select(".product.description")[0].text)
    print(parsed_html.find_all("h1", {"custom-attr": "price"})[0].text)
    print(parsed_html.find("h1", {"custom-attr": "price"}).text)


if __name__ == '__main__':
    basic()
```
Now, open a terminal and run python app.py to execute the file.
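Indexing each result one by one gets tedious as pages grow. select() returns a list, so you can iterate over all matches at once. Here is a small sketch using the same sample markup as above (trimmed to names and prices for brevity):

```python
from bs4 import BeautifulSoup

# Same sample structure as in app.py, trimmed to names and prices
html_text = '''
<div>
  <h1 class="product-name">Product Name 1</h1>
  <h1 custom-attr="price" class="product-price">100</h1>
</div>
<div>
  <h1 class="product-name">Product Name 2</h1>
  <h1 custom-attr="price" class="product-price">200</h1>
</div>
'''

parsed_html = BeautifulSoup(html_text, 'html.parser')

# Pair each product name with its price by walking the two result lists together
for name, price in zip(parsed_html.select(".product-name"),
                       parsed_html.select(".product-price")):
    print(f"{name.text}: {price.text}")
# Prints:
# Product Name 1: 100
# Product Name 2: 200
```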
Learn Anti-Bot Scraping
In the same app.py file, add the following function (it reuses the cfscrape and BeautifulSoup imports from above):
```python
def anti_bot_scraping():
    target_url = "https://www.google.com"  # replace url with anti-bot protected website
    scraper = cfscrape.create_scraper()
    html_text = scraper.get(target_url).text
    parsed_html = BeautifulSoup(html_text, 'html.parser')
    print(parsed_html)


if __name__ == '__main__':
    anti_bot_scraping()
```
Now, open a terminal and run python app.py again to execute the file.
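Printing the whole page is rarely the end goal. Once html_text has been fetched (by cfscrape or any other HTTP client), you query it with BeautifulSoup exactly as in the basic example. Here is a sketch that uses a hardcoded page in place of a live fetch; the page content below is made up purely for illustration:

```python
from bs4 import BeautifulSoup

# Stand-in for the html_text returned by scraper.get(target_url).text
html_text = '''
<html>
  <head><title>Example Shop</title></head>
  <body>
    <a href="/product/1">First product</a>
    <a href="/product/2">Second product</a>
  </body>
</html>
'''

parsed_html = BeautifulSoup(html_text, 'html.parser')

# Extract the page title instead of dumping the whole document
print(parsed_html.title.text)  # Example Shop

# List every link target on the page
for link in parsed_html.find_all("a"):
    print(link.get("href"))
```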
Note: Please don't misuse this knowledge. I'm sharing it for learning and fun purposes only.
Enjoy coding!
Congratulations, and thank you!
Feel free to comment if you have any issues or questions.