Today we are going to look at an awesome Python module. Scraping is fun if you have tried it before. Scraping and crawling are often used interchangeably, but there is a subtle difference. Web crawling is basically what Google, Facebook, etc. do: it looks for any information it can find. Scraping, on the other hand, targets specific websites for specific data, e.g. product information and prices.
Check Whether Your Development Environment Is Ready
Before moving forward, we need to check whether Python is available. Open a terminal or command line and run:
```shell
python --version
# Output: Python 2.7.16
```
Or,
```shell
python3 --version
# Output: Python 3.8.0
```
If everything looks good, you are all set. Your Python version may differ from mine, so don't worry about that. If you see "not found", install Python from here.
Setup Virtual Environment
We create a virtual environment to avoid version conflicts between Python modules, dependencies, and libraries. It ensures isolation, so each project's dependency versions can be maintained easily.
Open a terminal or command line, then create a project:
macOS Users:-
```shell
pip install virtualenv
python3 -m virtualenv venv
source venv/bin/activate
```
Windows Users:-
```shell
pip install virtualenv
virtualenv venv
venv\Scripts\activate
```
We can see that a venv folder has been created. Congratulations, the virtual environment is ready.
Install Required Libraries or Modules
Open a terminal or command line, then run the commands below:
```shell
pip install beautifulsoup4
pip install cfscrape
```
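To make this setup reproducible later (or on another machine), you can pin the installed versions to a file. The filename `requirements.txt` is the usual convention, though any name works:

```shell
# Record the exact versions currently installed in the virtual environment
pip freeze > requirements.txt

# Later, recreate the same environment from that file
pip install -r requirements.txt
```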
Learn How Basic Scraping Works
Create an app.py file containing:
```python
import cfscrape
from bs4 import BeautifulSoup


def basic():
    # Sample HTML as a string
    html_text = '''
    <div>
        <h1 class="product-name">Product Name 1</h1>
        <h1 custom-attr="price" class="product-price">100</h1>
        <p class="product description">This is basic description 1</p>
    </div>
    <div>
        <h1 class="product-name">Product Name 2</h1>
        <h1 custom-attr="price" class="product-price">200</h1>
        <p class="product description">This is basic description 2</p>
    </div>
    '''

    parsed_html = BeautifulSoup(html_text, 'html.parser')  # string to HTML
    # parsed_html = BeautifulSoup(open("from_file.html"), 'html.parser')  # file to HTML
    # Note: BeautifulSoup does not fetch URLs itself; to parse a live page,
    # download it first (e.g. with cfscrape or requests) and pass the HTML text.

    print(parsed_html.select(".product-name")[0].text)
    print(parsed_html.select(".product-name")[1].text)
    print(parsed_html.select(".product.description")[0].text)
    print(parsed_html.find_all("h1", {"custom-attr": "price"})[0].text)
    print(parsed_html.find("h1", {"custom-attr": "price"}).text)


if __name__ == '__main__':
    basic()
```
Now, open a terminal and run python app.py to execute the file.
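Indexing each result one by one gets tedious as pages grow. select() returns a list, so you can iterate over all matches at once. Here is a small sketch using the same sample markup as above (trimmed to names and prices for brevity):

```python
from bs4 import BeautifulSoup

# Same sample structure as in app.py, trimmed to names and prices
html_text = '''
<div>
  <h1 class="product-name">Product Name 1</h1>
  <h1 custom-attr="price" class="product-price">100</h1>
</div>
<div>
  <h1 class="product-name">Product Name 2</h1>
  <h1 custom-attr="price" class="product-price">200</h1>
</div>
'''

parsed_html = BeautifulSoup(html_text, 'html.parser')

# Pair each product name with its price by walking the two result lists together
for name, price in zip(parsed_html.select(".product-name"),
                       parsed_html.select(".product-price")):
    print(f"{name.text}: {price.text}")
# Prints:
# Product Name 1: 100
# Product Name 2: 200
```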
Learn Anti-Bot Scraping
In the same app.py file, add the following function (it reuses the cfscrape and BeautifulSoup imports from above):
```python
def anti_bot_scraping():
    target_url = "https://www.google.com"  # replace url with anti-bot protected website
    scraper = cfscrape.create_scraper()
    html_text = scraper.get(target_url).text
    parsed_html = BeautifulSoup(html_text, 'html.parser')
    print(parsed_html)


if __name__ == '__main__':
    anti_bot_scraping()
```
Now, open a terminal and run python app.py again to execute the file.
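Printing the whole page is rarely the end goal. Once html_text has been fetched (by cfscrape or any other HTTP client), you query it with BeautifulSoup exactly as in the basic example. Here is a sketch that uses a hardcoded page in place of a live fetch; the page content below is made up purely for illustration:

```python
from bs4 import BeautifulSoup

# Stand-in for the html_text returned by scraper.get(target_url).text
html_text = '''
<html>
  <head><title>Example Shop</title></head>
  <body>
    <a href="/product/1">First product</a>
    <a href="/product/2">Second product</a>
  </body>
</html>
'''

parsed_html = BeautifulSoup(html_text, 'html.parser')

# Extract the page title instead of dumping the whole document
print(parsed_html.title.text)  # Example Shop

# List every link target on the page
for link in parsed_html.find_all("a"):
    print(link.get("href"))
```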
Note: Please don't misuse this knowledge. I'm sharing it for learning and fun purposes only.
Enjoy coding!
Congratulations, and thank you!
Feel free to comment if you have any issues or questions.