What is a Web Crawler?
A web crawler is an internet bot used for web indexing on the World Wide Web. Search engines of all kinds use web crawlers to deliver results efficiently. A crawler collects all, or a chosen subset, of the hyperlinks and HTML content from other websites and presents them in a suitable form. When the number of links to crawl is huge, even the largest crawler cannot cover everything. This is one reason search engines in the early 2000s were poor at returning relevant results; the process has improved a great deal since then, and relevant results now arrive almost instantly.
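The link-following idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PAGES dictionary is a made-up, in-memory stand-in for the web, and the names LinkCollector, crawl and max_pages are hypothetical. It uses only the standard library, and the page limit reflects the point that a crawler cannot follow an unbounded number of links.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website": URL -> HTML. A real crawler would
# download each page over HTTP instead of reading from this dict.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.test/b": '<a href="/">Home</a>',
    "http://example.test/c": "<p>No links here.</p>",
}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_pages=10):
    """Breadth-first crawl from start_url, visiting at most max_pages pages."""
    seen = {start_url}            # URLs already queued, to avoid loops
    frontier = deque([start_url])  # URLs waiting to be visited
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = PAGES.get(url)
        if html is None:  # page not available: skip it
            continue
        visited.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


print(crawl("http://example.test/"))
```

The seen set keeps the crawler from revisiting a page when two pages link to each other, and max_pages caps the work when the link graph is too large to cover.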
Python Web Crawler
The web crawler here is written in Python 3. Python is a high-level programming language that supports object-oriented, imperative, and functional programming and ships with a large standard library.
For the web crawler two third-party libraries are used: requests and BeautifulSoup4. Neither is part of the standard library, so both must be installed separately. requests provides an easy way to connect to the World Wide Web, and BeautifulSoup4 parses the downloaded HTML so that specific tags and attributes can be picked out.
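As a small illustration of what BeautifulSoup4 contributes, the sketch below parses an HTML string and reads values out of it. The HTML snippet is invented for the example and no network access is involved; in the crawler itself the text would come from requests.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page.
html = """
<html><head><title>Demo shop</title></head>
<body>
  <a href="/phones/1" title="Phone One">Phone One</a>
  <a href="/phones/2" title="Phone Two">Phone Two</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                       # text inside <title>
hrefs = [a.get("href") for a in soup.find_all("a")]  # href of every <a>

print(page_title)
print(hrefs)
```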
Example Code
    import requests
    from bs4 import BeautifulSoup


    def web(page, WebUrl):
        if page > 0:
            url = WebUrl
            # Fetch the page and take its HTML as plain text
            code = requests.get(url)
            plain = code.text
            # Parse the HTML into a BeautifulSoup object
            s = BeautifulSoup(plain, "html.parser")
            # Select the links that carry the product-title class
            for link in s.find_all('a', {'class': 's-access-detail-page'}):
                tet = link.get('title')   # product heading
                print(tet)
                tet_2 = link.get('href')  # link to the product page
                print(tet_2)


    web(1, 'http://www.amazon.in/s/ref=s9_acss_bw_cts_VodooFS_T4_w?rh=i%3Aelectronics%2Cn%3A976419031%2Cn%3A%21976420031%2Cn%3A1389401031%2Cn%3A1389432031%2Cn%3A1805560031%2Cp_98%3A10440597031%2Cp_36%3A1500000-99999999&bbn=1805560031&rw_html_to_wsrp=1&pf_rd_m=A1K21FY43GMZF8&pf_rd_s=merchandised-search-3&pf_rd_r=2EKZMFFDEXJ5HE8RVV6E&pf_rd_t=101&pf_rd_p=c92c2f88-469b-4b56-936e-0e65f92eebac&pf_rd_i=1389432031')
Output:
    C:\Python34\python.exe C:/Users/Babuya/PycharmProjects/Youtube/web_cr.py
    Apple iPhone 6 (Gold, 32GB)
    http://www.amazon.in/Apple-iPhone-6-Gold-32GB/dp/B0725RBY9V
    OnePlus 5 (Slate Gray 6GB RAM + 64GB memory)
    http://www.amazon.in/OnePlus-Slate-Gray-64GB-memory/dp/B01NAKTR2H
    OnePlus 5 (Midnight Black 8GB RAM + 128GB memory)
    http://www.amazon.in/OnePlus-Midnight-Black-128GB-memory/dp/B01MXZW51M
    Apple iPhone 6 (Space Grey, 32GB)
    http://www.amazon.in/Apple-iPhone-Space-Grey-32GB/dp/B01NCN4ICO
    OnePlus 5 (Soft Gold, 6GB RAM + 64GB memory)
    http://www.amazon.in/OnePlus-Soft-Gold-64GB-memory/dp/B01N1TYZR2
    Mi Max 2 (Black, 64 GB)
    http://www.amazon.in/Mi-Max-Black-64-GB/dp/B073VLGL5Y
    Moto G5 Plus (32GB, Fine Gold)
    http://www.amazon.in/Moto-Plus-32GB-Fine-Gold/dp/B071ZZ8N5Y
    Apple iPhone SE (Space Grey, 32GB)
    http://www.amazon.in/Apple-iPhone-SE-Space-Grey/dp/B071DF166C
    Honor 8 Pro (Blue, 6GB RAM + 128GB Memory)
    http://www.amazon.in/Honor-Pro-Blue-128GB-Memory/dp/B01N4FMUFH
    Apple iPhone 7 (Black, 32GB)
    http://www.amazon.in/Apple-iPhone-7-Black-32GB/dp/B01LZKSVRB
    BlackBerry KEYone (LIMITED EDITION BLACK)
    http://www.amazon.in/BlackBerry-KEYone-LIMITED-EDITION-BLACK/dp/B073ZLLVQ9
    Apple iPhone SE (Gold, 32GB)
    http://www.amazon.in/Apple-iPhone-SE-Gold-32GB/dp/B071RC52N6
    Apple iPhone SE (Rose Gold, 32GB)
    http://www.amazon.in/Apple-iPhone-SE-Rose-Gold/dp/B06ZXWWD6R
    Apple iPhone 6s (Space Grey, 32GB)
    http://www.amazon.in/Apple-iPhone-Space-Grey-32GB/dp/B01LX3A7CC
    Samsung Galaxy J7 Max (Gold, 32GB)
    http://www.amazon.in/Samsung-Galaxy-J7-Max-Gold/dp/B073PWKTRS
    Honor 8 Pro (Black, 6GB RAM + 128GB Memory)
    http://www.amazon.in/Honor-Pro-Black-128GB-Memory/dp/B01MQXNY1L
    Samsung Galaxy J7 Max (Black, 32GB)
    http://www.amazon.in/Samsung-Galaxy-J7-Max-Black/dp/B073PWDMHD
    OnePlus 3T (Soft Gold, 6GB RAM + 64GB memory)
    http://www.amazon.in/OnePlus-3T-Soft-Gold-memory/dp/B01FM7J3NA
    Apple iPhone 6s (Gold, 32GB)
    http://www.amazon.in/Apple-iPhone-6s-Gold-32GB/dp/B01M0CJNVL
    Apple iPhone 6s (Rose Gold, 32GB)
    http://www.amazon.in/Apple-iPhone-Rose-Gold-32GB/dp/B01LXF3SP9
    Samsung Galaxy C7 Pro (Navy Blue, 64GB)
    http://www.amazon.in/Samsung-Galaxy-Navy-Blue-64GB/dp/B01LXMHNMQ
    Samsung J7 Prime 32GB ( Gold ) 4G VoLTE
    http://www.amazon.in/Samsung-J7-Prime-32GB-VoLTE/dp/B06Y3HFZBQ
    Vivo V5s (Matte Black) with Offers
    http://www.amazon.in/Vivo-V5s-Matte-Black-Offers/dp/B071P2FNF2
    Vivo V5s (Crown Gold) with Offers
    http://www.amazon.in/Vivo-V5s-Crown-Gold-Offers/dp/B071VT6RG2
This crawler collects the product headings and the links to the corresponding product pages from one page of amazon.in. The user only needs to specify what kind of data or links to crawl. Although web crawlers are mainly used in search engines, they can also be used this way to collect useful information.
First, requests fetches the HTML of the page as plain text. That text is then converted into a BeautifulSoup object. From that object, the title and href attributes of every link with the class s-access-detail-page are read. That is how this basic web crawler works.
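The same fetch, parse and extract steps can be split into reusable pieces so that the class to look for becomes a parameter. The following is a sketch under stated assumptions, not the original script: the function names fetch_html and extract_links, the timeout value, the sample snippet and the shop.example address are all hypothetical. Only the class name s-access-detail-page comes from the example above.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_links(html, css_class, base_url=""):
    """Return (title, absolute_url) for every <a> tag with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a", class_=css_class):
        href = link.get("href")
        if href is None:  # skip anchors that have no target
            continue
        results.append((link.get("title"), urljoin(base_url, href)))
    return results


# Offline demonstration on a hypothetical snippet, so no network is needed.
sample = """
<a class="s-access-detail-page" title="Phone One" href="/Phone-One/dp/X1">..</a>
<a class="other" title="Ad" href="/ad">..</a>
<a class="s-access-detail-page" title="Phone Two" href="/Phone-Two/dp/X2">..</a>
"""
for title, url in extract_links(sample, "s-access-detail-page", "http://shop.example/"):
    print(title, url)
```

Against a live page the two functions would be chained as extract_links(fetch_html(url), "s-access-detail-page", url). Class names are specific to a site's markup, so the value passed in has to match whatever the target page uses.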