Web Scraping Vs Web Crawling

Web Scraping or Web Crawling

Crawling and scraping, in essence "search" and "gather", refer to the acquisition of important website data by the use of automated bots. Web scraping is commonly used to track and analyze data and compare it against its former self; examples include market data, finance, e-commerce and retail. Now you may ask: what exactly does it mean to crawl a website, and what does it mean to scrape one?

| How are they related to each other?

Suppose you have a Gmail account with no storage left (which I hope you don't) and you wish to retrieve one important file. What would you do? You would start to go through each file, Stalin-sorting them until you get to the right one. This exact combined action of separating and acquiring the important data translates directly to a webpage, and is what we call crawling and gathering.

The Good, the Bad and the Wayback Machine

Established in 1996 by Brewster Kahle and Bruce Gilliat, the Wayback Machine, aka the Internet Archive, is the warehouse of digital content that has stood the test of time. It allows users to access archived versions of websites, even letting you navigate a site as it looked back at its establishment. It works by sending automated web crawlers to various publicly available websites and taking snapshots. It can be easily accessed and used by all, at https://wayback-api.archive.org/
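You can also query the archive programmatically. Here is a minimal sketch in Python against the Archive's public Availability API at https://archive.org/wayback/available (the closest_snapshot helper name and the default timestamp below are my own illustrative choices, not part of the original post):

import requests

# Ask the Wayback Machine's Availability API for the snapshot closest
# to a given timestamp (YYYYMMDD). Returns the snapshot URL, or None
# if the page has never been archived.
def closest_snapshot(url, timestamp="20200101"):
    api = "https://archive.org/wayback/available"
    response = requests.get(api, params={"url": url, "timestamp": timestamp})
    data = response.json()
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot else None

print(closest_snapshot("example.com"))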

What it can’t store

"With large data comes big storage bills." With an endless pile of information arriving on its doorstep, the Archive's storage capabilities have grown tenfold. As of January 2024 it stores around 99 petabytes, and is expected to grow by about 100 terabytes per month. Even so, the Internet Archive is unable to store the following:

  • Dynamic Pages
  • Emails
  • Chats
  • Databases
  • Classified Military Content (Obviously)

“Talk is Cheap. Show me the Code”

-Linus Torvalds

Creating your own time capsule is very easy: set up a web crawler that visits a website and collects data at regular intervals. Building your own scraping bot is easily achievable using libraries like BeautifulSoup (for Python) and Cheerio (for JavaScript).

For Python Enthusiasts

| You can install the required libraries using the following pip command

pip install requests beautifulsoup4


It utilises the requests library to fetch page content and BeautifulSoup to parse the returned HTML.

| Code:

import requests
from bs4 import BeautifulSoup

# Fetch a page and return every absolute link found on it
def crawl_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    links = []
    for a_tag in soup.find_all("a", href=True):
        link = a_tag["href"]
        if link.startswith("http"):  # keep only absolute URLs
            links.append(link)
    return links

seed_url = "https://en.wikipedia.org/wiki/Ludic_fallacy"
visited_urls = []
crawl_depth = 2

# Recursively follow links up to the given depth,
# skipping pages we have already visited
def crawl(url, depth):
    if depth == 0 or url in visited_urls:
        return
    visited_urls.append(url)
    links = crawl_page(url)
    for link in links:
        crawl(link, depth - 1)

crawl(seed_url, crawl_depth)
print("Crawled URLs:", visited_urls)
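To turn this crawler into the "time capsule" mentioned above, you can wrap it in a simple scheduling loop that re-runs the crawl at regular intervals. A minimal sketch, assuming the crawl(), seed_url, visited_urls and crawl_depth definitions from the snippet above; the once-a-day interval is an arbitrary choice:

import time

# Re-run the crawl once per interval to take periodic snapshots
INTERVAL_SECONDS = 24 * 60 * 60  # once a day (illustrative)

while True:
    visited_urls.clear()  # start each pass fresh
    crawl(seed_url, crawl_depth)
    print(time.strftime("%Y-%m-%d %H:%M:%S"), "crawled", len(visited_urls), "pages")
    time.sleep(INTERVAL_SECONDS)

In a real setup you would likely persist each pass to disk (or hand scheduling to cron) rather than keep everything in memory.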


For Javascript Enthusiasts

| Prerequisites include the Axios and Cheerio libraries

npm install axios cheerio


Axios does the job of making HTTP requests to the website, while Cheerio parses the incoming HTML and lets you extract valuable information using CSS-style selectors; the extracted data can then be structured as plain objects with properties and serialized to JSON.

| Code:

const axios = require('axios');
const cheerio = require('cheerio');

const targetUrl = 'https://en.wikipedia.org/wiki/Ludic_fallacy';

async function scrapeData() {
  try {
    // Fetch the page HTML
    const response = await axios.get(targetUrl);
    const html = response.data;
    // Load the HTML into Cheerio for CSS-style querying
    const $ = cheerio.load(html);
    // Extract the page heading(s) and paragraph text
    const titles = $('h1').text().trim();
    const descriptions = $('p').text().trim();
    console.log('Titles:', titles);
    console.log('Descriptions:', descriptions);
  } catch (error) {
    console.error('Error scraping data:', error);
  }
}

scrapeData();


Make sure to be mindful of the website's terms and conditions and abide by its robots.txt to practice ethical scraping and keep yourself out of legal trouble (a quick robots.txt check is sketched below).
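Python's standard library ships a robots.txt parser, so the check costs only a few lines. A minimal sketch; the "MyCrawlerBot" user-agent string is a placeholder you would replace with your own:

from urllib import robotparser

# Check a site's robots.txt before crawling it (standard library only)
rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

allowed = rp.can_fetch("MyCrawlerBot", "https://en.wikipedia.org/wiki/Ludic_fallacy")
print("Allowed by robots.txt:", allowed)

And with that, have fun coding along the way.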
