Web Scraping Vs Web Crawling

Web Scraping or Web Crawling

Crawling and scraping, in essence "search" and "gather", refer to the acquisition of important website data by the use of automated bots. Web scraping is commonly used to track and analyze data and compare it against its former self; examples include market data, finance, e-commerce and retail. Now you may ask: what exactly does it mean to crawl a website, and what does it mean to scrape one?

| How are they related to each other?

Suppose you have a Gmail account with no storage left (which I hope you don't) and you wish to retrieve one important file. What would you do? You would start to go through each file, Stalin-sorting them until you get to the right one. This exact combined action of separating and acquiring the important data translates directly to a webpage, and is what we call crawling and gathering.

The Good, the Bad and the Wayback Machine

Established in 1996 by Brewster Kahle and Bruce Gilliat, the Wayback Machine, aka the Internet Archive, is the warehouse of digital content that has stood the test of time. It allows users to access archived versions of websites, even letting you navigate a site as it looked back at its establishment. It works by sending automated web crawlers to various publicly available websites and taking snapshots. It can be easily accessed and used by all, at https://wayback-api.archive.org/
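You can also query the archive programmatically. Here is a minimal sketch in Python against the Archive's public Availability API at https://archive.org/wayback/available (the closest_snapshot helper name and the default timestamp below are my own illustrative choices, not part of the original post):

import requests

# Ask the Wayback Machine's Availability API for the snapshot closest
# to a given timestamp (YYYYMMDD). Returns the snapshot URL, or None
# if the page has never been archived.
def closest_snapshot(url, timestamp="20200101"):
    api = "https://archive.org/wayback/available"
    response = requests.get(api, params={"url": url, "timestamp": timestamp})
    data = response.json()
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot else None

print(closest_snapshot("example.com"))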

What it can’t store

"With large data comes big storage bills." With an endless pile of information arriving on its doorstep, the Archive's storage capabilities have grown tenfold. As of January 2024 it stores around 99 petabytes, and is expected to grow by about 100 terabytes per month. Even so, the Internet Archive is unable to store the following:

  • Dynamic Pages
  • Emails
  • Chats
  • Databases
  • Classified Military Content (Obviously)

“Talk is Cheap. Show me the Code”

-Linus Torvalds

Creating your own time capsule is very easy: set up a web crawler that visits a website and collects data at regular intervals. Building your own scraping bot is easily achievable using libraries like BeautifulSoup (for Python) and Cheerio (for JavaScript).

For Python Enthusiasts

| You can install the required libraries using the following pip command

pip install requests beautifulsoup4


It utilises the requests library to fetch page content and BeautifulSoup to parse the returned HTML.

| Code:

import requests
from bs4 import BeautifulSoup

# Fetch a page and return every absolute link found on it
def crawl_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    links = []
    for a_tag in soup.find_all("a", href=True):
        link = a_tag["href"]
        if link.startswith("http"):  # keep only absolute URLs
            links.append(link)
    return links

seed_url = "https://en.wikipedia.org/wiki/Ludic_fallacy"
visited_urls = []
crawl_depth = 2

# Recursively follow links up to the given depth,
# skipping pages we have already visited
def crawl(url, depth):
    if depth == 0 or url in visited_urls:
        return
    visited_urls.append(url)
    links = crawl_page(url)
    for link in links:
        crawl(link, depth - 1)

crawl(seed_url, crawl_depth)
print("Crawled URLs:", visited_urls)
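To turn this crawler into the "time capsule" mentioned above, you can wrap it in a simple scheduling loop that re-runs the crawl at regular intervals. A minimal sketch, assuming the crawl(), seed_url, visited_urls and crawl_depth definitions from the snippet above; the once-a-day interval is an arbitrary choice:

import time

# Re-run the crawl once per interval to take periodic snapshots
INTERVAL_SECONDS = 24 * 60 * 60  # once a day (illustrative)

while True:
    visited_urls.clear()  # start each pass fresh
    crawl(seed_url, crawl_depth)
    print(time.strftime("%Y-%m-%d %H:%M:%S"), "crawled", len(visited_urls), "pages")
    time.sleep(INTERVAL_SECONDS)

In a real setup you would likely persist each pass to disk (or hand scheduling to cron) rather than keep everything in memory.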


For Javascript Enthusiasts

| Prerequisites include the Axios and Cheerio libraries

npm install axios cheerio


Axios does the job of making HTTP requests to the website, while Cheerio parses the incoming HTML and lets you extract valuable information using CSS-style selectors; the extracted data can then be structured as plain objects with properties and serialized to JSON.

| Code:

const axios = require('axios');
const cheerio = require('cheerio');

const targetUrl = 'https://en.wikipedia.org/wiki/Ludic_fallacy';

async function scrapeData() {
  try {
    // Fetch the page HTML
    const response = await axios.get(targetUrl);
    const html = response.data;
    // Load the HTML into Cheerio for CSS-style querying
    const $ = cheerio.load(html);
    // Extract the page heading(s) and paragraph text
    const titles = $('h1').text().trim();
    const descriptions = $('p').text().trim();
    console.log('Titles:', titles);
    console.log('Descriptions:', descriptions);
  } catch (error) {
    console.error('Error scraping data:', error);
  }
}

scrapeData();


Make sure to be mindful of the website's terms and conditions and abide by its robots.txt to practice ethical scraping and keep yourself out of legal trouble (a quick robots.txt check is sketched below).
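Python's standard library ships a robots.txt parser, so the check costs only a few lines. A minimal sketch; the "MyCrawlerBot" user-agent string is a placeholder you would replace with your own:

from urllib import robotparser

# Check a site's robots.txt before crawling it (standard library only)
rp = robotparser.RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

allowed = rp.can_fetch("MyCrawlerBot", "https://en.wikipedia.org/wiki/Ludic_fallacy")
print("Allowed by robots.txt:", allowed)

And with that, have fun coding along the way.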
