Python is one of the most popular programming languages for extracting, processing, and analyzing data. Its built-in and third-party libraries make it easy for a developer to pull specific data from a web page and build results from it.
In this article, I cover a simple Python script that extracts the links from a given web page URL and creates a CSV file listing every link on that page, along with a column saying whether each link is internal or external.
Prerequisite
As this is a Python article built around a program, you need basic knowledge of Python, and Python installed on your system, to test the program for yourself.
If you are on a new system, you can easily install the latest version of Python with this quick download link.
To make the program, I will use four Python libraries: two are third-party libraries, and the other two are built-in.
Libraries
1. requests:
requests is a popular Python HTTP library. We will use it to make an HTTP request to the URL whose links we want to check.
As requests is a third-party library, we need to install it into our Python environment using the pip command.
pip install requests
2. Beautiful Soup:
Beautiful Soup is a third-party Python library that can extract data from HTML and XML files. Generally, a web page is an HTML document, and we can use Beautiful Soup to extract the links from that web page.
Use the following command to install Beautiful Soup:
pip install beautifulsoup4
3. csv
The csv module comes with Python, and we can use it to read, write, and append to .csv files.
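As a rough illustration of writing and reading with the csv module, here is a minimal sketch. The sample rows are made up for the example, and an in-memory buffer stands in for a real file; the column names simply mirror the ones used later in this article.

```python
import csv
import io

# Hypothetical sample rows; the column names mirror the ones used later in this article.
rows = [
    {"Tested Url": "https://example.com", "Link": "/about", "Type": "Internal"},
    {"Tested Url": "https://example.com", "Link": "https://python.org", "Type": "External"},
]

# An in-memory buffer stands in for a real file opened with open(..., newline="").
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["Tested Url", "Link", "Type"])
writer.writeheader()
writer.writerows(rows)

# Rewind and read the same data back as dictionaries keyed by the header row.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[0]["Link"])  # /about
print(len(read_back))        # 2
```

With a real file, the only change would be replacing the buffer with a file object opened in write or read mode.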
4. datetime
datetime is also a built-in Python module; it deals with dates and times.
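One common use of datetime in a script like this is building a timestamped file name. The sketch below is only an illustration: make_filename is a hypothetical helper, and the name format is an assumption that differs from the one the program uses later.

```python
import datetime

def make_filename(now: datetime.datetime) -> str:
    """Build a timestamped CSV file name (hypothetical helper, not from the article)."""
    # strftime zero-pads every field, so the resulting names sort chronologically.
    return now.strftime("Links-%Y-%m-%d_%H%M%S.csv")

# A fixed timestamp keeps the result predictable; in practice pass datetime.datetime.now().
print(make_filename(datetime.datetime(2024, 3, 5, 9, 7, 2)))
# Links-2024-03-05_090702.csv
```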
Program
Now let’s use all four of these Python modules to write a program that finds all the internal and external links of a web page and exports that data into a .csv file.
I have divided this program into three functions to make it modular.
Function 1: requestMaker(url)
The requestMaker(url) function accepts the URL as a string and sends a GET request to it using the requests.get() method.
After making the request, inside the requestMaker() function, I collect the HTML content of the response page and its final URL using the .text and .url properties, and then call the parseLinks(pageHtml, pageUrl) function.
import requests

# make the HTTP request to the given url
def requestMaker(url):
    try:
        # make the GET request to the url
        response = requests.get(url)
        # if the request is successful
        if response.status_code in range(200, 300):
            # extract the page HTML content for parsing the links
            pageHtml = response.text
            pageUrl = response.url
            # call the parseLinks function
            parseLinks(pageHtml, pageUrl)
        else:
            print(f"Sorry, could not fetch the result. Status code: {response.status_code}!")
    # catch only request-related errors, so other bugs are not reported as connection failures
    except requests.exceptions.RequestException:
        print(f"Could not connect to url {url}")
Function 2: parseLinks(pageHtml, pageUrl)
The parseLinks() function accepts pageHtml and pageUrl as strings and parses the pageHtml string with BeautifulSoup, using the HTML parser, into a soup object. From the soup object we collect a list of all the <a> tags present in the HTML page using the .find_all('a') method.
Then, inside the parseLinks() function, I call the extIntLinks(allLinks, pageUrl) function.
from bs4 import BeautifulSoup

# parse all the links from the web page
def parseLinks(pageHtml, pageUrl):
    soup = BeautifulSoup(pageHtml, 'html.parser')
    # get all the <a> elements from the HTML page
    allLinks = soup.find_all('a')
    extIntLinks(allLinks, pageUrl)
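To see what collecting the <a> tags amounts to, here is a small sketch that uses only the standard library html.parser module instead of Beautiful Soup. It is a swapped-in illustration, not the method this program uses; AnchorCollector and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a missing href yields None here.
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))

sample_html = (
    '<p><a href="/about">About</a> '
    '<a href="https://python.org">Python</a> '
    '<a name="top">Top</a></p>'
)
collector = AnchorCollector()
collector.feed(sample_html)
print(collector.hrefs)  # ['/about', 'https://python.org', None]
```

The last anchor has no href attribute, so its value is None. The same can happen with anchor.get("href") in Beautiful Soup, which is why code that processes the links should be ready for a missing href.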
Function 3: extIntLinks(allLinks, pageUrl)
The extIntLinks(allLinks, pageUrl) function does the following things.
- Create a unique .csv file name using the datetime module.
- Create the unique .csv file in write mode.
- Loop through all the extracted <a> links.
- Check for the internal and external links.
- Write the data into the csv file.
import csv
import datetime

def extIntLinks(allLinks, pageUrl):
    currentTime = datetime.datetime.now()
    # create a unique .csv file name using the datetime module
    filename = f"Links-{currentTime.day}-{currentTime.month}-{currentTime.year} {currentTime.hour}{currentTime.minute}{currentTime.second}.csv"
    with open(filename, 'w', newline='') as csvfile:
        fieldnames = ['Tested Url', 'Link', 'Type']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        internalLinks = 0
        externalLinks = 0
        # go through the list of all the <a> elements
        for anchor in allLinks:
            link = anchor.get("href")  # get the link from the <a> element
            # skip <a> elements that have no href attribute
            if link is None:
                continue
            # check if the link is internal
            if link.startswith(pageUrl) or link.startswith("/") or link.startswith("#"):
                writer.writerow({'Tested Url': pageUrl, 'Link': link, 'Type': 'Internal'})
                internalLinks += 1
            # if the link is external
            else:
                writer.writerow({'Tested Url': pageUrl, 'Link': link, 'Type': 'External'})
                externalLinks += 1
        # write the totals as a final plain row
        writer = csv.writer(csvfile)
        writer.writerow(["Total Internal Links", f"{internalLinks}", "Total External Links", f"{externalLinks}"])
    print(f"The page {pageUrl} has {internalLinks} Internal Link(s) and {externalLinks} External Link(s)")
    print(f"And data has been saved in the {filename}")
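The prefix check in extIntLinks() is deliberately simple. As a possible refinement, the sketch below resolves each href against the page URL and compares host names using the standard library urllib.parse module. classify_link is a hypothetical helper, not part of the program in this article, and treating "same host" as "internal" is an assumption.

```python
from urllib.parse import urljoin, urlparse

def classify_link(href, page_url):
    """Return 'Internal' or 'External' for an href (hypothetical helper)."""
    # Resolve relative links such as "/about", "contact.html" or "#top" against the page.
    absolute = urljoin(page_url, href)
    # Compare host names; a link that resolves to the same host counts as internal.
    same_host = urlparse(absolute).netloc == urlparse(page_url).netloc
    return "Internal" if same_host else "External"

page = "https://example.com/blog/"
print(classify_link("/about", page))                    # Internal
print(classify_link("contact.html", page))              # Internal
print(classify_link("#top", page))                      # Internal
print(classify_link("https://example.com/docs", page))  # Internal
print(classify_link("https://python.org", page))        # External
```

Unlike the prefix check, this treats relative paths without a leading slash as internal. Links with schemes such as mailto: have no host, so this sketch classes them as external.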
The complete Program:
Now we can put the complete program together and run it.
<span>import</span> <span>requests</span><span>from</span> <span>bs4</span> <span>import</span> <span>BeautifulSoup</span><span>import</span> <span>csv</span><span>import</span> <span>datetime</span><span>def</span> <span>extIntLinks</span><span>(</span><span>allLinks</span><span>,</span> <span>pageUrl</span><span>):</span><span>#filename </span> <span>currentTime</span> <span>=</span> <span>datetime</span><span>.</span><span>datetime</span><span>.</span><span>now</span><span>()</span><span>#create a unique .csv file name using the datetime module </span> <span>filename</span> <span>=</span> <span>f</span><span>"</span><span>Links-</span><span>{</span><span>currentTime</span><span>.</span><span>day</span><span>}</span><span>-</span><span>{</span><span>currentTime</span><span>.</span><span>month</span><span>}</span><span>-</span><span>{</span><span>currentTime</span><span>.</span><span>year</span><span>}</span><span> </span><span>{</span><span>currentTime</span><span>.</span><span>hour</span><span>}{</span><span>currentTime</span><span>.</span><span>minute</span><span>}{</span><span>currentTime</span><span>.</span><span>second</span><span>}</span><span>.csv</span><span>"</span><span>with</span> <span>open</span><span>(</span><span>filename</span><span>,</span> <span>'</span><span>w</span><span>'</span><span>,</span> <span>newline</span><span>=</span><span>''</span><span>)</span> <span>as</span> <span>csvfile</span><span>:</span><span>fieldnames</span> <span>=</span> <span>[</span><span>'</span><span>Tested Url</span><span>'</span><span>,</span><span>'</span><span>Link</span><span>'</span><span>,</span> <span>'</span><span>Type</span><span>'</span><span>]</span><span>writer</span> <span>=</span> <span>csv</span><span>.</span><span>DictWriter</span><span>(</span><span>csvfile</span><span>,</span> <span>fieldnames</span><span>=</span><span>fieldnames</span><span>)</span><span>writer</span><span>.</span><span>writeheader</span><span>()</span><span>internalLinks</span> 
<span>=</span> <span>0</span><span>externalLinks</span> <span>=</span> <span>0</span><span>#go through all the <a> elements list </span> <span>for</span> <span>anchor</span> <span>in</span> <span>allLinks</span><span>:</span><span>link</span> <span>=</span> <span>anchor</span><span>.</span><span>get</span><span>(</span><span>"</span><span>href</span><span>"</span><span>)</span> <span>#get the link from the <a> element </span><span>#check if the link is internal </span> <span>if</span> <span>link</span><span>.</span><span>startswith</span><span>(</span><span>pageUrl</span><span>)</span> <span>or</span> <span>link</span><span>.</span><span>startswith</span><span>(</span><span>"</span><span>/</span><span>"</span><span>)</span> <span>or</span> <span>link</span><span>.</span><span>startswith</span><span>(</span><span>"</span><span>#</span><span>"</span><span>)</span> <span>:</span><span>writer</span><span>.</span><span>writerow</span><span>({</span><span>'</span><span>Tested Url</span><span>'</span><span>:</span><span>pageUrl</span><span>,</span><span>'</span><span>Link</span><span>'</span><span>:</span> <span>link</span><span>,</span> <span>'</span><span>Type</span><span>'</span><span>:</span> <span>'</span><span>Internal</span><span>'</span><span>})</span><span>internalLinks</span><span>+=</span><span>1</span><span>#if the link is external </span> <span>else</span><span>:</span><span>writer</span><span>.</span><span>writerow</span><span>({</span><span>'</span><span>Tested Url</span><span>'</span><span>:</span><span>pageUrl</span><span>,</span><span>'</span><span>Link</span><span>'</span><span>:</span> <span>link</span><span>,</span> <span>'</span><span>Type</span><span>'</span><span>:</span> <span>'</span><span>External</span><span>'</span><span>})</span><span>externalLinks</span><span>+=</span><span>1</span><span>writer</span> <span>=</span> 
<span>csv</span><span>.</span><span>writer</span><span>(</span><span>csvfile</span><span>)</span><span>writer</span><span>.</span><span>writerow</span><span>([</span><span>"</span><span>Total Internal Links</span><span>"</span><span>,</span> <span>f</span><span>"</span><span>{</span><span>internalLinks</span><span>}</span><span>"</span><span>,</span> <span>"</span><span>Total External Links</span><span>"</span><span>,</span> <span>f</span><span>"</span><span>{</span><span>externalLinks</span><span>}</span><span>"</span><span>])</span><span>print</span><span>(</span><span>f</span><span>"</span><span>The page </span><span>{</span><span>url</span><span>}</span><span> has </span><span>{</span><span>internalLinks</span><span>}</span><span> Internal Link(s) and </span><span>{</span><span>externalLinks</span><span>}</span><span> External Link(s)</span><span>"</span><span>)</span><span>print</span><span>(</span><span>f</span><span>"</span><span>And data has been saved in the </span><span>{</span><span>filename</span><span>}</span><span>"</span><span>)</span><span>#parse all the links from the web page </span><span>def</span> <span>parseLinks</span><span>(</span><span>pageHtml</span><span>,</span> <span>pageUrl</span><span>):</span><span>soup</span> <span>=</span> <span>BeautifulSoup</span><span>(</span><span>pageHtml</span><span>,</span> <span>'</span><span>html.parser</span><span>'</span><span>)</span><span>#get all the <a> elements from the HTML page </span> <span>allLinks</span> <span>=</span> <span>soup</span><span>.</span><span>find_all</span><span>(</span><span>'</span><span>a</span><span>'</span><span>)</span><span>extIntLinks</span><span>(</span><span>allLinks</span><span>,</span> <span>pageUrl</span><span>)</span><span>#to make the HTTP request to the give url </span><span>def</span> <span>requestMaker</span><span>(</span><span>url</span><span>):</span><span>try</span><span>:</span><span>#make the get request to the url </span> <span>response</span> <span>=</span> 
<span>requests</span><span>.</span><span>get</span><span>(</span><span>url</span><span>)</span><span>#if the request is successful </span> <span>if</span> <span>response</span><span>.</span><span>status_code</span> <span>in</span> <span>range</span><span>(</span><span>200</span><span>,</span> <span>300</span><span>):</span><span>#extract the page html content for parsing the links </span> <span>pageHtml</span> <span>=</span> <span>response</span><span>.</span><span>text</span><span>pageUrl</span> <span>=</span> <span>response</span><span>.</span><span>url</span><span>#call the parseLink function </span> <span>parseLinks</span><span>(</span><span>pageHtml</span><span>,</span> <span>pageUrl</span><span>)</span><span>else</span><span>:</span><span>print</span><span>(</span><span>"</span><span>Sorry Could not fetch the result status code {response.status_code}!</span><span>"</span><span>)</span><span>except</span><span>:</span><span>print</span><span>(</span><span>f</span><span>"</span><span>Could Not Connect to url </span><span>{</span><span>url</span><span>}</span><span>"</span><span>)</span><span>if</span> <span>__name__</span> <span>==</span> <span>"</span><span>__main__</span><span>"</span><span>:</span><span>url</span> <span>=</span> <span>input</span><span>(</span><span>"</span><span>Enter the URL eg. 
https://example.com: </span><span>"</span><span>)</span><span>requestMaker</span><span>(</span><span>url</span><span>)</span><span>import</span> <span>requests</span> <span>from</span> <span>bs4</span> <span>import</span> <span>BeautifulSoup</span> <span>import</span> <span>csv</span> <span>import</span> <span>datetime</span> <span>def</span> <span>extIntLinks</span><span>(</span><span>allLinks</span><span>,</span> <span>pageUrl</span><span>):</span> <span>#filename </span> <span>currentTime</span> <span>=</span> <span>datetime</span><span>.</span><span>datetime</span><span>.</span><span>now</span><span>()</span> <span>#create a unique .csv file name using the datetime module </span> <span>filename</span> <span>=</span> <span>f</span><span>"</span><span>Links-</span><span>{</span><span>currentTime</span><span>.</span><span>day</span><span>}</span><span>-</span><span>{</span><span>currentTime</span><span>.</span><span>month</span><span>}</span><span>-</span><span>{</span><span>currentTime</span><span>.</span><span>year</span><span>}</span><span> </span><span>{</span><span>currentTime</span><span>.</span><span>hour</span><span>}{</span><span>currentTime</span><span>.</span><span>minute</span><span>}{</span><span>currentTime</span><span>.</span><span>second</span><span>}</span><span>.csv</span><span>"</span> <span>with</span> <span>open</span><span>(</span><span>filename</span><span>,</span> <span>'</span><span>w</span><span>'</span><span>,</span> <span>newline</span><span>=</span><span>''</span><span>)</span> <span>as</span> <span>csvfile</span><span>:</span> <span>fieldnames</span> <span>=</span> <span>[</span><span>'</span><span>Tested Url</span><span>'</span><span>,</span><span>'</span><span>Link</span><span>'</span><span>,</span> <span>'</span><span>Type</span><span>'</span><span>]</span> <span>writer</span> <span>=</span> <span>csv</span><span>.</span><span>DictWriter</span><span>(</span><span>csvfile</span><span>,</span> 
<span>fieldnames</span><span>=</span><span>fieldnames</span><span>)</span> <span>writer</span><span>.</span><span>writeheader</span><span>()</span> <span>internalLinks</span> <span>=</span> <span>0</span> <span>externalLinks</span> <span>=</span> <span>0</span> <span>#go through all the <a> elements list </span> <span>for</span> <span>anchor</span> <span>in</span> <span>allLinks</span><span>:</span> <span>link</span> <span>=</span> <span>anchor</span><span>.</span><span>get</span><span>(</span><span>"</span><span>href</span><span>"</span><span>)</span> <span>#get the link from the <a> element </span> <span>#check if the link is internal </span> <span>if</span> <span>link</span><span>.</span><span>startswith</span><span>(</span><span>pageUrl</span><span>)</span> <span>or</span> <span>link</span><span>.</span><span>startswith</span><span>(</span><span>"</span><span>/</span><span>"</span><span>)</span> <span>or</span> <span>link</span><span>.</span><span>startswith</span><span>(</span><span>"</span><span>#</span><span>"</span><span>)</span> <span>:</span> <span>writer</span><span>.</span><span>writerow</span><span>({</span><span>'</span><span>Tested Url</span><span>'</span><span>:</span><span>pageUrl</span><span>,</span><span>'</span><span>Link</span><span>'</span><span>:</span> <span>link</span><span>,</span> <span>'</span><span>Type</span><span>'</span><span>:</span> <span>'</span><span>Internal</span><span>'</span><span>})</span> <span>internalLinks</span><span>+=</span><span>1</span> <span>#if the link is external </span> <span>else</span><span>:</span> <span>writer</span><span>.</span><span>writerow</span><span>({</span><span>'</span><span>Tested Url</span><span>'</span><span>:</span><span>pageUrl</span><span>,</span><span>'</span><span>Link</span><span>'</span><span>:</span> <span>link</span><span>,</span> <span>'</span><span>Type</span><span>'</span><span>:</span> <span>'</span><span>External</span><span>'</span><span>})</span> 
import csv
import datetime

import requests
from bs4 import BeautifulSoup


def extIntLinks(allLinks, pageUrl):
    currentTime = datetime.datetime.now()
    # create a unique .csv file name using the datetime module
    filename = f"Links-{currentTime.day}-{currentTime.month}-{currentTime.year} {currentTime.hour}{currentTime.minute}{currentTime.second}.csv"
    with open(filename, 'w', newline='') as csvfile:
        fieldnames = ['Tested Url', 'Link', 'Type']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        internalLinks = 0
        externalLinks = 0
        # go through the list of <a> elements
        for anchor in allLinks:
            link = anchor.get("href")  # get the link from the <a> element
            # skip <a> elements that have no href attribute
            if link is None:
                continue
            # check if the link is internal
            if link.startswith(pageUrl) or link.startswith("/") or link.startswith("#"):
                writer.writerow({'Tested Url': pageUrl, 'Link': link, 'Type': 'Internal'})
                internalLinks += 1
            # otherwise the link is external
            else:
                writer.writerow({'Tested Url': pageUrl, 'Link': link, 'Type': 'External'})
                externalLinks += 1
        # add a final row with the totals
        writer = csv.writer(csvfile)
        writer.writerow(["Total Internal Links", f"{internalLinks}", "Total External Links", f"{externalLinks}"])
    print(f"The page {pageUrl} has {internalLinks} Internal Link(s) and {externalLinks} External Link(s)")
    print(f"And data has been saved in the {filename}")


# parse all the links from the web page
def parseLinks(pageHtml, pageUrl):
    soup = BeautifulSoup(pageHtml, 'html.parser')
    # get all the <a> elements from the HTML page
    allLinks = soup.find_all('a')
    extIntLinks(allLinks, pageUrl)


# make the HTTP request to the given url
def requestMaker(url):
    try:
        # make the GET request to the url
        response = requests.get(url)
    except requests.exceptions.RequestException:
        print(f"Could Not Connect to url {url}")
        return
    # if the request is successful
    if response.status_code in range(200, 300):
        # extract the page html content for parsing the links
        pageHtml = response.text
        pageUrl = response.url
        # call the parseLinks function
        parseLinks(pageHtml, pageUrl)
    else:
        print(f"Sorry Could not fetch the result status code {response.status_code}!")


if __name__ == "__main__":
    url = input("Enter the URL eg. https://example.com: ")
    requestMaker(url)
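The prefix check in the program is intentionally simple. It counts a relative link such as contact.html as external, and a protocol-relative link such as //other-site.com/page as internal. As a possible refinement, here is a minimal sketch that compares host names instead, using the built-in urllib.parse module. The function name classify_link and the decision to treat host-less links (mailto:, tel:) as external are my assumptions for illustration, not part of the original program.

```python
from urllib.parse import urljoin, urlparse


def classify_link(href, page_url):
    """Return 'Internal' or 'External' for an href found on page_url (hypothetical helper)."""
    # Resolve relative, root-relative and fragment-only links against the page URL
    absolute = urljoin(page_url, href)
    link_host = urlparse(absolute).netloc.lower()
    page_host = urlparse(page_url).netloc.lower()
    # Assumption: links without a host (mailto:, tel:) are reported as external
    if link_host and link_host == page_host:
        return "Internal"
    return "External"
```

If you want to try it, you could call classify_link(link, pageUrl) inside the loop of extIntLinks in place of the three startswith checks. Note that this sketch treats subdomains such as blog.example.com and example.com as different hosts.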
Output
Enter the URL eg. https://example.com: https://techgeekbuzz.com
The page https://techgeekbuzz.com has 126 Internal Link(s) and 7 External Link(s)
And data has been saved in the Links-16-7-2022 11644.csv
The CSV File
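To double-check the generated file without opening a spreadsheet, you could read it back with the csv module. The following is only a rough sketch: the function name count_link_types is hypothetical, and it assumes the file has the same layout the program writes (a Tested Url, Link, Type header, one row per link, and a final totals row).

```python
import csv
from collections import Counter


def count_link_types(csv_path):
    """Count 'Internal' and 'External' rows in a file written by the program (hypothetical helper)."""
    counts = Counter()
    with open(csv_path, newline="") as csvfile:
        for row in csv.DictReader(csvfile):
            # The final totals row has a different shape, so only real link rows are counted
            if row.get("Type") in ("Internal", "External"):
                counts[row["Type"]] += 1
    return counts
```

The two counts it returns should match the numbers in the last row of the file.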
You can also download this code from my GitHub.
HAPPY CODING!!
Original article: How to Check Internal and External links on a Webpage using Python