How to Check Internal and External links on a Webpage using Python

Python is one of the most popular programming languages for extracting, processing, and analyzing data. Its built-in and third-party libraries make it easy for a developer to pull specific data from a web page and build results from it.
In this article, I cover a simple Python script that extracts the links from a given web page URL and creates a CSV file listing every link on that page, along with a column telling whether each link is internal or external.

Prerequisite

As this is a Python article built around a program, you need basic knowledge of Python, and Python must be installed on your system, to test the program for yourself.
If you are on a new system, you can install the latest version of Python from the official download page.

To build the program, I will use four Python libraries: two are third-party and the other two are built-in.

Libraries

1. requests:
requests is a popular Python HTTP library. We will use it to send an HTTP request to the URL whose links we want to check.
As requests is a third-party library, we need to install it in our Python environment using the pip command.
pip install requests

2. Beautiful Soup:
Beautiful Soup is a third-party Python library that can extract data from HTML and XML files. Generally, a web page is an HTML document, so we can use Beautiful Soup to extract the links from that page.

Use the following command to install Beautiful Soup:
pip install beautifulsoup4

3. csv
The csv module comes with Python, and we can use it to read, write, and append to .csv files.

4. datetime
datetime is also a built-in Python module, used for working with dates and times.
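
As a minimal sketch of how the two built-in modules fit together, the snippet below builds a timestamped file name with strftime() and writes a couple of rows with csv.DictWriter. The helper names makeFilename and rowsToCsv, the file name format, and the sample rows are all hypothetical, and the rows go to an in-memory buffer instead of a real file.

```python
import csv
import datetime
import io


def makeFilename(now):
    # Zero-padded timestamp so the names sort chronologically (format is an assumption).
    return now.strftime("Links-%Y-%m-%d_%H%M%S.csv")


def rowsToCsv(rows, fieldnames):
    # Write a list of dictionaries as CSV text into an in-memory buffer.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sampleRows = [
    {"Link": "/about", "Type": "Internal"},
    {"Link": "https://www.python.org", "Type": "External"},
]
print(makeFilename(datetime.datetime(2024, 1, 5, 9, 3, 7)))
print(rowsToCsv(sampleRows, ["Link", "Type"]))
```

Swapping the buffer for open(filename, 'w', newline='') would write the same rows to disk.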

Program

Now let’s use these four Python modules to write a program that finds all the internal and external links of a web page and exports that data to a .csv file.

I have divided this program into three functions to make it modular.

Function 1: requestMaker(url)

The requestMaker(url) function accepts the URL as a string and sends a GET request to it using the requests.get() method.
If the request succeeds, requestMaker() collects the HTML content of the response and its final URL from the .text and .url properties, and then calls the parseLinks(pageHtml, pageUrl) function.

# make the HTTP request to the given url
def requestMaker(url):
    try:
        # make the GET request to the url
        response = requests.get(url)

        # if the request is successful
        if response.status_code in range(200, 300):
            # extract the page html content for parsing the links
            pageHtml = response.text
            pageUrl = response.url

            # call the parseLinks function
            parseLinks(pageHtml, pageUrl)

        else:
            print(f"Sorry Could not fetch the result status code {response.status_code}!")

    # catch only request errors, so bugs in the parsing code are not hidden
    except requests.exceptions.RequestException:
        print(f"Could Not Connect to url {url}")


Function 2: parseLinks(pageHtml, pageUrl)

The parseLinks() function accepts pageHtml and pageUrl as strings and parses pageHtml into a soup object using BeautifulSoup with Python's built-in HTML parser. From the soup object, we collect a list of all the <a> tags present in the HTML page using the .find_all('a') method.
Then, inside parseLinks(), I call the extIntLinks(allLinks, pageUrl) function.

# parse all the links from the web page
def parseLinks(pageHtml, pageUrl):
    soup = BeautifulSoup(pageHtml, 'html.parser')

    # get all the <a> elements from the HTML page
    allLinks = soup.find_all('a')

    extIntLinks(allLinks, pageUrl)
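
To see what collecting the <a> tags boils down to, here is a rough standard-library sketch that gathers href values with html.parser.HTMLParser. It is not the approach used in this article, which relies on Beautiful Soup; the names HrefCollector and collectHrefs and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def collectHrefs(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


sampleHtml = (
    '<p><a href="/about">About</a> <a name="top">Top</a> '
    '<a href="https://www.python.org">Python</a></p>'
)
print(collectHrefs(sampleHtml))
```

Note that the second <a> tag in the sample has no href attribute, so it contributes nothing to the list. Anchors like that exist on real pages too, which is worth keeping in mind when reading the href of every tag.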


Function 3: extIntLinks(allLinks, pageUrl)

The extIntLinks(allLinks, pageUrl) function does the following things.

  1. Create a unique .csv file name using the datetime module.
  2. Create the unique .csv file in write mode.
  3. Loop through all the extracted <a> links
  4. Check for the internal and external links.
  5. Write the data into the csv file.

def extIntLinks(allLinks, pageUrl):
    currentTime = datetime.datetime.now()
    # create a unique .csv file name using the datetime module
    filename = f"Links-{currentTime.day}-{currentTime.month}-{currentTime.year} {currentTime.hour}{currentTime.minute}{currentTime.second}.csv"

    with open(filename, 'w', newline='') as csvfile:
        fieldnames = ['Tested Url', 'Link', 'Type']

        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        internalLinks = 0
        externalLinks = 0

        # go through all the <a> elements list
        for anchor in allLinks:
            link = anchor.get("href")  # get the link from the <a> element
            # skip <a> elements that have no href attribute
            if link is None:
                continue
            # check if the link is internal
            if link.startswith(pageUrl) or link.startswith("/") or link.startswith("#"):
                writer.writerow({'Tested Url': pageUrl, 'Link': link, 'Type': 'Internal'})
                internalLinks += 1
            # if the link is external
            else:
                writer.writerow({'Tested Url': pageUrl, 'Link': link, 'Type': 'External'})
                externalLinks += 1

        # write the totals as a final plain row
        writer = csv.writer(csvfile)
        writer.writerow(["Total Internal Links", f"{internalLinks}", "Total External Links", f"{externalLinks}"])

        print(f"The page {pageUrl} has {internalLinks} Internal Link(s) and {externalLinks} External Link(s)")
        print(f"And data has been saved in the {filename}")
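
The prefix checks used in this function are simple, but they can misclassify some links. For example, a protocol-relative link such as //cdn.example.net/app.js starts with "/" yet points to another host. One possible alternative, sketched below under the assumption that "internal" means "same host as the tested page", resolves each href with urllib.parse.urljoin and compares host names. The function name classifyLink, the extra "Other" label, and the sample URLs are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def classifyLink(link, pageUrl):
    # Resolve relative links ("/about", "#top", "post.html") against the page URL.
    absolute = urljoin(pageUrl, link)
    parsed = urlparse(absolute)
    # Schemes such as mailto: are not web pages; label them separately.
    if parsed.scheme not in ("http", "https"):
        return "Other"
    # Assumption: a link is internal when its host matches the tested page's host.
    if parsed.hostname == urlparse(pageUrl).hostname:
        return "Internal"
    return "External"


page = "https://example.com/blog/"
samples = ["/about", "#top", "post.html", "//cdn.example.net/app.js",
           "https://www.python.org", "mailto:me@example.com"]
for href in samples:
    print(href, "->", classifyLink(href, page))
```

With this definition, a subdomain such as blog.example.com counts as external to example.com; whether that is the right call depends on what you want the report to show.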


The complete Program:

Now we can put the complete program together and run it.

import requests
from bs4 import BeautifulSoup
import csv
import datetime


def extIntLinks(allLinks, pageUrl):
    currentTime = datetime.datetime.now()
    # create a unique .csv file name using the datetime module
    filename = f"Links-{currentTime.day}-{currentTime.month}-{currentTime.year} {currentTime.hour}{currentTime.minute}{currentTime.second}.csv"

    with open(filename, 'w', newline='') as csvfile:
        fieldnames = ['Tested Url', 'Link', 'Type']

        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        internalLinks = 0
        externalLinks = 0

        # go through all the <a> elements list
        for anchor in allLinks:
            link = anchor.get("href")  # get the link from the <a> element
            # skip <a> elements that have no href attribute
            if link is None:
                continue
            # check if the link is internal
            if link.startswith(pageUrl) or link.startswith("/") or link.startswith("#"):
                writer.writerow({'Tested Url': pageUrl, 'Link': link, 'Type': 'Internal'})
                internalLinks += 1
            # if the link is external
            else:
                writer.writerow({'Tested Url': pageUrl, 'Link': link, 'Type': 'External'})
                externalLinks += 1

        # write the totals as a final plain row
        writer = csv.writer(csvfile)
        writer.writerow(["Total Internal Links", f"{internalLinks}", "Total External Links", f"{externalLinks}"])

        print(f"The page {pageUrl} has {internalLinks} Internal Link(s) and {externalLinks} External Link(s)")
        print(f"And data has been saved in the {filename}")


# parse all the links from the web page
def parseLinks(pageHtml, pageUrl):
    soup = BeautifulSoup(pageHtml, 'html.parser')

    # get all the <a> elements from the HTML page
    allLinks = soup.find_all('a')

    extIntLinks(allLinks, pageUrl)


# make the HTTP request to the given url
def requestMaker(url):
    try:
        # make the GET request to the url
        response = requests.get(url)

        # if the request is successful
        if response.status_code in range(200, 300):
            # extract the page html content for parsing the links
            pageHtml = response.text
            pageUrl = response.url

            # call the parseLinks function
            parseLinks(pageHtml, pageUrl)

        else:
            print(f"Sorry Could not fetch the result status code {response.status_code}!")

    # catch only request errors, so bugs in the parsing code are not hidden
    except requests.exceptions.RequestException:
        print(f"Could Not Connect to url {url}")


if __name__ == "__main__":
    url = input("Enter the URL eg. https://example.com: ")
    requestMaker(url)



<span>if</span> <span>__name__</span> <span>==</span> <span>"</span><span>__main__</span><span>"</span><span>:</span>
    <span>url</span> <span>=</span> <span>input</span><span>(</span><span>"</span><span>Enter the URL eg. https://example.com: </span><span>"</span><span>)</span>
    <span>requestMaker</span><span>(</span><span>url</span><span>)</span>
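
The program decides whether a link is internal with a prefix check (startswith on the page URL, "/" or "#"). That check counts relative links such as post.html and protocol-relative links such as //cdn.example.org/app.js in ways you may not expect. As a minimal sketch of one alternative, not part of the original script, the links can be resolved with urllib.parse and compared by host name. The function name classify_link is hypothetical, and the sketch assumes that "internal" means "same host", so www.example.com and example.com would count as different sites.

```python
from urllib.parse import urljoin, urlparse


def classify_link(href, page_url):
    """Return 'Internal' or 'External' for an href found on page_url.

    Assumption: a link is internal when its host equals the page's host.
    """
    # Resolve relative hrefs (e.g. "post.html", "#top") against the page URL
    absolute = urljoin(page_url, href)
    link_host = urlparse(absolute).netloc.lower()
    page_host = urlparse(page_url).netloc.lower()
    return "Internal" if link_host == page_host else "External"


if __name__ == "__main__":
    page = "https://example.com/blog/"
    for href in ["/about", "#top", "post.html", "//cdn.example.org/app.js"]:
        print(href, classify_link(href, page))
```

If you try this, the call would replace the startswith condition inside extIntLinks; links without a host, such as mailto: addresses, end up as External here, which matches what the prefix check does with them.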

Output

Enter the URL eg. https://example.com: https://techgeekbuzz.com
The page https://techgeekbuzz.com has 126 Internal Link(s) and 7 External Link(s)
And data has been saved in the Links-16-7-2022 11644.csv
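
The file name in this output ends in "11644" because the hour, minute and second are written without zero padding, so 1:16:44 and 11:06:44 would produce the same name. A small sketch of one way to avoid that, assuming you are free to change the name format, is to build the name with strftime. The helper name make_filename is hypothetical.

```python
import datetime


def make_filename(moment):
    """Build a report name with zero-padded date and time fields."""
    # %Y-%m-%d_%H%M%S always yields 4+2+2 date digits and 6 time digits
    return f"Links-{moment.strftime('%Y-%m-%d_%H%M%S')}.csv"


if __name__ == "__main__":
    print(make_filename(datetime.datetime.now()))
```

With this format the names also sort chronologically in a file listing.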

The CSV File
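
The script writes a header row (Tested Url, Link, Type), one row per link, and a final summary row that starts with "Total Internal Links". If you want to check the file from Python, a minimal sketch could read it back and recount the Type column. The function name count_link_types is hypothetical, and the sketch assumes the file has exactly the layout the script above produces.

```python
import csv
from collections import Counter


def count_link_types(csv_path):
    """Count the values in the 'Type' column of a report file."""
    counts = Counter()
    with open(csv_path, newline="") as csvfile:
        reader = csv.reader(csvfile)
        next(reader, None)  # skip the header row
        for row in reader:
            # skip blank or short rows and the trailing summary row
            if len(row) < 3 or row[0] == "Total Internal Links":
                continue
            counts[row[2]] += 1
    return counts
```

Calling count_link_types with the name of a generated file returns a Counter, so counts["Internal"] and counts["External"] can be compared with the totals in the summary row.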

You can also download this code from my GitHub.

HAPPY CODING!!
