Have you ever needed up-to-date job listings but struggled to find one clean source? In this project, I built a full ETL (Extract, Transform, Load) pipeline using Python to scrape internship job data from websites, clean and process it, and load it into an Excel file for analysis or tracking.
This project is perfect for anyone interested in web scraping, data pipelines, or automating jobs and tasks with Python and cron!
The project’s GitHub link is: GitHub
What the Project Entails
The pipeline performs three main tasks:

1. Extract
   - Scrapes internship listings from MyJobMag Kenya
   - Applies filters like:
     - q=Internship
     - location=Nairobi
     - &currentpage={page_number}

2. Transform
   - Cleans the data using BeautifulSoup for HTML parsing and pandas for data manipulation
   - Tasks performed:
     - Removes malformed descriptions
     - Drops duplicate entries
     - Filters out rows with missing or invalid data

3. Load
   - Saves the cleaned data into an Excel file: internships.xlsx
   - Useful for:
     - Job search organization
     - Data analysis
     - Dashboards and reporting
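Those filters end up as query-string parameters on the listing URL. As a quick illustration, `urllib.parse.urlencode` can assemble them (the base path here is a placeholder I made up, not necessarily the site's real search endpoint):

```python
from urllib.parse import urlencode

# hypothetical base path; the real search endpoint may differ
base = 'https://www.myjobmag.co.ke/search/jobs'
params = {'q': 'Internship', 'location': 'Nairobi', 'currentpage': 2}

url = f'{base}?{urlencode(params)}'
print(url)  # → https://www.myjobmag.co.ke/search/jobs?q=Internship&location=Nairobi&currentpage=2
```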
Step-by-Step Explanation.
a. Extract.
Use requests to make GET requests to the website's server and beautifulsoup4 to scrape content from the pages. Also, use python-dotenv to keep environment variables like the website's URL out of the source code.
To install these packages, run the following command in your terminal:

```shell
pip install beautifulsoup4 requests python-dotenv
```
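To illustrate the dotenv step: a .env file in the project root holds the URL, and the script reads it at startup. This is a sketch with a placeholder URL; the try/except just keeps the snippet runnable even if python-dotenv isn't installed.

```python
import os

# a .env file in the project root would contain a line like:
# URL=https://www.myjobmag.co.ke/search/jobs?q=Internship&location=Nairobi
try:
    from dotenv import load_dotenv  # from the python-dotenv package
    load_dotenv()                   # copies key=value pairs from .env into os.environ
except ImportError:
    pass                            # fall back to whatever is already in the environment

URL = os.getenv('URL', 'https://example.com/jobs')  # default is a placeholder
print(URL)
```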
Build a script that extracts the job’s title, description, posted date, and link to follow when applying to the internship/job.
Import the necessary libraries required and initialize empty lists to contain the various column data.
```python
import html
import json
import os
import time

import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv

load_dotenv()              # read the .env file
URL = os.getenv('URL')     # the base listing URL, kept out of source control

internships = []
titles = []
descriptions = []
opened_dates = []
links = []
```
The request gets the content from the server by looping through the pages and passing it into the variable soup. Then, loop through the listings found by soup's find_all(), which finds HTML elements by their tag and class.
```python
def extract():
    for i in range(1, 6):
        url = f'{URL}&currentpage={i}'
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            listings = soup.find_all('li', class_='job-list-li')
            for listing in listings:
                title = listing.find('h2')
                title_text = clean_text(title.get_text()) if title else 'N/A'
                titles.append(title_text)

                a_tag = listing.find('a')
                link = a_tag['href'] if a_tag and a_tag.has_attr('href') else 'N/A'
                job_links = f'https://myjobmag.co.ke{link}' if a_tag and a_tag.has_attr('href') else 'N/A'
                links.append(job_links)

                description = listing.find('li', class_='job-desc')
                description_text = clean_text(description.get_text()) if description else 'N/A'
                descriptions.append(description_text)

                opened_at = listing.find('li', id='job-date')
                opened_text = opened_at.get_text(strip=True) if opened_at else 'N/A'
                opened_dates.append(opened_text)
        time.sleep(response.elapsed.total_seconds())
```
HINT: Calling time.sleep(response.elapsed.total_seconds()) between requests is good practice: it waits roughly as long as the server took to respond before sending the next request. This is respectful to the server and avoids flooding it with too many requests at once.
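A small refinement on that hint (my own variation, not in the original script) is to clamp the delay between a floor and a ceiling, so a near-instant response still yields a pause and a very slow response doesn't stall the scraper:

```python
import time

def polite_delay(elapsed_seconds, floor=1.0, ceiling=10.0):
    # wait roughly as long as the server took, but never less than
    # `floor` seconds and never more than `ceiling` seconds
    delay = min(max(elapsed_seconds, floor), ceiling)
    time.sleep(delay)
    return delay
```

In extract(), the call would then become polite_delay(response.elapsed.total_seconds()).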
To make the scraped data easier to clean later, first dump it into a JSON file using the json module.
```python
def load_data(data):
    with open('data.json', 'w') as f:
        json.dump(data, f, indent=4)
    print('Data loaded into JSON successfully')
```
To run this script:
```python
if __name__ == '__main__':
    extract()
    load_data(internships)
```
HINT: The if __name__ == '__main__' pattern lets a script be imported by other modules without its top-level code running, which makes the functions reusable and easier to test elsewhere.
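For instance, with the guard in place, a hypothetical greet.py only prints when executed directly; `import greet` from another script defines the function without running it:

```python
# greet.py (hypothetical filename)
def greet():
    return 'Happy hacking'

if __name__ == '__main__':
    # executes for `python greet.py`, but not for `import greet`
    print(greet())
```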
b. Transform/Clean.
The content collected from the website had some unwanted characters, like Unicode escape sequences such as \u2023.
Create a function clean_text that strips these characters from the content; see how it is called on the title and description inside extract().
```python
def clean_text(text):
    if not text:
        return 'N/A'
    text = html.unescape(text)
    text = (text.replace('\r', '')
                .replace('\n', ' ')
                .replace('\t', ' ')
                .replace('\u00a0', ' ')
                .replace('\u2019', ' ')
                .replace('\u2023', ' ')
                .replace('\u2013', ' '))
    text = ' '.join(text.split())
    return text.strip()
```
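As a quick check of clean_text on a made-up messy string (the function is reproduced here so the snippet runs on its own):

```python
import html

# clean_text as defined above, reproduced for a standalone check
def clean_text(text):
    if not text:
        return 'N/A'
    text = html.unescape(text)
    text = (text.replace('\r', '').replace('\n', ' ').replace('\t', ' ')
                .replace('\u00a0', ' ').replace('\u2019', ' ')
                .replace('\u2023', ' ').replace('\u2013', ' '))
    return ' '.join(text.split()).strip()

messy = 'Apply&nbsp;now \u2023 Internship \u2013 Nairobi\r\n\tDeadline: soon'
print(clean_text(messy))  # → Apply now Internship Nairobi Deadline: soon
```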
Use pandas and numpy to build a script that cleans this data and loads it into an Excel sheet.
Load the data from JSON into a pandas DataFrame:
```python
import pandas as pd
import numpy as np

df = pd.read_json('data.json')
```
When printing df, each column's data is wrapped in a Python list, which is not what we want. We need to explode the columns into rows and reset the index.
```python
df = df.explode(['title', 'description', 'opened_on', 'link']).reset_index(drop=True)
```
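To see what explode does, here is a toy DataFrame (made-up values, not scraper output) where each cell holds a list; explode turns each list element into its own row. Note that exploding multiple columns at once requires pandas 1.3 or newer.

```python
import pandas as pd

df = pd.DataFrame({'title': [['Data Intern', 'QA Intern']],
                   'link': [['/job/1', '/job/2']]})
df = df.explode(['title', 'link']).reset_index(drop=True)
# two rows: ('Data Intern', '/job/1') and ('QA Intern', '/job/2')
print(df)
```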
The columns will be exploded, but the first row starts at index 0. To make the first row start at index 1:
```python
df.index = df.index + 1
```
The data contains N/A values in some rows, which means those rows hold no relevant information, so we need to drop them. Replace the N/A strings with np.nan values, which pandas can drop easily:
```python
df = df.replace('N/A', np.nan)
df.dropna(inplace=True)
```
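The same two steps on a toy DataFrame (made-up rows): the string 'N/A' becomes NaN, and dropna removes any row containing one:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'title': ['Data Intern', 'N/A'],
                   'link': ['/job/1', 'N/A']})
df = df.replace('N/A', np.nan)
df.dropna(inplace=True)
print(len(df))  # → 1, only the complete row survives
```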
Now, the data is clean and ready to be loaded into an Excel sheet.
c. Load.
After transforming the data, load it into an Excel sheet for better visualization and interaction with the end user.
Before loading, install openpyxl, the Python package pandas uses to write DataFrames to Excel files. To install:
```shell
pip install openpyxl
```
After installation, add the following line to add the data into an Excel sheet:
```python
df.to_excel('internships.xlsx')
```
End result.
As a student and job-seeker, I wanted to automate the process of checking for new internship listings. Building this ETL pipeline allowed me to:
- Learn scripting
- Build ETL pipelines using Python
- Clean and manipulate data using pandas and numpy
- Automate manual tracking tasks
Next, I'll schedule the scraper to run on weekdays at 9:30 AM EAT (East African Time) and add a script that emails me when fresh data is available in Excel.
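That weekday 9:30 AM schedule maps onto a crontab entry like the one below (the script path is a placeholder, and cron uses the machine's local time zone, so the server must be set to EAT or the hour adjusted accordingly):

```
30 9 * * 1-5 /usr/bin/python3 /path/to/etl_pipeline.py
```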
ALERT: Some websites do not allow scraping, so it is wise to read the company's Terms and Conditions or policy first to check whether they object to it. If scraping is prohibited, look for an official API to fetch the data instead; this protects you from any legal measures being taken against you.
Happy hacking <3