Company websites contain a lot of job listings that aren’t always available on popular job boards.
For example, finding remote startup jobs can be challenging, as these companies may not even be listed on the job boards.
To find these jobs you need to:
- Find promising companies
- Search for their career pages
- Analyze available job listings
- Manually record job details
Doing this manually takes a lot of time, so we are going to automate it.
Preparation
We’ll use the Parsera library to automate job scraping. Parsera provides two usage options:
- Local: Pages are processed on your machine using an LLM of your choice;
- API: All processing occurs on Parsera’s servers.
In this example we’ll go with the Local option, since this is a one-time, small-scale extraction.
To get started, install the required packages:
```
pip install parsera
playwright install
```
Since we’re running the local setup, an LLM connection is required.
For simplicity, we’ll use OpenAI’s gpt-4o-mini, which only requires setting an environment variable:
```python
import os

from parsera import Parsera

os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY_HERE>"

scraper = Parsera()  # uses gpt-4o-mini by default once the key is set
```
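Since Parsera accepts an LLM of your choice, you can also pass a LangChain chat model explicitly via the `model` parameter. A minimal sketch, assuming `langchain-openai` is installed; the model name here is just an example:

```python
from langchain_openai import ChatOpenAI
from parsera import Parsera

# Any LangChain chat model should work here; gpt-4o is only an example.
llm = ChatOpenAI(model="gpt-4o", api_key="<YOUR_OPENAI_API_KEY_HERE>")
scraper = Parsera(model=llm)
```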
With everything set up, we’re ready to start scraping.
Step 1: Getting a list of fresh Series A startups
First, we need to find a list of companies of interest and their websites.
I’ve found a list of 100 Series A startups that closed their rounds last month.
Growing companies with fresh rounds seem like a good place to look.
Let’s grab the country and website of these companies:
```python
url = "https://growthlist.co/series-a-startups/"

elements = {
    "Website": "Website of the company",
    "Country": "Country of the company",
}

all_startups = await scraper.arun(url=url, elements=elements)
```
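One note on `arun`: it’s a coroutine, so the snippets in this post assume an environment with a running event loop, such as a Jupyter notebook. In a plain script you’d wrap the calls with `asyncio.run`, roughly like this:

```python
import asyncio

from parsera import Parsera


async def main() -> list:
    scraper = Parsera()
    url = "https://growthlist.co/series-a-startups/"
    elements = {
        "Website": "Website of the company",
        "Country": "Country of the company",
    }
    # arun is awaited inside the coroutine; asyncio.run drives the event loop
    return await scraper.arun(url=url, elements=elements)


all_startups = asyncio.run(main())
```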
Since we have each company’s country, we can filter for the ones we’re interested in.
Let’s narrow the search down to the United States:
```python
us_websites = [
    item["Website"] for item in all_startups if item["Country"] == "United States"
]
```
Step 2: Finding careers pages
Now we have a list of websites of new Series A startups from the US.
The next step is to find their careers pages. We’ll do this straightforwardly, by extracting the careers page link from each company’s main page:
```python
from urllib.parse import urljoin

# Define our target
careers_target = {"url": "Careers page url"}

careers_pages = []
for website in us_websites:
    website = "https://" + website
    result = await scraper.arun(url=website, elements=careers_target)
    if len(result) > 0:
        url = result[0]["url"]
        # Resolve relative links against the company's main page
        if url.startswith("/") or url.startswith("./"):
            url = urljoin(website, url)
        careers_pages.append(url)
```
Note that there is an option to replace this step with a search API call, swapping LLM calls for search calls, as sketched below.
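A minimal sketch of that alternative; `search_top_url` is a hypothetical helper standing in for whichever search provider you choose:

```python
# Hypothetical helper: return the top result URL for a query using your
# search provider of choice (SerpAPI, Bing, Brave, etc.).
def search_top_url(query: str) -> str | None:
    raise NotImplementedError("plug in your search API here")


careers_pages = []
for website in us_websites:
    # e.g. "site:example.com careers"
    top_hit = search_top_url(f"site:{website} careers")
    if top_hit:
        careers_pages.append(top_hit)
```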
Step 3: Scraping open jobs
The last step is to load all open jobs from the careers pages of the websites.
Let’s say we’re looking for a software engineering job. We’ll extract the job title, location, link, and whether the posting is related to software engineering:
```python
jobs_target = {
    "Title": "Title of the job",
    "Location": "Location of the job",
    "Link": "Link to the job post",
    "SE": "True if this is a software engineering job, otherwise False",
}

jobs = []
for page in careers_pages:
    result = await scraper.arun(url=page, elements=jobs_target)
    if len(result) > 0:
        for row in result:
            row["url"] = page
            # Resolve relative job links against the careers page URL
            row["Link"] = urljoin(row["url"], row["Link"])
        jobs.extend(result)
```
All jobs are now extracted, so we can filter out everything that isn’t software engineering and save the result to a .csv file:
```python
import csv

engineering_jobs = [job for job in jobs if job["SE"] == "True"]

# newline="" avoids blank rows when writing CSV on Windows
with open("jobs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(engineering_jobs[0].keys())
    for job in engineering_jobs:
        writer.writerow(job.values())
```
In the end, we have a table of jobs that looks like this:
| Title | Location | Link | SE | url |
| --- | --- | --- | --- | --- |
| AI Tech Lead Manager | Bengaluru | https://job-boards.greenhouse.io/enterpret/jobs/6286095003 | True | https://boards.greenhouse.io/enterpret/ |
| Backend Developer | Tel Aviv | https://www.upwind.io/careers/co/tel-aviv/BA.04A/backend-developer/all#jobs | True | https://www.upwind.io/careers |
| … | … | … | … | … |
Conclusion
As a next step, we could repeat the same process on each full job listing to extract more information, like the tech stack, or to filter for remote roles. That would save the time spent manually reviewing every page.
You can try it yourself by iterating over the Link fields and extracting the elements you’re interested in, as sketched below.
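A minimal sketch of that follow-up pass, reusing the same scraper; the element descriptions are illustrative rather than taken from the pipeline above:

```python
# Illustrative second-pass targets; adjust the descriptions to your needs.
details_target = {
    "Stack": "Technologies mentioned in the job description",
    "Remote": "True if the job can be done remotely, otherwise False",
}

detailed_jobs = []
for job in engineering_jobs:
    result = await scraper.arun(url=job["Link"], elements=details_target)
    if len(result) > 0:
        # Merge the extra fields into the original job record
        detailed_jobs.append({**job, **result[0]})
```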
I hope you found this article helpful. If you have any questions, let me know.