Search startup jobs with Python and LLMs

Company websites often contain job listings that never make it to popular job boards.
Finding remote startup jobs, for example, can be challenging because these companies may not be listed on job boards at all.
To find these jobs, you need to:

  • Find promising companies
  • Search for their career pages
  • Analyze available job listings
  • Manually record job details
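The four manual steps above map onto a simple pipeline. As a rough sketch (the function names and site below are placeholders of my own, not part of any library):

```python
# Placeholder pipeline mirroring the manual steps above; each stage
# gets a real implementation later in the article.
def find_companies():
    # Step 1: promising companies (hypothetical example domain).
    return ["example-startup.com"]

def find_careers_page(site):
    # Step 2: locate the careers page.
    return f"https://{site}/careers"

def list_jobs(careers_url):
    # Step 3: read the job listings.
    return [{"Title": "Backend Engineer", "Link": careers_url + "/1"}]

def record(jobs):
    # Step 4: record the details we care about.
    return [(j["Title"], j["Link"]) for j in jobs]

rows = record(list_jobs(find_careers_page(find_companies()[0])))
```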

This takes a lot of time, so we are going to automate it.

Preparation

We’ll use the Parsera library to automate job scraping. Parsera provides two usage options:

  • Local: Pages are processed on your machine using an LLM of your choice;
  • API: All processing occurs on Parsera’s servers.

In this example we’ll go with the Local option, since this is a one-time, small-scale extraction.

To get started, install the required packages:

pip install parsera
playwright install

Since we’re running the local setup, an LLM connection is required.
We’ll use OpenAI’s gpt-4o-mini for simplicity, since it only requires setting an environment variable:

import os

from parsera import Parsera

os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY_HERE>"

# Parsera falls back to gpt-4o-mini when no model is passed,
# so setting the API key is all we need here.
scraper = Parsera()

With everything set up, we’re ready to start scraping.

Step 1: Getting a list of fresh Series A startups

First, we need to find a list of companies of interest and their websites.
I’ve found a list of 100 Series A startups that closed their rounds last month.
Growing companies with fresh rounds seem like a good place to look.

Let’s grab the country and website of these companies:

url = "https://growthlist.co/series-a-startups/"
elements = {
    "Website": "Website of the company",
    "Country": "Country of the company",
}
all_startups = await scraper.arun(url=url, elements=elements)

Since we have the country, we can filter for the country of interest.
Let’s narrow our search down to the United States:

us_websites = [
    item["Website"] for item in all_startups if item["Country"] == "United States"
]
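For reference, `arun` returns the extracted rows as a list of dictionaries keyed by the element names you asked for, so filtering is plain Python. A tiny illustration with invented data (these rows are made up, not real scraper output):

```python
# Invented rows, shaped like the scraper's output; not real results.
all_startups = [
    {"Website": "acme.dev", "Country": "United States"},
    {"Website": "beispiel.de", "Country": "Germany"},
]

# Keep only the websites of US companies.
us_websites = [
    item["Website"] for item in all_startups if item["Country"] == "United States"
]
print(us_websites)  # ['acme.dev']
```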

Step 2: Finding Careers pages

Now we have a list of websites of new Series A startups from the US.
The next step is to find their careers pages. We’ll do this straightforwardly, by extracting careers-page links from their main pages:

from urllib.parse import urljoin

# Define our target
careers_target = {"url": "Careers page url"}

careers_pages = []
for website in us_websites:
    website = "https://" + website
    result = await scraper.arun(url=website, elements=careers_target)
    if len(result) > 0:
        url = result[0]["url"]
        if url.startswith("/") or url.startswith("./"):
            url = urljoin(website, url)
        careers_pages.append(url)
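The `urljoin` call above is what turns relative careers links into absolute URLs, while already-absolute links pass through untouched. For example:

```python
from urllib.parse import urljoin

# A relative path is resolved against the company's site...
print(urljoin("https://example.com", "/careers"))
# -> https://example.com/careers

# ...while an already-absolute URL is returned unchanged.
print(urljoin("https://example.com", "https://jobs.example.com/openings"))
# -> https://jobs.example.com/openings
```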

Note that this step could be replaced with a call to a search API, swapping LLM calls for search calls.
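One way to sketch that alternative: build one search-engine query per company and take the top hit as the careers page. The helper below only constructs the queries; the actual search call depends on whichever search API you choose (none is assumed here):

```python
def careers_queries(domains):
    """Build one 'find the careers page' query per company domain."""
    # The query shape is a common search-engine idiom, not tied to any API.
    return [f"site:{domain} careers OR jobs" for domain in domains]

queries = careers_queries(["acme.dev", "foo.io"])
print(queries[0])  # site:acme.dev careers OR jobs
```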

Step 3: Scraping open jobs

The last step is to load all open jobs from the careers pages we collected.
Let’s say we are looking for a software engineering job; then we’ll extract the job title, location, link, and whether the role is related to software engineering:

jobs_target = {
    "Title": "Title of the job",
    "Location": "Location of the job",
    "Link": "Link to the job post",
    "SE": "True if this is a software engineering job, otherwise False",
}

jobs = []
for page in careers_pages:
    result = await scraper.arun(url=page, elements=jobs_target)
    if len(result) > 0:
        for row in result:
            row["url"] = page
            row["Link"] = urljoin(row["url"], row["Link"])
    jobs.extend(result)

All jobs are extracted, so we can filter out everything that isn’t software engineering and save the rest to a .csv file:

import csv

engineering_jobs = [job for job in jobs if job["SE"] == "True"]

# newline="" avoids blank rows on Windows when writing CSV files
with open("jobs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(engineering_jobs[0].keys())
    for job in engineering_jobs:
        writer.writerow(job.values())

At the end, we have a table with a list of jobs that looks like this:

Title                | Location  | Link                                                                        | SE   | url
AI Tech Lead Manager | Bengaluru | https://job-boards.greenhouse.io/enterpret/jobs/6286095003                  | True | https://boards.greenhouse.io/enterpret/
Backend Developer    | Tel Aviv  | https://www.upwind.io/careers/co/tel-aviv/BA.04A/backend-developer/all#jobs | True | https://www.upwind.io/careers

Conclusion

As a next step, we could repeat the same process to extract more info from each full job listing, like getting the tech stack or filtering for remote roles. This would save the time spent manually reviewing every page.
You can try it yourself by iterating over the Link fields and extracting the elements you are interested in.

I hope you found this article helpful; if you have any questions, let me know.
