ChatGPT helped me scrape cocktail websites to create a universal drink index.
Lately, I’ve become a fan of making cocktails. After getting the best equipment and booze money can buy, I realized the leading blocker for me was cocktail recipes. I wanted a platform where I could input the ingredients I have at home, and the output would be a list of cocktails I could make. There are a few apps for that, but they are limited in their variety unless you pay for the app. And since the only thing I like more than cocktails is free cocktails, I was looking for an alternative.
It was about a week after ChatGPT was released, and it had already become my best friend. Could it also be my drinking buddy? I realized many recipes were available online, and I just had to index them correctly. Could ChatGPT help me build a tool to index them myself?
TLDR — yes, here is the GitHub repo. There are even ChatGPT-generated READMEs.
The Plan
I decided to give it a go. I wanted to use ChatGPT to build a simple system to index cocktails from the web. The solution would have two parts:
- Crawler: a process that takes a domain to crawl and pushes URLs that appear to be cocktail recipes to a queue. It also follows the links on each page recursively to find other pages with recipes. This seemed like an easy enough task to let ChatGPT code on its own.
- Indexer: a component that takes a URL, determines whether the page contains a cocktail recipe, and stores it in a database. The problem is that blogs with cocktail recipes are highly unstructured and have a lot of unnecessary text before getting to the point. Could ChatGPT help me make sense of this mess?
With the plan in place, I set out to start building. I started by laying down the architecture and had to decide what queue to use. ChatGPT came to the rescue:
After reviewing its suggestions, I decided to go with RabbitMQ since I had some experience with it. I asked ChatGPT to lay out the foundations for me:
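For illustration, the producer side of such a setup can be sketched with pika as follows. This is not the code ChatGPT generated; it is a minimal sketch that assumes a RabbitMQ broker on localhost and reuses the `drink_urls` queue name from the indexer code later in this post. The helper names `open_channel` and `publish_url` are hypothetical.

```python
QUEUE_NAME = "drink_urls"  # same queue name the indexer in this post consumes from


def open_channel(host="localhost"):
    """Open a blocking connection to RabbitMQ and declare the durable queue."""
    import pika  # imported lazily so publish_url can be exercised without pika installed

    connection = pika.BlockingConnection(pika.ConnectionParameters(host))
    channel = connection.channel()
    # Declaring the queue is idempotent: it is created only if it does not exist yet.
    channel.queue_declare(queue=QUEUE_NAME, durable=True)
    return connection, channel


def publish_url(channel, url):
    """Send one candidate recipe URL to the queue through the default exchange."""
    channel.basic_publish(
        exchange="",             # the default exchange routes by queue name
        routing_key=QUEUE_NAME,
        body=url.encode("utf-8"),
    )


# Usage, assuming a broker is running locally:
#   connection, channel = open_channel()
#   publish_url(channel, "https://example.com/recipes/negroni")
#   connection.close()
```

The crawler would call `publish_url` for every page that looks like a recipe, and the indexer would consume from the same queue.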
With the project set up, the next step was for ChatGPT to develop the crawler.
The Crawler
I asked ChatGPT to do the work for me, and boy, did it deliver:
When I ran the program, I ran into two issues. First, the visited URLs were not cached, and since almost every page linked back to the home page, the crawler ended up in an infinite loop. I asked ChatGPT to handle that, and it modified the code correctly:
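As an illustration of this kind of fix (not the code ChatGPT produced), here is a minimal sketch of a crawl loop that remembers visited URLs. The `get_links` callable is a hypothetical stand-in for "download the page and return the URLs it links to", and `max_pages` is an assumed safety limit.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl that visits each URL at most once.

    get_links is a hypothetical callable: it takes a URL and returns
    the URLs linked from that page.
    """
    visited = set()              # cache of URLs that were already processed
    queue = deque([start_url])
    order = []                   # URLs in the order they were crawled
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue             # already handled, so a link back home cannot loop forever
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return order
```

Passing the link extraction in as a function keeps the loop easy to try out on a small in-memory site map before pointing it at a real website.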
The second issue was that it started crawling other domains as well. For example, when a page contained a video, the crawler went on to crawl YouTube too. I asked ChatGPT to fix that as well, and it obliged:
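A domain restriction like this can be sketched with the standard library alone. Again, this is an illustration under assumptions rather than the generated code: `same_domain_links` is a hypothetical helper, and it treats only an exact host match as "the same domain", so `www.example.com` and `example.com` would count as different sites.

```python
from urllib.parse import urljoin, urlparse


def same_domain_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep only http(s) links on the same host."""
    base_host = urlparse(page_url).netloc.lower()
    links = []
    for href in hrefs:
        absolute = urljoin(page_url, href)   # turns relative links into absolute URLs
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc.lower() == base_host:
            links.append(absolute)
    return links
```

With a filter like this in front of the crawl queue, an embedded YouTube link is dropped before it is ever fetched.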
And that was it! The crawler was ready, and the next step was to code the indexer.
The Indexer
As I mentioned before, parsing the content of a web page to identify if it describes a cocktail recipe is challenging. Here is an example of a cocktail recipe web page:
It’s almost impossible to handle all the different use cases to parse the page and extract the ingredients. Once again — ChatGPT to the rescue!
I came across an unofficial Python implementation of the ChatGPT API and decided to use it in my indexer. The idea was simple: with a well-crafted prompt, I should be able to use ChatGPT to extract the ingredients from the cocktail page. Since ChatGPT has no internet access, it couldn’t code this part, but it did help me with the generic components. Here is the code I used:
import pika
import time
import sys
import requests
from bs4 import BeautifulSoup
from revChatGPT.revChatGPT import Chatbot
from db import CocktailRecipe, Ingredient, session
import traceback

print(sys.argv)
config = {
    "session_token": sys.argv[1],
    "cf_clearance": sys.argv[2],
    "user_agent": sys.argv[3]
}
chatbot = Chatbot(config, conversation_id=None)

# connect to RabbitMQ
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# create a queue to consume messages from, if it does not already exist
channel.queue_declare(queue='drink_urls', durable=True)

QUESTION = """ From the following text below, please understand if it is an article describing a cocktail recipe. If it is, output the following: The first line should be: "<cocktail name>: <ranking>". If the ranking doesn't exist output "-". use only lowercase letters and the generic name of the cocktail , without mentioning brands. the following lines contain the ingredients: each line should contain one ingredient and it's amount in the format "<ingredient>: <amount>". The ingredient part (before the semicolon) should contain only the ingredient name, not the amount. If it is not a cocktail recepie, output only the word "no". Output the ingredients in their generic name, and don't include a brand. For example, instead of "bacardi white rum" output "white rum". All of the output should be lower cased, don't capitalize any word. The text is: %s"""

# define a callback function to process incoming messages
def process_message(ch, method, properties, body):
    time.sleep(5)
    try:
        print(body)
        page = requests.get(body)
        soup = BeautifulSoup(page.content, "html.parser")
        response = chatbot.get_chat_response(QUESTION % soup.text, output="text")['message']
        if response == "no":
            print("not a cocktail")
            return
        print(response)
    except Exception:
        print(traceback.format_exc())

# consume messages from the queue, using the callback function to process them
channel.basic_consume(queue='drink_urls', on_message_callback=process_message, auto_ack=True)

# start consuming messages
channel.start_consuming()
This worked like a charm. ChatGPT did an excellent job of determining whether a page contained a cocktail recipe, and when it did, the output was almost always in the correct format. I used to think coding in Python was as close to plain English as coding could get, but it was never this close.
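Because the prompt pins the reply to a fixed shape (a "<cocktail name>: <ranking>" line followed by "<ingredient>: <amount>" lines, or just "no"), a small parser can turn it into structured data. The following is a minimal sketch assuming the reply follows that format; `parse_response` and the dictionary layout are hypothetical names chosen for illustration and are not part of the indexer code above.

```python
def parse_response(text):
    """Parse a reply in the prompt's format; return None when it is not a recipe."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or lines[0].lower() == "no":
        return None

    # First line: "<cocktail name>: <ranking>", where "-" means no ranking was found.
    name, _, ranking = lines[0].partition(":")
    ranking = ranking.strip()

    ingredients = []
    for line in lines[1:]:
        ingredient, sep, amount = line.partition(":")
        if not sep:
            continue  # skip lines that do not match "<ingredient>: <amount>"
        ingredients.append((ingredient.strip(), amount.strip()))

    return {
        "name": name.strip(),
        "ranking": None if ranking in ("", "-") else ranking,
        "ingredients": ingredients,
    }
```

Skipping malformed lines instead of raising is a deliberate choice here, since the output is only "almost always" in the correct format.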
Now I just had to store it in a database. I asked ChatGPT to create the database for me:
Then, I gave it a sample input generated by itself and asked it to write code for inserting the information into the database. The initial implementation it provided was using psycopg2. I asked it to use SQLAlchemy as I find it easier to work with an ORM:
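To give a sense of what such an ORM layer can look like, here is a minimal SQLAlchemy sketch. Only the model names `CocktailRecipe` and `Ingredient` come from the indexer's imports; the table names, columns, the `save_recipe` helper, and the in-memory SQLite engine are all assumptions made for illustration, and the real schema may well differ.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()


class CocktailRecipe(Base):
    __tablename__ = "cocktail_recipes"      # assumed table name
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    ranking = Column(String)                # kept as text; None when the page had no ranking
    url = Column(String)
    ingredients = relationship(
        "Ingredient", back_populates="recipe", cascade="all, delete-orphan"
    )


class Ingredient(Base):
    __tablename__ = "ingredients"           # assumed table name
    id = Column(Integer, primary_key=True)
    recipe_id = Column(Integer, ForeignKey("cocktail_recipes.id"), nullable=False)
    name = Column(String, nullable=False)
    amount = Column(String)
    recipe = relationship("CocktailRecipe", back_populates="ingredients")


def save_recipe(session, name, ranking, ingredients, url=None):
    """Insert one recipe with its (ingredient, amount) pairs and return its id."""
    recipe = CocktailRecipe(name=name, ranking=ranking, url=url)
    recipe.ingredients = [Ingredient(name=n, amount=a) for n, a in ingredients]
    session.add(recipe)
    session.commit()
    return recipe.id


# In-memory SQLite keeps the sketch self-contained; swap the URL for a real database.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
```

Storing each ingredient as its own row is what later makes a "which cocktails can I make with what I have" query possible.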
I integrated this code into the indexer and finally had the cocktail database of my dreams!
Next, I will use ChatGPT to help me create an API and UI to browse the cocktails database and choose the one I want to make today.
Conclusion
This whole project took me roughly 3 hours from end to end. ChatGPT’s ability to accelerate my development process blew my mind. The biggest win was the text parsing, which I could not have done myself and which ChatGPT made simple.
However, as tasks got more complex, the chances of ChatGPT performing them correctly dropped drastically. When pairing with ChatGPT, it is still the developer’s job to break the work down into tasks small enough for it to swallow, at least for now.
I’m excited to see what comes next!