There are lots of ways to leverage generative AI (GenAI) in a variety of business use cases at companies of all sizes. In this post, we will explore how a store selling crystals and precious stones can use DataStax’s RAGStack to help their customers to identify and find certain crystals. Specifically, we will walk through creating an application designed to help the customers of Healing House Energy Spa (owned by the author’s wife). This will also demonstrate how small businesses can take advantage of GenAI.
What is RAGStack?
RAGStack is DataStax’s Python library that’s designed to help developers build advanced GenAI applications based on retrieval-augmented generation (RAG) techniques. These applications require developers to configure and access data parsers, large language models (LLMs), and vector databases.
With RAGStack, developers can increase their productivity with GenAI toolsets by interacting with them through a single development stack. DataStax’s integrations with many commonly used libraries and providers enable developers to prototype and build applications faster than ever before. All of this happens on top of DataStax Astra DB, which is DataStax’s powerful, multi-region vector database (as shown in Figure 1).
Figure 1 – A high-level view of the Crystal Search application architecture, showing how it leverages RAGStack.
As Astra DB is a key component of RAGStack, we should spend some time discussing vector databases. These are special kinds of databases capable of storing vector data in native structures. When we build RAG applications, we interact with an LLM by using a “vectorized” version of our data. Essentially, the vectors returned are a numerical representation of the individual elements or “chunks” of our data. We will discuss this process in more detail below.
The Crystal Search application
Here we’ll walk through how to build up a simple web application to search an inventory of crystals (and other precious stones). We’ll load our data from a CSV file, and then query it using a Flask-based web application with navigation drop-downs and a search-by-image function.
The crystals themselves have several properties:
- Name: What the crystal is known as.
- Image: The filename of the on-disk image of the crystal.
- Chakras: One or more of the seven centers of spiritual power in the human body that the crystal can help attune.
- Birth month: People with certain birth months will be more receptive to this crystal.
- Zodiac sign: People born under certain zodiac signs will be more receptive to this crystal.
- Mohs hardness: A measure of the crystal’s resistance to scratching.
For our drop-down navigation, we will use a crystal’s recommended chakras, birth month, and zodiac signs. The remaining properties will be added to the collection’s metadata (except for the image itself, which will be used to generate the crystal’s vector embedding).
We will use the CLIP model to generate our vector embeddings. CLIP (Contrastive Language-Image Pre-training) is a model developed by OpenAI, available through the sentence-transformers library, that embeds both images and text into the same vector space. Because CLIP is pre-trained on images paired with text descriptions, similar images end up close together in that space, and we can retrieve them with an approximate nearest neighbor (ANN) search. Leveraging CLIP in this way allows us to support an “identify this crystal” function, where users will be able to search with a picture from their device.
Requirements
Before building our application, let’s make sure that our development environment is properly configured. We will start by making sure that we are running Python 3.9 or later. We will also need the following libraries (and versions), as specified in our [requirements.txt](https://github.com/aar0np/crystalSearch/blob/main/requirements.txt) file:
    Flask==2.3.2
    Flask-WTF==1.2.1
    sentence-transformers==2.2.2
    ragstack-ai==0.8.0
    python-dotenv==1.0.0
We can install all of them with pip:
    pip install -r requirements.txt
Flask directory structure
As we are working with a Flask web application, we will need the following directory structure, with crystalSearch as the “root” of the project:
    crystalSearch/
        templates/
        static/
            images/
            input_images/
            web_images/
DataStax Astra DB
First, we need to sign up for a free account with DataStax Astra DB, and create a new vector database. Once we have our Astra DB vector database, we will make note of the token and API endpoint. We will define those as environment variables in the next section.
Environment variables
For our application to run properly, we’ll need to set some environment variables:
- ASTRA_DB_API_ENDPOINT: Connection endpoint for our Astra DB vector database instance.
- ASTRA_DB_APPLICATION_TOKEN: Security token used to authenticate to our Astra DB instance.
- FLASK_APP: The name of the application’s primary Python file in a Flask web project.
- FLASK_ENV: Indicates to Flask if the application is in development or production mode.
Of course, the easiest way to do that is with an .env file. Our .env file should look something like this:
    ASTRA_DB_API_ENDPOINT=https://notreal-blah-4444-blah-blah-region.apps.astra.datastax.com
    ASTRA_DB_APPLICATION_TOKEN=AstraCS:NotReal:ButYourTokenWillLookSomethingLikeThis
    FLASK_APP=crystalSearch
    FLASK_ENV=development
Setting the FLASK_APP variable to “crystalSearch” is important, as it tells Flask which Python module is the primary entrypoint to the application.
crystalLoader.py
With our database and environment all set up, we can build our Python data loader. Create a new Python file named crystalLoader.py, and set up its imports like this:
    import csv
    import json
    from os import path, environ
    from dotenv import load_dotenv
    from PIL import Image
    from astrapy.db import AstraDB
    from sentence_transformers import SentenceTransformer
We will start by bringing in the environment variables from our .env file:
    basedir = path.abspath(path.dirname(__file__))
    load_dotenv(path.join(basedir, '.env'))
Next, we will pull in the application endpoint and token, instantiate a database connection object, and then create a new collection named “crystal_data”:
    # Astra connection
    ASTRA_DB_APPLICATION_TOKEN = environ.get("ASTRA_DB_APPLICATION_TOKEN")
    ASTRA_DB_API_ENDPOINT = environ.get("ASTRA_DB_API_ENDPOINT")
    db = AstraDB(
        token=ASTRA_DB_APPLICATION_TOKEN,
        api_endpoint=ASTRA_DB_API_ENDPOINT,
    )
    # create "collection"
    col = db.create_collection("crystal_data", dimension=512, metric="cosine")
Note that our collection’s vector field is defined with 512 dimensions, so that it matches the size of the vector embeddings created with the CLIP model. Astra DB supports ANN searches using a cosine, dot product, or Euclidean similarity metric. For our purposes, cosine similarity will be fine.
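To make the cosine metric more concrete, here is a minimal pure-Python sketch of the calculation. It only illustrates the math; Astra DB computes similarity on the server side, and the function name cosine_similarity and the toy vectors are our own assumptions rather than part of the application.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors; the CLIP embeddings in this post have 512 dimensions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Vectors pointing in the same direction score close to 1 regardless of their length, while unrelated (orthogonal) vectors score 0. This is why cosine is a common choice when only the direction of an embedding matters.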
Next, we will define some constants to help our loader:
    model = SentenceTransformer('clip-ViT-B-32')
    IMAGE_DIR = "static/images/"
    CSV = "gemstones_and_chakras.csv"
These will instantiate the clip-ViT-B-32 model locally, define the location of our images, and set the name of our data file, respectively.
Now let’s open the CSV file in a with block and initialize the data reader:
    with open(CSV) as csvHandler:
        crystalData = csv.reader(csvHandler)
        # skip header row
        next(crystalData)
Our CSV file has a header row that we will skip at read-time. Python’s built-in next() function advances the reader by one row, which is an easy way to step past the header.
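The header-skipping step can be tried in isolation with an in-memory file. This is a small sketch using made-up sample data with only two columns (the loader in this post reads at least twelve); the variable names are hypothetical.

```python
import csv
import io

# Hypothetical two-column sample standing in for the real CSV file.
sample = io.StringIO("gemstone,image\nAmethyst,amethyst.jpg\nJade,jade.jpg\n")

reader = csv.reader(sample)
header = next(reader)   # consumes the first row (the header)
rows = list(reader)     # only the data rows remain
```

After the call to next(), iterating over the reader starts at the first data row.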
With that complete, we can now use a for loop to work through the remaining lines in the file. We will first read the line’s image column. As our application is very image-centric, we do not want to spend time processing a line if it doesn’t have a valid image. We will use an if conditional to make sure that the file referenced by the image column is both:
- not empty
- a valid file that exists
        for line in crystalData:
            image = line[1]
            # Only load crystals with images
            if image != "" and path.exists(IMAGE_DIR + image):
                # map columns
                gemstone = line[0]
                alt_name = line[2]
                chakras = line[3]
                phys_attributes = line[4]
                emot_attributes = line[5]
                meta_attributes = line[6]
                origin = line[7]
                description = line[8]
                birth_month = line[9]
                zodiac_sign = line[10]
                mohs_hardness = line[11]
If a line’s image is indeed valid, we will then map the remaining columns to local variables.
Two of our variables, chakras and mohs_hardness, will require some extra processing before being written into Astra DB. Our chakra data comes from the file as a comma-delimited list, because crystals can affect multiple chakras. Therefore, we will need to reconstruct it into an array with each item wrapped in quotation marks, so that it is recognized as valid JSON. To do that, we will simply replace each comma (and the space after it) with a double-quoted comma:
                # reformat chakras to be more JSON-friendly
                chakras = chakras.replace(', ','","')
This will not make it valid JSON on its own, so we will account for that later when we write the chakra data.
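To see why the replacement alone is not enough, the sketch below applies it to a made-up cell value and then adds the surrounding brackets and quotes, which is what the loader does when it assembles the document. It also shows an alternative (our assumption, not the approach used in this post) that splits the value into a Python list instead.

```python
import json

chakras = "Heart, Throat, Crown"  # hypothetical cell value from the CSV

# Approach used in this post: swap each ", " for '","', then wrap the whole
# value in ["..."] when the JSON string is assembled.
replaced = chakras.replace(', ', '","')
as_json_text = f'["{replaced}"]'
from_replace = json.loads(as_json_text)

# Alternative sketch: split into a list and let the json module do the quoting.
from_split = [c.strip() for c in chakras.split(",")]
```

Both routes produce the same three-element list. The split version also tolerates values that are separated by a comma without a trailing space.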
Every precious stone has a rating on the Mohs hardness scale, which indicates its resistance to scratching. While some crystals in our data set have a single value, several occupy a range on the scale (with the minimum listed first), indicating a minimum and a maximum Mohs hardness. We will split out these values, and store them as mohs_min_hardness and mohs_max_hardness, respectively. Do note that sometimes the mohs_hardness column will have a value of “Variable” or “Varies,” so we will account for that possibility as well:
                # split out minimum and maximum mohs hardness
                mh_list = mohs_hardness.split('-')
                mohs_min_hardness = 1.0
                mohs_max_hardness = 9.0
                if mh_list[0][0:4] != 'Vari':
                    mohs_min_hardness = mh_list[0]
                    mohs_max_hardness = mh_list[0]
                    if len(mh_list) > 1:
                        mohs_max_hardness = mh_list[1]
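The same parsing logic can be wrapped in a small helper that always returns numbers, which would matter if the values were ever filtered or sorted numerically. This is a sketch under our own assumptions: the name parse_mohs is hypothetical, the 1.0 and 9.0 defaults mirror the loader above, and treating an empty cell like “Varies” is our choice, not something the loader does.

```python
def parse_mohs(raw, default_min=1.0, default_max=9.0):
    """Return (min, max) hardness as floats from '7', '6.5-7', or 'Varies'."""
    value = raw.strip()
    # "Variable" and "Varies" share the prefix "Vari"; empty cells get defaults too.
    if value == "" or value[:4] == "Vari":
        return default_min, default_max
    parts = [p.strip() for p in value.split("-")]
    low = float(parts[0])
    high = float(parts[1]) if len(parts) > 1 else low
    return low, high
```

For example, parse_mohs("6.5-7") yields the pair (6.5, 7.0), and a single value such as "7" is used for both the minimum and the maximum.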
With our data prepared, we can now build each crystal’s text and metadata properties:
                metadata = (f"gemstone: {gemstone}")
                text = (
                    f"gemstone: {gemstone}| "
                    f"alternate name: {alt_name}| "
                    f"physical attributes: {phys_attributes}| "
                    f"emotional attributes: {emot_attributes}| "
                    f"metaphysical attributes: {meta_attributes}| "
                    f"origin: {origin}| "
                    f"maximum mohs hardness: {mohs_max_hardness}| "
                    f"minimum mohs hardness: {mohs_min_hardness}"
                )
Next, we can load the crystal’s image using Pillow (Python’s image processing library) and generate a vector embedding for it with the encode() function from our CLIP model:
                img_emb = model.encode(Image.open(IMAGE_DIR + image))
With all that complete, we are ready to build our local JSON document as a string:
                strJson = (f' {{"_id":"{image}","text":"{text}","chakra":["{chakras}"],"birth_month":"{birth_month}","zodiac_sign":"{zodiac_sign}","$vector":{str(img_emb.tolist())}}}')
Finally, we can convert each crystal’s data to JSON and write it into Astra DB:
                doc = json.loads(strJson)
                col.insert_one(doc)
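One thing to keep in mind with a hand-built JSON string is that a double quote or backslash inside any field would make json.loads() fail. As a hedged alternative (not the approach used in this post), the document can be assembled directly as a Python dictionary, which is what insert_one() receives in the end anyway. The helper name build_document and the sample values are hypothetical.

```python
import json

def build_document(image, text, chakras, birth_month, zodiac_sign, vector):
    """Assemble the document as a dict; no manual quoting or escaping needed."""
    return {
        "_id": image,
        "text": text,
        "chakra": [c.strip() for c in chakras.split(",")],
        "birth_month": birth_month,
        "zodiac_sign": zodiac_sign,
        "$vector": list(vector),
    }

# Hypothetical values; note the double quotes inside the text field.
doc = build_document(
    "quartz.jpg",
    'gemstone: Quartz| alternate name: "Rock Crystal"',
    "Crown, Third Eye",
    "April",
    "Aries",
    [0.1, 0.2, 0.3],
)
round_tripped = json.loads(json.dumps(doc))
```

In the loader, the vector argument would be img_emb.tolist(), and the chakras argument would be the raw comma-delimited value from the CSV rather than the reformatted one.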
crystalSearch.py
To demonstrate the visual aspects of Crystal Search, we will stand up a simple web application using Flask. This interface will have a few simple components, including dropdowns (for navigation) and a way to upload an image for searching.
Note: As web front-end development is not the focus, we’ll skip the implementation details. For those who are interested, the code can be accessed in the project repository listed at the end of this post.
astraConn.py
Now that our data has been loaded, we can build the Crystal Search application. First, we will construct the astraConn module, which will act as an abstraction layer for our interactions with the Astra DB vector database. We will create a new file named astraConn.py and add the following two imports:
    import os
    from astrapy.db import AstraDB
Next, we will pull in our ASTRA_DB_APPLICATION_TOKEN and ASTRA_DB_API_ENDPOINT variables from our system environment, and store them locally:
    ASTRA_DB_APPLICATION_TOKEN = os.environ.get("ASTRA_DB_APPLICATION_TOKEN")
    ASTRA_DB_API_ENDPOINT = os.environ.get("ASTRA_DB_API_ENDPOINT")
This module will have a few different methods that will be called by our application, but we won’t want to rebuild our database connection each time. Therefore, we will create two global variables (db and collection) to keep data pertaining to our database cached:
    db = None
    collection = None
The first method that we will define will be the init_collection() method. This method will be called by every other method in this module. It first declares the db and collection variables as global, so that it can assign to them. Its primary function will be to instantiate the db object if it is None. This way, an existing connection object can be reused. The code for this method is shown below:
    def init_collection(table_name):
        global db
        global collection
        if db is None:
            db = AstraDB(
                token=ASTRA_DB_APPLICATION_TOKEN,
                api_endpoint=ASTRA_DB_API_ENDPOINT,
            )
        collection = db.collection(table_name)
Note that the collection variable will be instantiated on every call. This allows us the flexibility to access different collections in Astra DB with the same database connection information.
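The caching behavior can be checked without a database by swapping in a stand-in client. In this sketch, FakeDB is a hypothetical class that only counts how many times it is constructed; it is not part of astrapy, and the collection names are made up.

```python
class FakeDB:
    """Stand-in for a database client; counts how often it is constructed."""
    instances = 0

    def __init__(self):
        FakeDB.instances += 1

    def collection(self, name):
        return f"collection:{name}"

db = None
collection = None

def init_collection(table_name):
    global db, collection
    if db is None:                            # build the client on first use only
        db = FakeDB()
    collection = db.collection(table_name)    # re-resolved on every call

init_collection("crystal_data")
first = collection
init_collection("other_data")
second = collection
```

After two calls, only one client has been created, while the collection handle reflects whichever name was passed most recently.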
For our application, there are three ways that we will perform reads on our data. We will search by vector, query by id, and then query by three additional properties that we are going to build into dropdowns in our web application.
First, we will build the get_by_vector() method. This asynchronous method will accept a collection name, a vector embedding, and a maximum number of results to be returned (limit, defaulting to 1). After initializing our database and collection, we will invoke the vector_find() method with the vector_embedding, the limit, and the list of fields from the collection that we want to receive. We will then return the results to the calling method.
    async def get_by_vector(collection_name, vector_embedding, limit=1):
        init_collection(collection_name)
        results = collection.vector_find(vector_embedding.tolist(), limit=limit, fields={"text","chakra","birth_month","zodiac_sign","$vector"})
        return results
Our get_by_id() method will be similar to the previous one, but will work quite differently under the hood. This method is also meant to be called asynchronously, and accepts a collection name as well as the identifier to be queried. As querying by a unique identifier is deterministic, we can invoke the find_one() method with a filter for the specific id, as shown below:
    async def get_by_id(collection_name, id):
        init_collection(collection_name)
        result = collection.find_one(filter={"_id": id})
        return result
This method will return a single JSON document as the result.
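Because these methods are declared with async def, calling one returns a coroutine that has to be awaited (or driven by an event loop) before a result is available. The sketch below shows the calling pattern with a stub in place of the real method; the stub’s return value and the main() wrapper are assumptions for illustration only.

```python
import asyncio

async def get_by_id(collection_name, id):
    # Stand-in for the astraConn method; returns a canned document
    # instead of querying Astra DB.
    return {"_id": id, "collection": collection_name}

async def main():
    # Inside another coroutine, the call is awaited.
    return await get_by_id("crystal_data", "amethyst.jpg")

# From synchronous code, asyncio.run() drives the coroutine to completion.
result = asyncio.run(main())
```

Note that the database calls inside the real methods are ordinary blocking calls, so awaiting them does not by itself make the queries run concurrently.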
Finally, get_by_dropdowns() is an asynchronous method that will return all matching rows based on the values of three properties: chakras, birth month, and zodiac sign. First, we will build an array to hold our conditions. This is necessary because not every dropdown is going to be used each time. That way we can dynamically build our conditions based on the state of the dropdowns at query-time.
    async def get_by_dropdowns(collection_name, chakra, birth_month, zodiac_sign):
        init_collection(collection_name)
        conditions = []
        if chakra != "--Chakra--":
            condition_chakra = {"chakra": {"$in": [chakra]}}
            conditions.append(condition_chakra)
        if birth_month != "--Birth Month--":
            condition_birth_month = {"birth_month": birth_month}
            conditions.append(condition_birth_month)
        if zodiac_sign != "--Zodiac Sign--":
            condition_zodiac_sign = {"zodiac_sign": zodiac_sign}
            conditions.append(condition_zodiac_sign)
        crystal_filter = ""
        if len(conditions) > 2:
            crystal_filter = {"$and": [{"$and": [conditions[0], conditions[1]]}, conditions[2]]}
        elif len(conditions) > 1:
            crystal_filter = {"$and": [conditions[0], conditions[1]]}
        elif len(conditions) > 0:
            crystal_filter = conditions[0]
        else:
            return
        results = collection.find(crystal_filter)
        return results
Once the conditions array is built, we can then build crystal_filter to use as our query filter. To pass a filter with multiple conditions through Astra DB’s Data API, we need to build a nested conditional statement.
A single condition could be sent as a filter on its own, but two would need to be combined with the $and operator. If we were to hard-code our filter, it would be similar to this example:
    crystal_filter = {"$and": [{"birth_month": "October"}, {"zodiac_sign": "Libra"}]}
Of course, this also means that three conditions would require a nested $and (one $and inside of another), like this:
    crystal_filter = {"$and": [{"$and": [{"birth_month": "October"}, {"zodiac_sign": "Libra"}]}, {"chakra": {"$in": ["Heart"]}}]}
Note that as each crystal’s chakra property is an array, we need to use the $in operator.
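The one, two, and three condition cases follow the same pattern, so the nesting can also be produced by folding the list of conditions from left to right. This is a sketch of that idea, assuming a hypothetical helper named build_filter; it generates the same filter shapes shown above rather than changing how the Data API is called.

```python
from functools import reduce

def build_filter(conditions):
    """Combine any number of conditions into the nested $and shape."""
    if not conditions:
        return None  # nothing selected; the caller can skip the query
    # With a single condition, reduce() returns it unchanged.
    return reduce(lambda left, right: {"$and": [left, right]}, conditions)

one = build_filter([{"birth_month": "October"}])
two = build_filter([{"birth_month": "October"}, {"zodiac_sign": "Libra"}])
three = build_filter([
    {"birth_month": "October"},
    {"zodiac_sign": "Libra"},
    {"chakra": {"$in": ["Heart"]}},
])
```

With this approach, adding a fourth dropdown later would not require another branch in the if/elif chain.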
crystalServices.py
Next, we will create a new file named crystalServices.py with the following imports:
    import json
    import os
    from astraConn import get_by_vector
    from astraConn import get_by_id
    from astraConn import get_by_dropdowns
    from sentence_transformers import SentenceTransformer
    from PIL import Image
We will also define some local variables for our image directory, the name of our collection in Astra DB, and our CLIP model:
    INPUT_IMAGE_DIR = "static/input_images/"
    DATA_COLLECTION_NAME = "crystal_data"
    model = None
Our service layer will expose two asynchronous methods. The first method that we will build will be named get_crystals_by_image, and it will accept an image filename as a parameter. It will be primarily responsible for generating a vector embedding from an image, using the embedding to invoke a vector similarity search, and returning the results to the view. This method will need the model global variable, and will instantiate it if required:
    async def get_crystals_by_image(file_path):
        global model
        if model is None:
            model = SentenceTransformer('clip-ViT-B-32')
Next, we will define our result set variable as an empty dictionary. Then we will load the image, generate an embedding for it, and use it to call the get_by_vector() method from astraConn.py:
        results = {}
        img_emb = model.encode(Image.open(INPUT_IMAGE_DIR + file_path))
        crystal_data = await get_by_vector(DATA_COLLECTION_NAME, img_emb, 3)
        if crystal_data is not None:
            for crystal in crystal_data:
                id = crystal['_id']
                results[id] = parse_crystal_data(crystal)
        return results
Finally, we will process and return the vector search results. Note that the parse_crystal_data() method does much of the heavy lifting of building the result set. We will construct that method toward the end of this module.
We will now move on to the get_crystals_by_facets() method. This method accepts the values taken from three dropdown lists containing data for chakras, birth month, and zodiac sign. Similar to the prior method, we will define an empty dictionary for the results and perform a query on our data, before processing and returning the results:
    async def get_crystals_by_facets(chakra, birth_month, zodiac_sign):
        results = {}
        crystal_data = await get_by_dropdowns(DATA_COLLECTION_NAME, chakra, birth_month, zodiac_sign)
        if crystal_data is not None:
            for crystal in crystal_data['data']['documents']:
                id = crystal['_id']
                results[id] = parse_crystal_data(crystal)
        return results
There are also two additional code blocks required to more easily transfer our data back up to the view layer. The first is the parse_crystal_data() method. This method is fairly straightforward in that it takes the raw crystal data as a parameter, and processes each property into a new object of the Crystal class. As the final part of this module, we also need to add the Crystal object class. They will not be shown here, but both of these definitions can be found at the end of the crystalServices.py module.
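For readers who want a feel for what such a parser might look like, here is a rough, hypothetical sketch. It is not the implementation from the repository: the field names on Crystal, the attributes dictionary, and the sample document are all assumptions. The only detail taken from this post is the shape of the text property, which the loader writes as “key: value” pairs separated by pipe characters.

```python
from dataclasses import dataclass, field

@dataclass
class Crystal:
    # Hypothetical fields; the real class in the repository may differ.
    id: str
    chakra: list
    birth_month: str
    zodiac_sign: str
    attributes: dict = field(default_factory=dict)

def parse_crystal_data(raw):
    """Turn one raw document into a Crystal, splitting the 'text' field."""
    attributes = {}
    for pair in raw.get("text", "").split("|"):
        key, sep, value = pair.partition(":")
        if sep:  # skip fragments that are not in "key: value" form
            attributes[key.strip()] = value.strip()
    return Crystal(
        id=raw["_id"],
        chakra=raw.get("chakra", []),
        birth_month=raw.get("birth_month", ""),
        zodiac_sign=raw.get("zodiac_sign", ""),
        attributes=attributes,
    )

# Made-up sample document in the shape written by the loader.
crystal = parse_crystal_data({
    "_id": "rose_quartz.jpg",
    "text": "gemstone: Rose Quartz| origin: Brazil| maximum mohs hardness: 7",
    "chakra": ["Heart"],
    "birth_month": "October",
    "zodiac_sign": "Libra",
})
```

The view layer could then read values such as crystal.attributes["gemstone"] without knowing how the text property was encoded.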
Demo
Let’s see this in action. We will run the application with Flask. The complete code listed above (including all of the front end components) can be found in this GitHub repository.
To run the application, we will use the following command:
    flask run -p 8080
If it starts correctly, Flask should display the application name, address and port that it is bound to:
     * Serving Flask app 'crystalSearch'
     * Debug mode: off
    WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
     * Running on http://127.0.0.1:8080
    Press CTRL+C to quit
If we navigate to that address in a browser, we should see a simple web page with a search interface at the top, and three differently-colored dropdowns in the left navigation. If we select values for the dropdowns and click on the “Find Crystals” button, we should see crystals matching those values returned (Figure 2).
Figure 2 – Results for crystals matching the dropdown values where chakra is “Heart”, birth month is “October,” and zodiac sign is “Libra.”
Of course, we can also search with an image. Perhaps we have a picture of a crystal that we cannot identify. We can click on the “Choose File” button, select our image, and then click “Search” to see what the closest matches are. If our picture is of a black obsidian crystal, we will see results similar to Figure 3.
Figure 3 – Results for crystals matching our image of a black obsidian crystal.
Conclusion
In this article, we have demonstrated a use case for image-based search built with RAGStack and Astra DB. We walked through this unique use case, covering how to configure the development environment, load and query data using CLIP, and build an application that leverages image-based vector embeddings. We also showed how to use the Astra DB Data API to implement a simple product faceting approach using dropdowns.
As the world continues to embrace GenAI, we will surely see more and more creative use cases spanning multiple industries. Searching by images using CLIP is one of the ways in which we are pushing the boundaries of conventional data applications. With solutions like RAGStack and Astra DB, DataStax continues to help you build the next generation of applications.
Do you have an idea for a great use of GenAI? Pull down RAGStack and start using Astra DB with a free account today!
Original article: How to Build a Crystal Image Search App with Vector Search