Abstract
As AI continues to impact many types of data processing, vector embeddings have also emerged as a powerful tool for video analysis. This article delves into some of the capabilities of AI in analysing video data. We’ll explore how vector embeddings, created using Python and OpenAI CLIP, can be used to interpret and analyse video content.
The notebook file used in this article is available on GitHub.
Introduction
This article discusses the significance of vector embeddings in video analysis, offering a step-by-step guide to building these embeddings using a simple example.
Create a SingleStore Cloud account
A previous article showed the steps to create a free SingleStore Cloud account. We’ll use the Free Shared Tier and take the default names for the Workspace and Database.
Import the notebook
We’ll download the notebook from GitHub.
From the left navigation pane in the SingleStore cloud portal, we’ll select DEVELOP > Data Studio.
In the top right of the web page, we’ll select New Notebook > Import From File. We’ll use the wizard to locate and import the notebook we downloaded from GitHub.
Run the notebook
After checking that we are connected to our SingleStore workspace, we’ll run the cells one by one.
We’ll start by downloading an example video from GitHub and then playing the short video directly in the notebook. The example video is 142 seconds long.
Contrastive Language-Image Pretraining (CLIP) is a model by OpenAI that understands both images and text by associating them in a shared embedding space. We’ll load it as follows:
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device = device)
We’ll break down the video into individual frames, sampling one frame per second, as follows:
def extract_frames(video_path):
    frames = []
    cap = cv2.VideoCapture(video_path)
    frame_rate = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    total_seconds = total_frames / frame_rate
    target_frame_count = int(total_seconds)
    target_frame_index = 0
    for i in range(target_frame_count):
        cap.set(cv2.CAP_PROP_POS_FRAMES, target_frame_index)
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(frame)
        target_frame_index += int(frame_rate)
    cap.release()
    return frames
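To make the sampling logic easier to see, here is a minimal sketch that reproduces only the index arithmetic from `extract_frames`, without opening a video. The helper names `sample_frame_indices` and `sample_frame_indices_rounded` are hypothetical and are not part of the notebook. The second helper is an assumed alternative for videos whose frame rate is not a whole number, where truncating the frame rate lets the sampled positions drift away from one-second boundaries.

```python
def sample_frame_indices(total_frames, frame_rate):
    """Mirror the index arithmetic in extract_frames: one frame per second."""
    target_frame_count = int(total_frames / frame_rate)
    step = int(frame_rate)  # truncates fractional frame rates such as 29.97
    return [i * step for i in range(target_frame_count)]


def sample_frame_indices_rounded(total_frames, frame_rate):
    """Hypothetical variant: round each one-second position to the nearest frame."""
    target_frame_count = int(total_frames / frame_rate)
    return [round(i * frame_rate) for i in range(target_frame_count)]


# With a whole-number frame rate both helpers agree.
print(sample_frame_indices(300, 30.0))
print(sample_frame_indices_rounded(300, 30.0))

# With 29.97 fps the truncated step falls behind as the video progresses.
print(sample_frame_indices(3000, 29.97)[-1])
print(sample_frame_indices_rounded(3000, 29.97)[-1])
```

For a short clip with a whole-number frame rate, the two approaches select the same frames, so the difference only matters for longer videos with fractional frame rates.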
Next, we’ll generate a CLIP embedding for a single frame, a fixed-length vector that summarises the frame’s visual content:
def generate_embedding(frame):
    frame_tensor = preprocess(PILImage.fromarray(frame)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(frame_tensor).cpu().numpy()
    return embedding[0]
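One detail worth checking: OpenCV’s `VideoCapture.read()` returns frames in BGR channel order, whereas PIL and Matplotlib interpret arrays as RGB. The code above passes frames through unchanged. The following sketch shows one way the channels could be reordered before building the PIL image. The helper name `bgr_to_rgb` is hypothetical, and applying it would change the embeddings and therefore the similarity scores shown in this article.

```python
import numpy as np


def bgr_to_rgb(frame):
    """Return a copy of an H x W x 3 frame with the channel order reversed."""
    # Reversing the last axis swaps the blue and red channels;
    # ascontiguousarray makes a contiguous copy for downstream libraries.
    return np.ascontiguousarray(frame[:, :, ::-1])


# A 1 x 1 "frame" holding pure blue in BGR order.
bgr_frame = np.array([[[255, 0, 0]]], dtype=np.uint8)
rgb_frame = bgr_to_rgb(bgr_frame)
print(rgb_frame[0, 0])  # blue expressed in RGB order
```

In `generate_embedding`, this would mean calling `PILImage.fromarray(bgr_to_rgb(frame))`. The usual OpenCV equivalent is `cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)`.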
We’ll now extract the frames from a video, generate an embedding for each one, and collect the results in a pandas DataFrame for further analysis:
def store_frame_embedding_and_image(video_path):
    frames = extract_frames(video_path)
    data = [
        (i+1, generate_embedding(frame), frame)
        for i, frame in enumerate(tqdm(
            frames,
            desc = "Processing frames",
            bar_format = "{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}{postfix}]")
        )
    ]
    return pd.DataFrame(data, columns = ["frame_number", "embedding_data", "frame_data"])
Let’s examine the size characteristics of the data stored in the DataFrame:
embedding_lengths = df["embedding_data"].str.len()
frame_lengths = df["frame_data"].str.len()
# Calculate min and max lengths for embeddings and frames
min_embedding_length, max_embedding_length = embedding_lengths.min(), embedding_lengths.max()
min_frame_length, max_frame_length = frame_lengths.min(), frame_lengths.max()
# Print results
print(f"Min length of embedding vectors: {min_embedding_length}")
print(f"Max length of embedding vectors: {max_embedding_length}")
print(f"Min length of frame data vectors: {min_frame_length}")
print(f"Max length of frame data vectors: {max_frame_length}")
Example output:
Min length of embedding vectors: 512
Max length of embedding vectors: 512
Min length of frame data vectors: 1080
Max length of frame data vectors: 1080
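The value 1080 for the frame data is not a byte count. Each entry in that column is a multi-dimensional array, and its length is the size of the first dimension only. The sketch below illustrates this with NumPy, assuming a 1920 x 1080 colour frame. The width and channel count are assumptions; only the height of 1080 is confirmed by the output above.

```python
import numpy as np

# Hypothetical stand-ins for one row of the DataFrame.
embedding = np.zeros(512, dtype=np.float32)
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # height x width x channels (assumed)

print(len(embedding), embedding.shape)
print(len(frame), frame.shape)
print(frame.size)  # total number of values in the frame
```

So each frame holds far more data than the reported length suggests, which matters later when the frames are written to the database.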
Now, let’s compute how similar a query embedding is to each frame’s embedding in the DataFrame, using a dot product between the query and every frame:
def calculate_similarity(query_embedding, df):
    # Convert the query embedding to a tensor
    query_tensor = torch.tensor(query_embedding, dtype = torch.float32).to(device)
    # Convert the list of embeddings to a numpy array
    embeddings_np = np.array(df["embedding_data"].tolist())
    # Create a tensor from the numpy array
    embeddings_tensor = torch.tensor(embeddings_np, dtype = torch.float32).to(device)
    # Compute similarities using matrix multiplication
    similarities = torch.mm(embeddings_tensor, query_tensor.unsqueeze(1)).squeeze().tolist()
    return similarities
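The matrix multiplication above computes one dot product per frame. When both the frame embeddings and the query are scaled to unit length, that dot product equals the cosine similarity. Here is a minimal NumPy sketch of the same calculation on made-up 3-dimensional vectors. The function names `dot_similarities` and `normalise` and the sample values are illustrative only, and whether the notebook normalises the frame embeddings is not shown in this article.

```python
import numpy as np


def dot_similarities(embeddings, query):
    """Return the dot product of each row of embeddings with query."""
    return embeddings @ query


def normalise(vectors):
    """Scale a vector, or each row of a matrix, to unit length."""
    return vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)


frames = normalise(np.array([[1.0, 0.0, 0.0],
                             [1.0, 1.0, 0.0],
                             [0.0, 0.0, 2.0]]))
query = normalise(np.array([2.0, 0.0, 0.0]))

similarities = dot_similarities(frames, query)
print(similarities)
```

The first frame points in the same direction as the query, the second is at 45 degrees to it, and the third is orthogonal, so the scores fall from 1 towards 0 in that order.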
Now, we’ll encode a text query as a CLIP embedding, so that it can be compared with the frame embeddings:
def encode_text_query(query):
    # Tokenize the query text
    tokens = clip.tokenize([query]).to(device)
    # Compute text features using the pretrained model
    with torch.no_grad():
        text_features = model.encode_text(tokens)
    # Convert the tensor to a NumPy array and return it
    return text_features.cpu().numpy().flatten()
and enter the query string “Ultra-Fast Ingestion” when prompted:
query = input("Enter your query: ")
text_query_embedding = encode_text_query(query)
text_query_embedding /= np.linalg.norm(text_query_embedding)
text_similarities = calculate_similarity(text_query_embedding, df)
df["text_similarity"] = text_similarities
We’ll now get the top 5 best text matches:
# Retrieve the top 5 text matches based on similarity
top_text_matches = df.nlargest(5, "text_similarity")
print("Top 5 best matches:")
print(top_text_matches[["frame_number", "text_similarity"]].to_string(index = False))
Example output:
Top 5 best matches:
frame_number text_similarity
40 0.346581
39 0.345179
43 0.301896
53 0.298285
52 0.294805
We can also plot the frames:
def plot_frames(frames, frame_numbers):
    num_frames = len(frames)
    fig, axes = plt.subplots(1, num_frames, figsize = (15, 5))
    for ax, frame_data, frame_number in zip(axes, frames, frame_numbers):
        ax.imshow(frame_data)
        ax.set_title(f"Frame {frame_number}")
        ax.axis("off")
    plt.tight_layout()
    plt.show()

# Collect frame data and numbers for the top text matches
top_text_matches_indices = top_text_matches.index.tolist()
frames = [df.at[index, "frame_data"] for index in top_text_matches_indices]
frame_numbers = [df.at[index, "frame_number"] for index in top_text_matches_indices]
# Plot the frames
plot_frames(frames, frame_numbers)
Now, we’ll encode an image query as a CLIP embedding in the same way:
def encode_image_query(image):
    # Preprocess the image and add batch dimension
    image_tensor = preprocess(image).unsqueeze(0).to(device)
    # Extract features using the model
    with torch.no_grad():
        image_features = model.encode_image(image_tensor)
    # Convert features to NumPy array and flatten
    return image_features.cpu().numpy().flatten()
and download an example image to use for a query:
image_url = "https://github.com/VeryFatBoy/clip-demo/raw/main/thumbnails/1_what_makes_singlestore_unique.png"
response = requests.get(image_url)
if response.status_code == 200:
    display(Image(url = image_url))
    image_file = PILImage.open(BytesIO(response.content))
    image_query_embedding = encode_image_query(image_file)
    image_query_embedding /= np.linalg.norm(image_query_embedding)
    image_similarities = calculate_similarity(image_query_embedding, df)
    df["image_similarity"] = image_similarities
else:
    print("Failed to download the image, status code:", response.status_code)
We’ll now get the top 5 best image matches:
top_image_matches = df.nlargest(5, "image_similarity")
print("Top 5 best matches:")
print(top_image_matches[["frame_number", "image_similarity"]].to_string(index = False))
Example output:
Top 5 best matches:
frame_number image_similarity
7 0.877372
6 0.607051
9 0.591181
4 0.513214
15 0.502777
We can also plot the frames:
# Collect frame data and numbers for the top image matches
top_image_matches_indices = top_image_matches.index.tolist()
frames = [df.at[index, "frame_data"] for index in top_image_matches_indices]
frame_numbers = [df.at[index, "frame_number"] for index in top_image_matches_indices]
# Plot the frames
plot_frames(frames, frame_numbers)
Now let’s combine both text and image by using element-wise averaging:
combined_query_embedding = (text_query_embedding + image_query_embedding) / 2
combined_similarities = calculate_similarity(combined_query_embedding, df)
df["combined_similarity"] = combined_similarities
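Because the dot product is linear, scoring a frame against the averaged query gives the mean of its text score and its image score. Note also that the average of two unit vectors generally has a norm below 1, and the code above does not re-normalise it. That scales every score by the same factor but leaves the ranking unchanged. A small sketch with made-up vectors (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


frame = unit(rng.normal(size=8))
text_query = unit(rng.normal(size=8))
image_query = unit(rng.normal(size=8))

combined_query = (text_query + image_query) / 2

combined_score = frame @ combined_query
mean_of_scores = (frame @ text_query + frame @ image_query) / 2

print(np.isclose(combined_score, mean_of_scores))
print(np.linalg.norm(combined_query))  # at most 1 for unit-length inputs
```

This explains why the combined scores sit between the text and image scores for the same frame.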
We’ll now get the top 5 best combined matches:
top_combined_matches = df.nlargest(5, "combined_similarity")
print("Top 5 best matches:")
print(top_combined_matches[["frame_number", "combined_similarity"]].to_string(index = False))
Example output:
Top 5 best matches:
frame_number combined_similarity
7 0.516626
6 0.413325
9 0.380147
4 0.363691
3 0.355250
We can also plot the frames:
# Collect frame data and numbers for the top combined matches
top_combined_matches_indices = top_combined_matches.index.tolist()
frames = [df.at[index, "frame_data"] for index in top_combined_matches_indices]
frame_numbers = [df.at[index, "frame_number"] for index in top_combined_matches_indices]
# Plot the frames
plot_frames(frames, frame_numbers)
Next, we’ll store the data in SingleStore. First, we’ll prepare the data:
frames_df = df.copy()
frames_df.drop(
    columns = ["text_similarity", "image_similarity", "combined_similarity"],
    inplace = True
)
query_string = combined_query_embedding.copy()
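We’ll also need to perform a little data cleanup, converting the NumPy arrays to single-line, comma-separated strings so that they can be written to the database: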
def process_data(arr):
    return np.array2string(arr, separator = ",").replace("\n", "")
frames_df["embedding_data"] = frames_df["embedding_data"].apply(process_data)
frames_df["frame_data"] = frames_df["frame_data"].apply(process_data)
query_string = process_data(query_string)
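`np.array2string` wraps long output across several lines, which is why `process_data` removes the newline characters. It also abbreviates arrays with more than 1,000 elements by default, replacing the middle with `...`. A 512-element embedding is below that limit, but a full video frame is not, so it is worth checking what is actually stored in `frame_data`. The sketch below repeats `process_data` from above on small made-up arrays to show both behaviours:

```python
import numpy as np


def process_data(arr):
    # Same helper as in the article: array -> single-line, comma-separated string
    return np.array2string(arr, separator = ",").replace("\n", "")


short = process_data(np.array([0.1, 0.2, 0.3]))
print(short)

# 512 values: printed in full, with no line breaks left in the result
embedding_like = process_data(np.linspace(0.0, 1.0, 512, dtype = np.float32))
print("..." in embedding_like, "\n" in embedding_like)

# 2,000 values: above NumPy's default print threshold, so the middle is elided
large = process_data(np.arange(2000))
print("..." in large)
```

If the complete frame were needed in the database, one option (an assumption, not something the notebook does) would be to raise the limit with `np.array2string(arr, separator = ",", threshold = arr.size + 1)`, at the cost of a very large string per frame.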
We’ll check if we are running on the Free Shared Tier and, if not, create a new database:
shared_tier_check = %sql SHOW VARIABLES LIKE "is_shared_tier"
if not shared_tier_check or shared_tier_check[0][1] == "OFF":
    %sql DROP DATABASE IF EXISTS video_db;
    %sql CREATE DATABASE IF NOT EXISTS video_db;
and then get a connection to the database:
from sqlalchemy import create_engine
db_connection = create_engine(connection_url)
We’ll ensure a table is available to store the data:
DROP TABLE IF EXISTS frames;
CREATE TABLE IF NOT EXISTS frames (
    frame_number INT(10) UNSIGNED NOT NULL,
    embedding_data VECTOR(512) NOT NULL,
    frame_data TEXT,
    KEY(frame_number)
);
and then write the DataFrame to SingleStore:
frames_df.to_sql(
    "frames",
    con = db_connection,
    if_exists = "append",
    index = False,
    chunksize = 1000
)
We can read some data back from SingleStore:
SELECT frame_number,
SUBSTRING(embedding_data, 1, 50) AS embedding_data,
SUBSTRING(frame_data, 1, 50) AS frame_data
FROM frames
LIMIT 1;
We can also create an ANN index:
ALTER TABLE frames ADD VECTOR INDEX (embedding_data)
INDEX_OPTIONS '{ "index_type":"AUTO", "metric_type":"DOT_PRODUCT" }';
First, let’s run a query without using the ANN index:
SELECT frame_number,
embedding_data <*> :query_string AS similarity
FROM frames
ORDER BY similarity USE INDEX () DESC
LIMIT 5;
Example output:
frame_number similarity
7 0.5166257619857788
6 0.4133252203464508
9 0.38014671206474304
4 0.36369115114212036
3 0.35524997115135193
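For intuition, the exact (non-indexed) query above does the same work as scoring every stored vector against the query and keeping the five highest dot products. A minimal NumPy sketch of that exhaustive search follows. The function `top_k_by_dot_product` and the sample data are hypothetical, and this is not a description of how SingleStore executes the query internally.

```python
import numpy as np


def top_k_by_dot_product(embeddings, query, k = 5):
    """Return (row_index, score) pairs for the k highest dot products."""
    scores = embeddings @ query
    # argsort is ascending, so reverse it and keep the first k positions
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]


embeddings = np.array([[0.0, 1.0],
                       [0.6, 0.8],
                       [1.0, 0.0],
                       [0.8, 0.6]])
query = np.array([1.0, 0.0])

for row, score in top_k_by_dot_product(embeddings, query, k = 2):
    print(row, score)
```

An ANN index trades some exactness for speed, so on larger tables its results can differ from this kind of exhaustive scan.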
Now, we’ll run a query using the ANN index:
SELECT frame_number,
embedding_data <*> :query_string AS similarity
FROM frames
ORDER BY similarity DESC
LIMIT 5;
Example output:
frame_number similarity
7 0.5166257619857788
6 0.4133252203464508
9 0.38014671206474304
4 0.36369115114212036
3 0.35524997115135193
We can also use Python as an alternative:
sql_query = """
    SELECT frame_number, embedding_data, frame_data
    FROM frames
    ORDER BY embedding_data <*> %s DESC
    LIMIT 5;
"""
new_frames_df = pd.read_sql(
    sql_query,
    con = db_connection,
    params = (query_string,)
)
new_frames_df.head()
Since we are only storing a small quantity of data (142 rows), the results are identical whether we use the ANN index or not. Our results from querying the database agree with our earlier results for the combined query.
Summary
In this article, we applied vector embeddings for video analysis using Python and OpenAI’s CLIP model. We saw how to extract frames from a video, generate embeddings for each frame, and use these embeddings to perform similarity searches based on text and image queries. This allowed us to retrieve relevant video segments, making it a useful tool for video content analysis.
Today, many modern LLMs are offering multimodal capabilities and quite extensive support for audio, images, and video. However, the example in this article showed that it is possible to use freely available software to achieve some of the same capabilities.
Original article: Quick tip: Build Vector Embeddings for Video via Python Notebook & OpenAI CLIP