Image Similarity with PgVector

At FedGeoDay in April, I attended a workshop conducted by my friend Steve Pousty on the use of vector databases. Steve is a great teacher and I learned a lot that day. I’ve been on a bit of a journey since. I’ve gotten some project work with AI and I find vector databases to be the most intriguing part of an AI tech stack.

Vector databases are a component of many generative AI systems and support tasks such as semantic search and contextual retrieval via techniques such as retrieval-augmented generation (RAG). They are highly optimized for similarity search, as opposed to inverted-index search engines like Google, which are optimized to find exact matches. Vector databases, with their ability to search on semantic similarity, are useful for tasks such as finding documents that “talk around” a topic without explicitly mentioning it.

Vector databases get their name from the fact that they convert data into mathematical vectors which are embedded into a vector space. Vectors can be defined by a large number of dimensions. These are mathematical dimensions, similar in concept to the three spatial dimensions that humans can perceive, but vastly different in practice. The same is true for the “space” they are embedded into. Much like Tobler’s First Law in our 3D spatial world, similarity is measured by proximity in vector space. The closer two vectors are to each other, the more similar they are. That’s a bit of an oversimplification, but I’ll leave it there in the interest of progressing with the topic of this post.
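To make that idea concrete, here is a toy sketch (the vectors are invented and much shorter than real embeddings, which have hundreds or thousands of dimensions) showing distance as a stand-in for similarity:

import numpy as np

# Toy three-dimensional "embeddings" -- real ones are much longer
beach_a = np.array([0.9, 0.1, 0.2])
beach_b = np.array([0.8, 0.2, 0.1])
glacier = np.array([0.1, 0.9, 0.7])

# Euclidean (straight-line) distance: a smaller value means more similar
print(np.linalg.norm(beach_a - beach_b))  # ~0.17 -- the two beach vectors are close
print(np.linalg.norm(beach_a - glacier))  # ~1.24 -- beach and glacier are far apart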

One aspect of vector databases I find interesting is the ability to analyze embeddings created directly from binary data, such as images. This opens up the ability to say “Show me images that look like this one.” I’ll work through a very basic example that I put together, but first, I’ll cover some basics.

The model matters. Steve told me to start with pre-trained models and fine-tune as necessary. He was right. As recently as five years ago, the big push was for training data. Now, thousands of pre-trained models are available on Hugging Face – the GitHub for AI models. For this example, I used a pre-trained ResNet-50 model, mainly because it works well with the machine on which I am writing this post. It is a 50-layer convolutional neural network (CNN) that, in this case, has already been trained on a large image dataset.

The model also matters for querying. Query inputs are encoded into vectors before they are compared against the stored embeddings, so you have to use the same model at query time that you used to load the database. This is similar to the idea that you have to query a SQL database using the same language as the data. You can’t use string literals in Spanish to query data in English and expect to get useful results. The same is true for vector databases.

Moving on, I set up a basic workflow:

  1. Acquire the images
  2. Embed the images
  3. Store the embeddings
  4. Query the embeddings using an input image

I’ll walk through each step next, but what I am attempting to accomplish with this example is to find the top three most similar images to my input image from my vector database. In this case, I will be using pgvector, a vector database extension to PostgreSQL. I will be embedding images of beach, desert, glacier, prairie, and rainforest scenes.

Acquire the Images

In this case, I decided to scrape some images from Wikimedia Commons using the categories above as query strings. I wrote a Python script that issues the query and pulls the first ten images for each category. This screenshot shows some of the images that were pulled.

The code for this step is at the end of the post.

Generating and Storing Embeddings

The next step is to create embeddings from the downloaded images and store those embeddings in pgvector.

Embedding images involves converting them into a format that a computer can easily compare and analyze. This process starts with a pre-trained model like ResNet-50, which has already learned to recognize various features in images, such as shapes, colors, and textures. When an image is fed into the model, it breaks down the image into a vector representing these features. It is similar to creating a unique fingerprint for the image. This vector captures the essential characteristics of the image in a way that allows a computer to compare it with other images. By storing these vectors in a database, we can quickly find and retrieve images that are similar based on their embedded features, even if they don’t look exactly the same.
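The full embedding script is at the end of the post, but the core of it is small: load ResNet-50, strip its final classification layer, and treat the output of the last pooling layer (a 2048-element vector for ResNet-50) as the embedding. A minimal sketch, assuming torchvision and Pillow are installed and using a hypothetical image path:

import torch
from torchvision import models, transforms
from PIL import Image

# Load pre-trained ResNet-50 and remove the final classification layer,
# leaving the 2048-dimensional output of the average-pooling layer
model = models.resnet50(pretrained=True)
model.eval()
model = torch.nn.Sequential(*list(model.children())[:-1])

# Standard ImageNet preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("sample_images/beach_1.jpg").convert("RGB")  # hypothetical path
with torch.no_grad():
    embedding = model(preprocess(image).unsqueeze(0)).squeeze()  # shape: (2048,)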

The first step here is to create a table to hold the embeddings. This assumes you have installed the pgvector extension. Here is the DDL for the table that I used:

-- public.image_embeddings definition

-- Drop table

DROP TABLE public.image_embeddings;

CREATE TABLE public.image_embeddings (
    id serial4 NOT NULL,
    image_path text NULL,
    embedding public.vector NULL,
    CONSTRAINT image_embeddings_pkey PRIMARY KEY (id)
);
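If the extension has not been enabled in the target database yet, that takes a one-time statement (assuming your role has the privileges to create extensions). A quick way to do it from Python, using the same placeholder connection settings as the scripts at the end of the post:

import psycopg2

# Placeholder connection details -- substitute your own
conn = psycopg2.connect(dbname="database", user="user", password="password",
                        host="host", port="port")
conn.autocommit = True
with conn.cursor() as cur:
    # One-time setup: enable the pgvector extension in this database
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.close()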

Next, I wrote a Python script to create embeddings for each image I downloaded previously. As mentioned above, this code uses the ResNet-50 model, which can be found on Hugging Face. The workflow is simple: create the embeddings in Python and then execute an INSERT statement using psycopg2.

The code for this step is at the end of this post.

Querying Embeddings Using an Image

So far, the workflow has been fairly straightforward: get the data, encode the data, store the data.

Now I’m going to query the data. I’ll take another image, encode it using ResNet-50, and run a SELECT statement through psycopg2 to find the three most similar images from those I downloaded in the first step. I’ll be looking for desert photos by querying with this image:

As I said previously, the model matters. I have to use the same model to encode the input image. The script does that and then passes the resulting embedding through to pgvector. In this case, the query uses the ‘<->’ operator to calculate the Euclidean (straight-line) distance between the input embedding and the stored embeddings. Using the ‘<=>’ operator instead would cause the query to use cosine distance (1 minus cosine similarity).
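To make the distinction concrete, here is a simplified sketch of the two query shapes (the full query script is at the end of the post; each %s placeholder would be filled with the query embedding rendered as a vector literal such as str(query_embedding.tolist())):

# '<->' computes Euclidean (straight-line / L2) distance -- what this post uses
l2_query = """
    SELECT image_path, embedding <-> %s AS distance
    FROM image_embeddings
    ORDER BY embedding <-> %s
    LIMIT 3;
"""

# '<=>' computes cosine distance (1 - cosine similarity)
cosine_query = """
    SELECT image_path, embedding <=> %s AS distance
    FROM image_embeddings
    ORDER BY embedding <=> %s
    LIMIT 3;
"""

Either string can be passed to cursor.execute() with the embedding literal supplied for both placeholders.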

With the sample data, the three most similar images were:

Those results were pretty good, and you can see how each image shares aspects of similarity with the query image. Similarity is based on features such as edges, textures, and patterns, which the network extracts layer by layer. With its 50 layers, ResNet-50 captures enough of these features to support fairly fine-grained similarity search.

The code for this step is at the end of this post.

Wrapping Up

Vector databases are an incredibly interesting technology to me and I am continuing my journey with them. This post shows some of my first steps, but I’m excited about learning and doing more. Often, the top-level technology that gets the hype, such as AI, is supported by some compelling technology that gets less attention.

I’m particularly intrigued by pgvector because of its ability to put vectors alongside scalar data and enable hybrid search scenarios in a way that is familiar. I’m looking forward to doing more.

Code For This Post

Acquiring Images:

import requests
from bs4 import BeautifulSoup
import os

qry = "glacier"  # wikimedia commons query

# Define the search URL
search_url = f"https://commons.wikimedia.org/w/index.php?search={qry}&title=Special:MediaSearch&go=Go&type=image"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# Function to download an image
def download_image(image_url, save_path):
    try:
        response = requests.get(image_url, headers=headers, stream=True)
        response.raise_for_status()
        with open(save_path, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
        print(f"Downloaded {save_path}")
    except Exception as e:
        print(f"Failed to download {image_url}. Error: {e}")

# Function to get image URLs from Wikimedia Commons search results
def get_image_urls(search_url, max_images=10):
    response = requests.get(search_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    image_tags = soup.find_all('img', {'class': 'sd-image'}, limit=max_images)
    image_urls = [img['src'] for img in image_tags]
    return image_urls

# Directory to save images
save_dir = "sample_images"
os.makedirs(save_dir, exist_ok=True)

# Get image URLs
image_urls = get_image_urls(search_url)

# Download images
for idx, image_url in enumerate(image_urls):
    save_path = os.path.join(save_dir, f"{qry}_{idx+1}.jpg")
    download_image(image_url, save_path)

Generating and Storing Embeddings:

import os
import psycopg2
from PIL import Image
import torch
from torchvision import models, transforms

# Database configuration
DB_NAME = "database"
DB_USER = "user"
DB_PASSWORD = "password"
DB_HOST = "host"
DB_PORT = "port"

# Folder containing images
IMAGE_FOLDER = "/path/to/sample_images"

# Load pre-trained ResNet model
model = models.resnet50(pretrained=True)
model.eval()  # Set to evaluation mode

# Remove the final classification layer to get embeddings
model = torch.nn.Sequential(*list(model.children())[:-1])

# Image preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def get_image_embedding(image_path):
    image = Image.open(image_path).convert("RGB")
    image_tensor = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        embedding = model(image_tensor)
    return embedding.squeeze().numpy()

def save_embedding_to_db(image_name, embedding):
    conn = psycopg2.connect(
        dbname=DB_NAME, user=DB_USER, password=DB_PASSWORD, host=DB_HOST, port=DB_PORT
    )
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO image_embeddings (image_path, embedding) VALUES (%s, %s)",
        (image_name, embedding.tolist())
    )
    conn.commit()
    cursor.close()
    conn.close()

def process_images(image_folder):
    for image_name in os.listdir(image_folder):
        image_path = os.path.join(image_folder, image_name)
        if os.path.isfile(image_path) and image_name.lower().endswith(('.png', '.jpg', '.jpeg')):
            embedding = get_image_embedding(image_path)
            save_embedding_to_db(image_path, embedding)

if __name__ == "__main__":
    process_images(IMAGE_FOLDER)

Querying Images:

import os
import psycopg2
from PIL import Image
import torch
from torchvision import models, transforms

# Database configuration
DB_NAME = "database"
DB_USER = "user"
DB_PASSWORD = "password"
DB_HOST = "host"
DB_PORT = "port"

# Load pre-trained ResNet model
model = models.resnet50(pretrained=True)
model.eval()  # Set to evaluation mode

# Remove the final classification layer to get embeddings
model = torch.nn.Sequential(*list(model.children())[:-1])

# Image preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def get_image_embedding(image_path):
    image = Image.open(image_path).convert("RGB")
    image_tensor = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        embedding = model(image_tensor)
    return embedding.squeeze().numpy()

def find_most_similar_images(query_embedding, top_n=3):
    conn = psycopg2.connect(
        dbname=DB_NAME, user=DB_USER, password=DB_PASSWORD, host=DB_HOST, port=DB_PORT
    )
    cursor = conn.cursor()
    # Convert query_embedding to a list so it renders as a pgvector literal in the query
    query_embedding_list = query_embedding.tolist()
    qry = f"SELECT image_path, embedding <-> '{query_embedding_list}' AS distance FROM image_embeddings ORDER BY embedding <-> '{query_embedding_list}' LIMIT {top_n};"
    #print(qry)
    # Execute the similarity search query
    cursor.execute(qry)
    similar_images = cursor.fetchall()
    cursor.close()
    conn.close()
    return similar_images

def main(query_image_path):
    query_embedding = get_image_embedding(query_image_path)
    similar_images = find_most_similar_images(query_embedding)
    print("Most similar images:")
    for image_path, similarity in similar_images:
        print(f"Image: {image_path}, Distance: {similarity}")

if __name__ == "__main__":
    # Replace with the path to the query image
    query_image_path = "path/to/your/query/image.jpg"
    main(query_image_path)