The Visual Search Engine GUI
Before we start, I encourage you to clone the repository from https://github.com/frankcholula/cvpr and refer to the code there.
A streamlined version of this project is currently live on Streamlit at visual-search.streamlit.app. I'll publish the complete version after the assignment deadline to comply with the University of Surrey's academic integrity policy. I’m a big fan of Grant Sanderson's emphasis on visual engagement and intuition in learning, so I hope this project becomes a valuable learning resource and reference for future students.
If you're curious about how I built this project, I'll be releasing a separate guide and working with https://surreycompsoc.org/ to host a workshop. For the graders, bear with me and I'll try my best to make this lengthy report as painless to read as possible. You'll also find comprehensive unit tests for the code here.
Digital image collections are traditionally searched using textual queries (e.g., Google). However, searching based on visual appearance is often desirable (e.g., Google Lens), for instance when recommending visually similar products in online shopping. Visual search systems address this need by allowing users to search for images based on their visual characteristics rather than relying solely on textual descriptions.
In the context of image search, we must first define what makes two images "similar." Images are considered similar based on a combination of visual content and semantic content.
Google's Search by image option
Amazon's Similar items recommendation
Visual content of an image includes color, texture, and shape, whereas semantic content focuses on objects, context, and the relationships between these objects.
More specifically, we have established three levels of image similarity in lecture 7 [2]:
As we progress through these levels, we observe an increase in semantic content that is, in essence, more human-like. The semantic gap refers to the discrepancy between the low-level features extractable through computer vision and the high-level similarity definitions desired in content-based image retrieval.
graph
A[Level 1: Low-level visual features]
A --> B[Level 2: Similar semantic content]
B --> C[Level 3: High-level concepts]
In this report, we'll progress from extracting low-level visual features to exploring more complex semantic content. We'll then use these extracted descriptors to construct a visual search system, comparing various distance metrics. Finally, we'll evaluate our system's performance using the images it returns. This approach aims to bridge the semantic gap between computer and human perception of image similarity, offering insights into how we can improve our visual search system.
Description of the Image Database
The provided legend from the MSRC-v2 dataset
We'll be using the MSRC-v2 Image Database, provided by Microsoft Research Cambridge [6]. This dataset contains 591 images across 20 classes. While each image belongs to a single class, it can feature multiple objects. Although the dataset doesn't come with pre-assigned labels, we can generate them using the provided legend and ground truth images, which are properly segmented versions of the original images.
Ground truth image of 16_14_s.bmp. It belongs to class 16 with the labels ["dog", "road"].
A more complicated image, 20_14_s.bmp, of class 20 with the labels ["water", "boat", "tree", "sky", "building", "face", "body"].
Conveniently, I've created a labels.json file containing each image's class and labels. This was done by detecting pixel colors in the corresponding ground truth images. This file will be useful in evaluating our visual search system's performance. You can reference the code here [1]. The gist of the logic is implemented in the ImageLabeler class here [1].
{
  "17_15_s.bmp": {
    "labels": [
      "building",
      "road",
      "sky"
    ],
    "class": "17"
  }
}
The resulting JSON representation of one image is shown above. For instance, the image 17_15_s.bmp belongs to class 17 and contains the labels [building, road, sky]. Essentially, you can think of a class as a superset of the labels.
If you'd like to examine the full labels.json file, you can find it in my GitHub repository.
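To make the labeling idea concrete, here is a minimal sketch of how labels could be derived from the colors in a ground truth image. It is not the ImageLabeler class from the repository: the function names, the three-entry LEGEND, and the rule that the class is the file name prefix (inferred from examples such as 17_15_s.bmp) are all assumptions made for illustration. The real color-to-label mapping comes from the legend shipped with the dataset.

```python
import json

# Hypothetical legend: RGB colour -> label. Placeholder entries only; the
# actual mapping should be read from the legend provided with MSRC-v2.
LEGEND = {
    (128, 0, 0): "building",
    (128, 64, 128): "road",
    (128, 128, 128): "sky",
}


def labels_from_ground_truth(pixels, legend=LEGEND):
    """Return the sorted labels whose legend colour appears in `pixels`.

    `pixels` is any iterable of (R, G, B) tuples, e.g. a flattened
    ground truth image. Colours missing from the legend are ignored.
    """
    found = {legend[p] for p in set(pixels) if p in legend}
    return sorted(found)


def class_from_filename(filename):
    """Assumption: the prefix before the first underscore is the class."""
    return filename.split("_")[0]


def build_entry(filename, pixels):
    """Build one labels.json-style entry for a single image."""
    return {
        filename: {
            "labels": labels_from_ground_truth(pixels),
            "class": class_from_filename(filename),
        }
    }


# Toy "ground truth" of four pixels; the black pixel is not in the legend.
toy_pixels = [(128, 0, 0), (128, 64, 128), (128, 128, 128), (0, 0, 0)]
entry = build_entry("17_15_s.bmp", toy_pixels)
print(json.dumps(entry, indent=2))
```

With a real image you would first load the ground truth file into a list of RGB tuples using an imaging library of your choice, then pass that list to build_entry.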
graph LR
subgraph B[Visual Search System]
B1[Extract Descriptor]
B2[Compare with other Descriptors]
B3[Choose **N** Most Similar Descriptors]
B4[Map Descriptors back to Images]
end
A[Input Image] --> B1
B1 --> B2
B2 --> B3
B3 --> B4
B4 --> C1[Similar Image 1]
B4 --> C2[Similar Image 2]
B4 --> C3[Similar Image 3]
B4 --> C4[...]
B4 --> CN[Similar Image N]
As illustrated in the pipeline above, the Visual Search System (VSS) accepts an input image and generates N similar images as output.
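The four stages of the pipeline can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the project's implementation: the descriptor is a made-up red-channel histogram, the distance is plain Euclidean (L2), and the names extract_descriptor and search are hypothetical.

```python
import math


def extract_descriptor(pixels, bins=4):
    """Toy descriptor: a normalised histogram of the red channel only."""
    hist = [0] * bins
    for r, _g, _b in pixels:
        hist[min(r * bins // 256, bins - 1)] += 1
    total = sum(hist) or 1  # avoid dividing by zero on an empty image
    return [count / total for count in hist]


def search(query_name, descriptors, n=3):
    """Return the names of the n images closest to the query (L2 distance)."""
    query = descriptors[query_name]
    # Compare the query descriptor with every other descriptor.
    distances = [
        (math.dist(query, desc), name)
        for name, desc in descriptors.items()
        if name != query_name
    ]
    # Keep the n smallest distances and map them back to image names.
    return [name for _dist, name in sorted(distances)[:n]]


# Three tiny "images", each a list of (R, G, B) pixels.
images = {
    "a.bmp": [(250, 0, 0), (240, 0, 0)],
    "b.bmp": [(245, 0, 0), (10, 0, 0)],
    "c.bmp": [(5, 0, 0), (20, 0, 0)],
}
descriptors = {name: extract_descriptor(px) for name, px in images.items()}
results = search("a.bmp", descriptors, n=2)
print(results)
```

Keeping the descriptor extraction separate from the search step means either one can be swapped out (a different descriptor, a different distance metric) without touching the other.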
We can evaluate our system by examining either the classes or the labels of the retrieved images. While we'll primarily assess the system's performance using a class-based approach, it's important to also consider the label-based method. The label-based approach provides more in-depth information, especially given that the class designations are somewhat arbitrary for the MSRC image set we're using (as you will see in the callout in the conclusion).
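The difference between the two evaluation views can be illustrated with a small sketch. The entries below are toy data in the labels.json layout, and both function names are hypothetical. Using Jaccard overlap as the label-based score is my assumption for illustration; other label-based measures would fit the same structure.

```python
def class_precision(query, retrieved, labels_db):
    """Fraction of retrieved images that share the query's class."""
    if not retrieved:
        return 0.0
    query_class = labels_db[query]["class"]
    hits = sum(labels_db[name]["class"] == query_class for name in retrieved)
    return hits / len(retrieved)


def mean_label_overlap(query, retrieved, labels_db):
    """Mean Jaccard overlap between the query's labels and each result's."""
    if not retrieved:
        return 0.0
    query_labels = set(labels_db[query]["labels"])
    scores = []
    for name in retrieved:
        labels = set(labels_db[name]["labels"])
        union = query_labels | labels
        scores.append(len(query_labels & labels) / len(union) if union else 0.0)
    return sum(scores) / len(scores)


# Toy entries in the same shape as labels.json (file names are made up).
labels_db = {
    "q.bmp": {"labels": ["building", "road", "sky"], "class": "17"},
    "r1.bmp": {"labels": ["building", "sky"], "class": "17"},
    "r2.bmp": {"labels": ["dog", "road"], "class": "16"},
}
retrieved = ["r1.bmp", "r2.bmp"]
print(class_precision("q.bmp", retrieved, labels_db))
print(mean_label_overlap("q.bmp", retrieved, labels_db))
```

Here r2.bmp counts as a complete miss under the class-based score, yet it still earns partial credit under the label-based score because it shares the "road" label with the query.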