Introduction 👋

The Visual Search Engine GUI

Before we start, I encourage you to clone the repository from https://github.com/frankcholula/cvpr and reference the code there.

A streamlined version of this project is currently live on Streamlit at visual-search.streamlit.app. I'll publish the complete version after the assignment deadline to comply with the University of Surrey's academic integrity policy. I’m a big fan of Grant Sanderson's emphasis on visual engagement and intuition in learning, so I hope this project becomes a valuable learning resource and reference for future students.

If you're curious about how I built this project, I'll be releasing a separate guide and working with https://surreycompsoc.org/ to host a workshop. For the graders, bear with me and I'll try my best to make this lengthy report as painless to read as possible. As for the code, you'll also find comprehensive unit tests here.

Abstract 💭

Digital image collections are traditionally searched using textual queries (e.g., Google). However, searching based on visual appearance is often desirable (e.g., Google Lens)—for instance, when recommending visually-similar products in online shopping. Visual search systems address this need by allowing users to search for images based on their visual characteristics rather than relying solely on textual descriptions.

In the context of image search, we must first define what makes two images "similar." Images are considered similar based on a combination of visual content and semantic content.

Google’s Search by image option

Amazon’s Similar items recommendation

Visual content of an image includes color, texture, and shape, whereas semantic content focuses on objects, context, and the relationships between these objects.

More specifically, we have established three levels of image similarity in Lecture 7 [2]:

graph TD
    A[Level 1: Low-level visual features]
    A --> B[Level 2: Similar semantic content]
    B --> C[Level 3: High-level concepts]

As we progress through these levels, we observe an increase in semantic content that is, in essence, more human-like. The semantic gap refers to the discrepancy between the low-level features extractable through computer vision and the high-level similarity definitions desired in content-based image retrieval.

In this report, we'll progress from extracting low-level visual features to exploring more complex semantic content. We'll then use these extracted descriptors to construct a visual search system, comparing various distance metrics. Finally, we'll evaluate our system's performance based on the classes and labels of the retrieved images. This approach aims to bridge the semantic gap between computer and human perception of image similarity, offering insights into how we can improve our visual search system.

About the Dataset

ClickMe.html description of the Image Database

The provided legend from the MSRC-v2 dataset

We'll be using the MSRC-v2 Image Database, provided by Microsoft Research Cambridge [6]. This dataset contains 591 images across 20 classes. While each image belongs to a single class, it can feature multiple objects. Although the dataset doesn't come with pre-assigned labels, we can generate them using the provided legend and ground truth images, which are properly segmented versions of the original images.

Ground truth image of 16_14_s.bmp. It belongs to class 16 with the labels ["dog", "road"]

A more complicated image, 20_14_s.bmp, of class 20 with the labels ["water", "boat", "tree", "sky", "building", "face", "body"]

Conveniently, I've created a labels.json file containing each image's class and labels. This was done by detecting pixel colors in the corresponding ground truth images, and the file will come in handy when evaluating our visual search system's performance. You can reference the code here [1]; the gist of the logic is implemented in the ImageLabeler class [1], and a simplified sketch of the idea follows the JSON example below.

{
    "17_15_s.bmp": {
        "labels": [
            "building",
            "road",
            "sky"
        ],
        "class": "17"
    }
}

The resulting JSON representation of one image is shown above.

For instance, the image 17_15_s.bmp belongs to class 17 and contains the labels [building, road, sky]. Essentially, you can think of a class as a superset of the labels.

If you'd like to examine the full labels.json file, you can find it in my GitHub repository.
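
If you're curious about the gist of that labeling logic, here is a minimal sketch of the idea rather than the actual ImageLabeler implementation: walk over the ground truth segmentations, collect each image's unique pixel colors, map them to labels through the legend, and read the class off the filename prefix. The legend colors, directory layout, and filename pattern below are illustrative assumptions, so treat this as a rough outline and refer to the repository [1] for the real code.

# A minimal sketch of the labeling idea, not the actual ImageLabeler implementation.
import json
from pathlib import Path

import numpy as np
from PIL import Image

# Hypothetical (color -> label) legend; the real mapping comes from the provided MSRC-v2 legend.
LEGEND = {
    (128, 0, 0): "building",
    (128, 64, 128): "road",
    (128, 128, 128): "sky",
}


def label_image(gt_path: Path) -> dict:
    """Derive one image's class and labels from its ground truth segmentation."""
    pixels = np.array(Image.open(gt_path).convert("RGB")).reshape(-1, 3)
    unique_colors = {tuple(int(c) for c in color) for color in np.unique(pixels, axis=0)}
    labels = sorted(LEGEND[color] for color in unique_colors if color in LEGEND)
    # Filenames start with the class number, e.g. 17_15_s.bmp belongs to class 17.
    image_class = gt_path.name.split("_")[0]
    return {"labels": labels, "class": image_class}


if __name__ == "__main__":
    # Assumed directory and filename pattern for the ground truth images.
    ground_truth_dir = Path("MSRC_ObjCategImageDatabase_v2/GroundTruth")
    labels = {
        gt.name.replace("_GT", ""): label_image(gt)
        for gt in sorted(ground_truth_dir.glob("*_GT.bmp"))
    }
    Path("labels.json").write_text(json.dumps(labels, indent=4))

Running this over the whole ground truth folder produces a dictionary in the same shape as the labels.json snippet above.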

Defining the Evaluation Methodology 🔬

graph LR
		subgraph B[Visual Search System]
			B1[Extract Descriptor]
			B2[Compare with other Descriptors]
			B3[Choose **N** Most Similar Descriptors]
			B4[Map Descriptors back to Images]
		end
    A[Input Image] --> B1
    B1 --> B2
    B2 --> B3
    B3 --> B4
    B4 --> C1[Similar Image 1]
    B4 --> C2[Similar Image 2]
    B4 --> C3[Similar Image 3]
    B4 --> C4[...]
    B4 --> CN[Similar Image N]

As illustrated in the pipeline above, the Visual Search System (VSS) accepts an input image and generates N similar images as output.
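
To make the pipeline concrete, here is a minimal sketch of the retrieval step, assuming the descriptors have already been extracted into a dictionary of NumPy vectors. Euclidean (L2) distance stands in for the various metrics we'll compare later, and the function and variable names are illustrative rather than taken from the actual implementation.

# A minimal sketch of the retrieval step; descriptor extraction is assumed to be done already.
import numpy as np


def retrieve_similar(query_name: str, descriptors: dict, n: int = 5) -> list:
    """Return the filenames of the N images whose descriptors are closest to the query's."""
    query = descriptors[query_name]
    distances = {
        name: np.linalg.norm(query - descriptor)  # L2 distance between descriptor vectors
        for name, descriptor in descriptors.items()
        if name != query_name                     # skip the query image itself
    }
    # Sort ascending by distance and keep the N closest images.
    return sorted(distances, key=distances.get)[:n]


# Usage with dummy 64-dimensional descriptors standing in for real ones.
rng = np.random.default_rng(42)
descriptors = {f"{cls}_1_s.bmp": rng.random(64) for cls in range(1, 21)}
print(retrieve_similar("16_1_s.bmp", descriptors, n=3))

Swapping np.linalg.norm for another distance function is all it takes to experiment with different metrics, which is exactly what we'll do later on.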

We can evaluate our system by examining either the classes or the labels of the retrieved images. While we'll primarily assess the system's performance using a class-based approach, it's important to also consider the label-based method, which provides more in-depth information, especially since the class designations are somewhat arbitrary for the MSRC image set we're using (as you will see in the callout in the conclusion).

  1. In the class-based approach, the goal is to retrieve images that share the same class as the input image.