🔍 How to Explore and Visualize ML-Data for Object Detection in Images

Use Ultralytics YOLOv8 detections and ViT embeddings to visualize and navigate the data in Renumics Spotlight 1.6.0

Published in ITNEXT · 6 min read · Jan 2, 2024

The need to understand ML data in depth is increasingly recognized. However, it is still not widely practiced in computer vision because reviewing large datasets takes considerable effort, and simply clicking through images does not give a good understanding of a dataset.

This holds especially for object detection, the computer vision task of locating objects within images by bounding boxes. It is not just about recognizing objects, but also about understanding their context, size, and relationship to other elements in the scene. A good overview of the class distribution, the variety of object sizes, and the common contexts in which classes appear therefore helps to find error patterns when evaluating and debugging a trained model, and makes the selection of additional training data more targeted.

We suggest the following approaches:

Bring structure to your data using enrichments from pre-trained or foundation models: for example, create image embeddings and apply dimensionality reduction techniques like t-SNE or UMAP to generate similarity maps that make the data easier to navigate. Alternatively, use detections from pre-trained models to extract context.
Use a visualization tool that integrates this structure with statistics and review functionality for the raw data.

This article offers a tutorial on how to create an interactive visualization for object detection using Renumics Spotlight. As an example, we build a visualization for a detector for people in images:

The visualization includes a similarity map, filters, and statistics to navigate the data.
It also allows a detailed review of each image with its ground truth and the detections of Ultralytics YOLOv8.

The goal visualization of this article in Renumics Spotlight. Source: Created by the author.

Download images with persons from COCO Dataset

First, install the required packages:

!pip install fiftyone ultralytics renumics-spotlight

With the resumable download function from FiftyOne you can download images from the COCO dataset [1] with handy parameters to include only 1,000 images that contain one or more persons:

import pandas as pd
import numpy as np

import fiftyone.zoo as foz


# download 1000 images from the COCO dataset with persons
dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="validation",
    label_types=[
        "detections",
    ],
    classes=["person"],
    max_samples=1000,
    dataset_name="coco-2017-person-1k-validations",
)

Now you can use:

def xywh_to_xyxyn(bbox):
    """convert from xywh to xyxyn format"""
    return [bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]]


row = []
for i, sample in enumerate(dataset):
    labels = [detection.label for detection in sample.ground_truth.detections]
    bboxs = [
        xywh_to_xyxyn(detection.bounding_box)
        for detection in sample.ground_truth.detections
    ]
    bboxs_persons = [bbox for bbox, label in zip(bboxs, labels) if label == "person"]
    row.append([sample.filepath, labels, bboxs, bboxs_persons])

df = pd.DataFrame(row, columns=["filepath", "categories", "bboxs", "bboxs_persons"])
df["major_category"] = df["categories"].apply(
    lambda x: max(set(x) - set(["person"]), key=x.count)
    if len(set(x)) > 1
    else "only person"
)

to prepare the data as a Pandas DataFrame with columns for the file path, the categories of the bounding boxes, the bounding boxes themselves, the bounding boxes containing persons, and the major category (besides persons), which specifies the context of the persons in the image.
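To make the two transformations concrete, here is a minimal sketch on toy values. It assumes, as in FiftyOne, that boxes are relative [x, y, width, height] lists. The helper major_category is a hypothetical, named rewrite of the lambda used above, not part of any library. Unlike the lambda, it also returns the dominant label for an image that contains no person at all.

```python
from collections import Counter


def xywh_to_xyxyn(bbox):
    """Convert a relative [x, y, width, height] box to [x_min, y_min, x_max, y_max]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def major_category(labels):
    """Most frequent label other than "person", or "only person" if there is none."""
    others = Counter(label for label in labels if label != "person")
    return others.most_common(1)[0][0] if others else "only person"


# toy inputs, not taken from COCO
example_box = xywh_to_xyxyn([0.25, 0.5, 0.5, 0.25])
example_context = major_category(["person", "bus", "person", "bus", "car"])
example_alone = major_category(["person", "person"])
print(example_box, example_context, example_alone)
# prints: [0.25, 0.5, 0.75, 0.75] bus only person
```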

Now you can visualize it with Spotlight:

from renumics import spotlight
spotlight.show(df)

You can use the add view button in the inspector view, and select bboxs_persons together with filepath in a BoundingBox view to display the corresponding bounding boxes with the images:

Enrich the data with Embeddings

To bring structure to the data, you can use image embeddings (dense vector representations) from foundation models. ViT embeddings of the whole image, combined with dimensionality reduction techniques like UMAP or t-SNE, provide a 2D similarity map of the images. Additionally, the output of a pre-trained object detector can be used to structure the data by the size or number of contained objects. Since the COCO dataset already provides this information, we use it directly.
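As a rough illustration of structuring by size and count, the following sketch derives the number of persons and the largest relative person area from boxes in [x_min, y_min, x_max, y_max] format with coordinates between 0 and 1. The function names box_area and person_stats are assumptions made for this example and are not part of Spotlight or FiftyOne.

```python
def box_area(bbox):
    """Relative area of an [x_min, y_min, x_max, y_max] box (fraction of the image)."""
    x_min, y_min, x_max, y_max = bbox
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def person_stats(bboxs_persons):
    """Count the person boxes and report the largest relative area (0.0 if none)."""
    areas = [box_area(bbox) for bbox in bboxs_persons]
    return {
        "num_persons": len(areas),
        "max_person_area": max(areas, default=0.0),
    }


# toy boxes: one covering a quarter of the image, one covering an eighth
stats = person_stats([[0.0, 0.0, 0.5, 0.5], [0.5, 0.5, 0.75, 1.0]])
print(stats)
# prints: {'num_persons': 2, 'max_person_area': 0.25}
```

Applied to the DataFrame, something like df["num_persons"] = df["bboxs_persons"].apply(len) would add the count as a column that can then be used for filtering or coloring.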

Spotlight has integrated support for the google/vit-base-patch16-224-in21k Vision Transformer (ViT) model [2] and for UMAP. Both are applied automatically when filepath is passed in the embed argument:

spotlight.show(df, embed=["filepath"])

Spotlight will calculate the embeddings and apply UMAP to show the result in the similarity map. The color encodes the major category. Now you can use the similarity map to navigate through the data:

Results of pre-trained YOLOv8

Ultralytics YOLOv8 is a state-of-the-art object detection model for fast identification of objects. It is designed for quick image processing, which makes it suitable for real-time detection tasks and for processing large amounts of data without much waiting time.

You can start with loading the pre-trained model:

from ultralytics import YOLO
detection_model = YOLO("yolov8n.pt")

and do the detections:

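detections = []
for filepath in df["filepath"].tolist():
    # run YOLOv8 on one image and take the result for that image
    detection = detection_model(filepath)[0]
    # keep only the boxes whose predicted class is "person"
    person_boxes = [
        box for box in detection.boxes if detection.names[int(box.cls)] == "person"
    ]
    detections.append(
        {
            # all detected boxes as normalized [x_min, y_min, x_max, y_max]
            "yolo_bboxs": [np.array(box.xyxyn.tolist())[0] for box in detection.boxes],
            # mean confidence over the person boxes; NaN if no person was detected
            "yolo_conf_persons": np.mean(
                [np.array(box.conf.tolist())[0] for box in person_boxes]
            )
            if person_boxes
            else np.nan,
            # person boxes only, in the same normalized format
            "yolo_bboxs_persons": [
                np.array(box.xyxyn.tolist())[0] for box in person_boxes
            ],
            # class name of every detected box
            "yolo_categories": np.array(
                [np.array(detection.names[int(box.cls)]) for box in detection.boxes]
            ),
        }
    )
df_yolo = pd.DataFrame(detections)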

On a GeForce RTX 4070 Ti with 12 GB, this process takes less than 20 seconds. Now you can include the results in the DataFrame and visualize it with Spotlight:

df_merged = pd.concat([df, df_yolo], axis=1)
spotlight.show(df_merged, embed=["filepath"])

Spotlight will again calculate the embeddings and apply UMAP to show the result in the similarity map. This time, however, you can select the model's confidence for the detected persons (yolo_conf_persons), for example as the color of the points, and use the similarity map to navigate through clusters with low confidence. These clusters contain images that are similar to each other and on which the model is unsure.
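If you prefer a list over the interactive view, the same idea can be sketched in a few lines: rank the images by their mean person confidence and look at the lowest ones first. The function rank_by_confidence and the toy records are hypothetical. The sketch assumes that yolo_conf_persons is NaN when no person was detected, which is what np.mean returns for an empty list, and treats those images as the least confident.

```python
import math


def rank_by_confidence(records, k=3):
    """Return the k file paths with the lowest mean person confidence.

    Records whose confidence is NaN (no person detected) are ranked first.
    """

    def sort_key(record):
        conf = record["yolo_conf_persons"]
        return -1.0 if math.isnan(conf) else conf

    return [record["filepath"] for record in sorted(records, key=sort_key)[:k]]


# toy records standing in for rows of the merged DataFrame
records = [
    {"filepath": "a.jpg", "yolo_conf_persons": 0.91},
    {"filepath": "b.jpg", "yolo_conf_persons": float("nan")},
    {"filepath": "c.jpg", "yolo_conf_persons": 0.35},
    {"filepath": "d.jpg", "yolo_conf_persons": 0.62},
]
lowest = rank_by_confidence(records, k=3)
print(lowest)
# prints: ['b.jpg', 'c.jpg', 'd.jpg']
```

With the real data, df_merged.to_dict("records") would provide the input in this shape.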

This short analysis shows that the model encounters systematic issues with images in the following clusters:

Trains with people standing outside appearing very small due to the large size of the train
Buses and other large vehicles with people inside hardly visible
Planes with people standing outside
Close up images of food with only hands or fingers of people shown

You can decide whether these issues really impact your person detection objectives, and if so, consider enhancing the dataset with additional training data to optimize the model’s performance in these specific scenarios.
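To put a number on how often persons are missed in an image, one option is to compare the ground-truth boxes with the YOLOv8 boxes by intersection over union (IoU). This is a simplified sketch and not a full evaluation protocol: it only counts ground-truth boxes that no detection overlaps sufficiently, without one-to-one matching. The threshold of 0.5 and the names iou and count_missed are assumptions chosen for illustration. Boxes are expected in the [x_min, y_min, x_max, y_max] format used in bboxs_persons and yolo_bboxs_persons.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x_min, y_min, x_max, y_max] boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_missed(gt_boxes, pred_boxes, threshold=0.5):
    """Number of ground-truth boxes that no predicted box overlaps at IoU >= threshold."""
    return sum(
        1
        for gt in gt_boxes
        if all(iou(gt, pred) < threshold for pred in pred_boxes)
    )


# toy example: two annotated persons, only the first one is detected
gt = [[0.0, 0.0, 0.5, 0.5], [0.5, 0.5, 1.0, 1.0]]
pred = [[0.0, 0.0, 0.5, 0.5]]
missed = count_missed(gt, pred)
print(missed)
# prints: 1
```

Averaging this count per major_category would indicate which contexts, such as trains or buses, are affected most.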

What’s next

Visualization for Object Detection can be made easier with the use of pre-trained models and tools like Spotlight that enhance the data science workflow. Give the code a try with your own data and let us know your results in the comments!


References

[1] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár, Microsoft COCO: Common Objects in Context (2014), arXiv

[2] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020), arXiv