Dissecting Visionary Titans

The anatomy of ViT, DETR and DeiT models in computer vision

A.I Hub
8 min read · Aug 11, 2024

In the ever-evolving world of computer vision, three groundbreaking models have redefined how machines perceive and interpret images: Vision Transformers (ViT), Detection Transformers (DETR) and Data-efficient image Transformers (DeiT). These models don’t just push the boundaries of what’s possible, they shatter them, offering unprecedented accuracy, efficiency and scalability. In this deep dive, we will explore the anatomy of these state-of-the-art CV models and uncover the innovations that make them the gold standard in visual processing today.

Table of Contents

  • Introduction
  • System requirements
  • Image pre-processing with example
  • Vision transformers architecture
  • AI eye doctor

Introduction


This section presents an in-depth exploration of the Vision Transformer (ViT), a novel approach in computer vision that leverages the transformer architecture traditionally associated with breakthroughs in natural language processing. ViT is crucial in the field of computer vision because it introduces a method for processing images as sequences of patches, applying self-attention across these patches to understand the global context of the image and enhancing performance on complex tasks such as image classification.

Alongside ViT, we will delve into image pre-processing, an indispensable stage that involves resizing, normalizing and augmenting images to make them compatible with transformer models. This process ensures that our models are fed high-quality, standardized data, which is crucial for effective learning and accurate results. The section also covers the Data-efficient image Transformer (DeiT) and the Detection Transformer (DETR), two advanced iterations of transformer-based models. DeiT refines the training process through knowledge distillation, leading to more efficient learning when data is scarce, while DETR revolutionizes object detection by interpreting an image as a set of objects, eliminating the need for the complex region proposal networks used in traditional methods.
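
To make DETR’s set-based detection idea concrete, here is a minimal sketch that runs a pre-trained DETR checkpoint from the Hugging Face transformers library on an arbitrary image. The checkpoint name facebook/detr-resnet-50, the placeholder image path and the 0.9 confidence threshold are illustrative choices, not requirements of this section.

import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

# Load a pre-trained DETR model and its matching image processor
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

# Placeholder path; any RGB image works
img = Image.open("example.jpg").convert("RGB")

# DETR predicts a set of (class, box) pairs directly, with no region proposals
inputs = processor(images=img, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep only detections above a 0.9 confidence threshold
target_sizes = torch.tensor([img.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())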

System Requirements

Setting Up Environment

  1. Install Anaconda on the local machine.
  2. Create a virtual environment.
  3. Install necessary packages in the virtual environment.
  4. Configure and Start Jupyter Notebook.
  5. Connect Google Colab with your local runtime environment.

Installing Anaconda On Local System

  1. Go to the Anaconda download page: https://www.anaconda.com/products/distribution
  2. Download the appropriate version for your computer.
  3. Follow the instructions provided by the installer.
  4. If the installer prompts you to add Anaconda to the system PATH variable, please do so. This enables you to seamlessly use Anaconda’s features from the command line.
  5. Check whether the installation was successful by typing the following command in the terminal.
conda --version

Create a Virtual Environment

  1. Open the terminal on your local machine.
  2. Type the following command and press Enter to create a new virtual environment. In the code below, the virtual environment name is transformer_learn and the Python version is 3.11.
conda create --name transformer_learn python=3.11

3. Once the environment has been created, activate it by typing the following command.

conda activate transformer_learn

4. Install the necessary packages in your environment. The following are the requirements for section 2; install the packages relevant to the section you are working through.

pip3 install transformers
pip3 install datasets
pip3 install git+https://github.com/huggingface/diffusers
pip3 install accelerate
pip3 install ftfy
pip3 install tensorboard
pip3 install Jinja2

Activate Virtual Environment

conda activate transformer_learn

To proceed with the coding tasks outlined in this section, please install the
necessary packages listed below.

pip install transformers
pip install datasets
pip install accelerate
pip install torch
pip install torchvision
pip install scikit-learn
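
As a quick sanity check, the short snippet below (an optional sketch, not part of the original setup steps) prints the installed package versions and reports whether a GPU is visible to PyTorch.

import torch
import torchvision
import transformers
import datasets

# Print versions to confirm the environment is set up correctly
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)

# Check whether CUDA (GPU) acceleration is available
print("CUDA available:", torch.cuda.is_available())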

Image Pre-processing

Image owned by Eastgate software

Image pre-processing is an essential step in computer vision. Similar to NLP, where raw text is converted into embeddings, certain steps must be conducted before feeding images into any machine learning model. The essential steps of image pre-processing are:

  • Image resizing — Most ML models require fixed image dimensions.
    Thus, based on the model requirement, you need to resize the image.
    For example, if you are using vit_base_patch16_224, the model
    requires your image to be of 224x224 dimensions. This is an essential
    step.
  • Image normalization — This is the process of scaling pixel values to a
    specific range, usually between 0 and 1, or -1 and 1. It helps stabilize
    the learning process, making it easier for the model to converge and
    learn optimal weights. There are many techniques, such as min-max
    scaling, mean-standard deviation scaling, and dividing pixel values by
    255 (see the short sketch after this list). Although it is an optional
    step, it is highly recommended.
  • Data augmentation — This involves applying random transformations
    to the original images. It helps improve the model’s generalization
    capabilities by exposing it to a diverse set of examples. Additionally,
    data augmentation techniques can be used to create new samples by
    transforming original samples. This is an optional step. Table 1.1
    illustrates basic data augmentation techniques.
Table 1.1 - Data augmentation techniques
  • Grayscale conversion — This involves converting color images into
    grayscale with a single channel. It is useful when color information is
    not relevant to the task at hand. This reduces the image size, thus
    decreasing the computation requirements for training and inference. It
    is an optional step.
  • RGB conversion — Sometimes, color images may have additional
    channels, like an alpha channel. In such cases, you need to convert the
    images into RGB format. This is an optional step.
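
The following is a minimal sketch of the normalization techniques mentioned above, assuming a uint8 RGB image loaded as a NumPy array; the file name sample.jpg is only a placeholder.

import numpy as np
from PIL import Image

# Load an image as a float array with values in [0, 255]
img = np.asarray(Image.open("sample.jpg").convert("RGB"), dtype=np.float32)

# Technique 1: divide by 255 so pixel values fall in [0, 1]
img_01 = img / 255.0

# Technique 2: min-max scaling to [0, 1] using the image's own range
img_minmax = (img - img.min()) / (img.max() - img.min())

# Technique 3: mean-standard deviation scaling (zero mean, unit variance per channel)
img_std = (img_01 - img_01.mean(axis=(0, 1))) / img_01.std(axis=(0, 1))

print(img_01.min(), img_01.max(), img_std.mean().round(3))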

Examples of Image Processing

Now, let us walk through a demo of these image pre-processing techniques.

import torch
import torchvision.transforms as T
from PIL import Image
import requests
from io import BytesIO
# Load an example image
url = "provide your image file path link here"
response = requests.get(url)
img = Image.open(BytesIO(response.content))
# convert to RGB
img = img.convert("RGB")

Display the original image using the following code:

from IPython.display import display
display(img)
Figure 1.2 - Example image before pre-processing

Performing Pre-processing

The transformation does the following things:

  • RandomRotation — Randomly rotates the image between -15 and 15
    degrees. Fills vacant pixels with zero.
  • RandomResizedCrop — Resizes the image to a size of 224 x 224 pixels.
    Additionally, scales the image randomly between 80% and 100% of its
    original size.
  • RandomHorizontalFlip — Applies a horizontal flip to the image with a
    50% probability.
  • RandomVerticalFlip — Applies a vertical flip to the image with a 50%
    probability.
  • ColorJitter — Adjusts the image’s brightness, contrast, and saturation.
  • ToTensor — Converts the image to a PyTorch tensor.
  • Normalize — Normalizes the image using the specified mean and
    standard deviation values.

Refer to the following code:

# Define the resizing, data augmentation, and normalization pipeline
transforms = T.Compose([
    T.RandomRotation(degrees=(-15, 15), fill=0),
    T.RandomResizedCrop(size=(224, 224), scale=(0.8, 1.0)),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Apply the data augmentation pipeline to the image
augmented_img = transforms(img)

# To visualize the augmented image, you can convert it back to a PIL image
# Don't forget to undo the normalization before converting it
unnormalized_img = T.Compose([
    T.Normalize(mean=[-0.485/0.229, -0.456/0.224, -0.406/0.225],
                std=[1/0.229, 1/0.224, 1/0.225]),
    T.ToPILImage(),
])(augmented_img)

Display the transformed image using the following code:

from IPython.display import display
display(unnormalized_img)
Figure 1.3 - Example image after image pre-processing

Vision Transformers Architecture

Dosovitskiy et al. proposed the Vision Transformer architecture (ViT),
which is an adaptation of the original transformer architecture for image
classification tasks. The idea behind ViT is to treat an image as a
sequence of fixed-size, non-overlapping patches. This is similar to how a
transformer treats natural language as a sequence of tokens.

Figure 1.4 - ViT architecture

The key components of ViT are as follows:

  • Image pre-processing — Resize the image and split it into non-
    overlapping patches, for example, 16x16 pixels.
  • Patch embedding — Flatten each patch into a 1D vector and linearly
    embed it into a high-dimensional representation. This is similar to
    token embedding in NLP.
  • Positional encoding — Add location information to each patch.
  • Transformer layers — Pass the patch embeddings through the
    Transformer layers.
  • Classification — Pass the output of the Transformer layers to fully
    connected layers and apply a softmax function to calculate the class
    probabilities for classification. A minimal sketch of these steps follows
    this list.
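
To illustrate how an image becomes a sequence of patch tokens, here is a minimal, illustrative sketch of the patchify-and-embed pipeline. The shapes follow vit_base_patch16_224 (patch size 16, hidden size 768, 12 heads), but the module choices and the reduced layer count are assumptions made for the demo, not the exact internals of any library implementation.

import torch
import torch.nn as nn

batch, channels, height, width = 1, 3, 224, 224     # input image size
patch_size, hidden_dim, num_classes = 16, 768, 4

x = torch.randn(batch, channels, height, width)     # a dummy pre-processed image

# Patch embedding: a conv with stride = patch size splits the image into
# non-overlapping 16x16 patches and linearly projects each one to 768 dims
patch_embed = nn.Conv2d(channels, hidden_dim, kernel_size=patch_size, stride=patch_size)
patches = patch_embed(x)                            # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)         # (1, 196, 768) sequence of patch tokens

# Prepend a learnable [CLS] token and add positional encodings
cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1] + 1, hidden_dim))
tokens = torch.cat([cls_token.expand(batch, -1, -1), tokens], dim=1) + pos_embed

# Pass the sequence through standard Transformer encoder layers
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=12, batch_first=True),
    num_layers=2)                                   # only 2 layers here to keep the demo light
encoded = encoder(tokens)                           # (1, 197, 768)

# Classification head on the [CLS] token, followed by softmax
logits = nn.Linear(hidden_dim, num_classes)(encoded[:, 0])
probs = logits.softmax(dim=-1)
print(probs.shape)                                  # torch.Size([1, 4])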

The original ViT model was pre-trained on the ImageNet-21k dataset, which comprises 14 million images and 21K classes. Its pre-training objective was to minimize the cross-entropy loss between predicted class probabilities and true labels. You can obtain the ViT model through the timm or Hugging Face libraries. Various ViT model variations exist, based on factors such as patch size, image size and more. As of April 30, 2023, there are 143 models available in the timm library. To list all ViT models available in timm, you can run the following code.

import timm

# List every model available in timm, then keep only the ViT variants
all_models = timm.list_models()
vit_models = [model for model in all_models if 'vit' in model]

print("Available ViT models in timm:")
for model in vit_models:
    print(model)

Let us see how we can declare a ViT model using timm:

import timm
model = timm.create_model("vit_base_patch16_224", in_chans=3, num_classes=4, pretrained=True)

The above code loads a pre-trained ViT model with 12 layers, a hidden size
of 768, 12 attention heads, and a patch size of 16x16 pixels. The input
image size is 224x224 pixels. Additionally, it adds a classification head
with 4 classes on the output.
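
Since the text above also mentions the Hugging Face libraries, here is an equivalent sketch using the transformers library; the google/vit-base-patch16-224-in21k checkpoint name is an illustrative, commonly used choice rather than a requirement of this section.

from transformers import ViTForImageClassification, ViTImageProcessor

# Load the ViT-Base/16 architecture from the Hugging Face hub and
# attach a fresh 4-class classification head
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=4)

# The matching processor resizes to 224x224 and normalizes the pixel values
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

inputs = processor(images=img, return_tensors="pt")   # img is the PIL image from earlier
logits = model(**inputs).logits
print(logits.shape)                                    # torch.Size([1, 4])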

AI Eye Doctor — Project


Carry out the following project:

  • Obtain the cataract dataset from Kaggle: https://www.kaggle.com/datasets/jr2ngb/cataractdataset.
  • This dataset contains eye images and is categorized into four classes:
    normal, cataract, glaucoma and retina_diseases.
  • The objective is to develop a classifier capable of automatically
    identifying the type of eye disease present in the images.

To aid you in this task, a complete end-to-end implementation is provided in
the notebook located in the section directory on GitHub.
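
As a starting point, the following is a minimal fine-tuning sketch, assuming the Kaggle dataset has been extracted into a folder named dataset/ with one sub-folder per class; the path, batch size, learning rate and single training epoch are illustrative choices, not prescriptions from the full notebook.

import timm
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms as T

# Re-use the pre-processing pipeline: resize to 224x224, augment and normalize
train_tf = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Assumes the Kaggle images were extracted into dataset/<class_name>/... folders
train_ds = datasets.ImageFolder("dataset", transform=train_tf)
train_dl = DataLoader(train_ds, batch_size=16, shuffle=True)

# ViT-Base/16 with a 4-class head, as declared earlier in this section
device = "cuda" if torch.cuda.is_available() else "cpu"
model = timm.create_model("vit_base_patch16_224", num_classes=4, pretrained=True).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

# One illustrative training epoch
model.train()
for images, labels in train_dl:
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
print("last batch loss:", loss.item())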

Conclusion

As we conclude our journey through the landscape of system requirements, image pre-processing and the transformative power of AI in healthcare, it’s clear that the fusion of technology and medicine is reshaping our future. The "AI Eye Doctor" project exemplifies how careful preparation, from understanding the hardware demands to mastering image pre-processing techniques, culminates in life-changing applications. By harnessing these tools, we are not just building models; we are paving the way for intelligent systems that can diagnose and heal with unprecedented accuracy, bringing us closer to a world where AI is a cornerstone of compassionate and effective healthcare.
