Computer Vision Tasks With Transformers
In the ever-evolving field of computer vision, transformers have emerged as a game-changer, reshaping how machines see and understand the world. No longer confined to language processing, these powerful models now tackle vision tasks with remarkable accuracy and efficiency. From detecting objects in cluttered scenes to segmenting intricate details, transformers are setting new benchmarks and opening the door to new applications. This section explores how these models are transforming the landscape of visual intelligence.
Table of Contents
- Introduction
- System requirements
- Computer vision tasks
- Image classification
- Image segmentation
- Diffusion models
- Learnable parameters
Introduction
In this section, we will delve into teaching machines to see and interpret the
world around us, recognize images, decipher emotions and even generate
visual data. By the end of this section, you will understand the fundamental
computer vision tasks and know how to apply transformers to accomplish
them. We will also discuss the groundbreaking concept of diffusion models,
which have taken the field of image generation by storm.
System Requirements
For detailed instructions on setting up the environment, follow the steps
below:
Setting Up Environment
- Install Anaconda on the local machine.
- Create a virtual environment.
- Install necessary packages in the virtual environment.
- Configure and Start Jupyter Notebook.
- Connect Google Colab with your local runtime environment.
Installing Anaconda On Local System
- Go to the Anaconda download page at https://www.anaconda.com/products/distribution
- Download the appropriate version for your computer.
- Follow the instructions provided by the installer.
- If the installer prompts you to add Anaconda to the system’s PATH variable, do so. This lets you use Anaconda’s features seamlessly from the command line.
- Verify that the installation was successful by typing the following command in the terminal.
conda --version
Creating a Virtual Environment
1. Open the terminal on your local machine.
2. Type the following command and press Enter to create a new virtual environment. In the command below, the virtual environment name is torch_learn and the Python version is 3.11.
conda create --name torch_learn python=3.11
3. Once the environment has been created, activate it by typing the following command.
conda activate torch_learn
4. Install the necessary packages in your environment. The following are the requirements for section 2; install the packages needed for each section as you go.
pip3 install transformers
pip3 install datasets
pip3 install git+https://github.com/huggingface/diffusers
pip3 install accelerate
pip3 install ftfy
pip3 install tensorboard
pip3 install Jinja2
To proceed with the coding tasks outlined in this section, please install the
necessary packages detailed as follows:
pip install transformers
pip install datasets
pip install accelerate
pip install torch
pip install torchvision
pip install scikit-learn
pip install diffusers
Computer Vision Tasks
Table 1.1 illustrates the major tasks in computer vision. The models listed in
the table can be searched for and retrieved from
https://huggingface.co/models
In this section, we will delve deeper into the important computer
vision tasks listed in Table 1.1.
Image Classification
In section 7, CV Model Anatomy, ViT, DETR and DeiT (ViT.ipynb), we
conducted a cataract image classification project using ViT. In the
accompanying notebook of this section
(deit_and_resnet_comparison.ipynb), we performed the same experiment
with DeiT and ResNet50. Here are the accuracy results after 5 epochs.
- ViT — 61.16%
- DeiT — 66.12%
- ResNet50 — 29.75%
This experiment demonstrates that DeiT outperforms both ResNet50 and
ViT. Prior research has also shown transformers to outperform ResNet50 in
many fine-tuning tasks. Here are a few benefits of using transformers for
image classification; a short loading sketch follows the list.
- Complexity and transfer learning — Both ViT and DeiT have higher complexity due to the self-attention mechanism. Pre-trained transformers have been shown to generalize better across a wide variety of tasks. If you are using transfer learning, transformers could be more beneficial than ResNet50.
- Multi-modal tasks — Transformers can handle multi-modal data, such as images and text or images and audio, more naturally than CNNs. Thus, if your task involves multi-modal data, ViT and DeiT might be more suitable.
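The sketch below shows how a pre-trained vision transformer can be loaded for image classification with the transformers library. It is a minimal illustration rather than the notebook's exact code; the google/vit-base-patch16-224 checkpoint and the sample image path are assumptions chosen for demonstration.

from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Assumed checkpoint; swap in a fine-tuned cataract classifier as needed.
checkpoint = "google/vit-base-patch16-224"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(checkpoint)

image = Image.open("sample.jpg").convert("RGB")   # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])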
Our implementation of the cataract image classifier is relatively basic. There
are a couple of issues.
- Our fine-tuning dataset was quite small.
- We performed minimal image pre-processing.
To further improve performance on the cataract dataset, consider
experimenting with various data augmentation techniques; one option is to
double the dataset size using augmentation, as sketched below.
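As a rough sketch, the snippet below shows one way to double a dataset with torchvision transforms. The specific transforms, their parameters and the column names are illustrative assumptions, not the book's official recipe.

from torchvision import transforms

# Assumed augmentations; tune these for the cataract images.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=1.0),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

def double_with_augmentation(examples):
    # Keep the originals and append one augmented copy of each image,
    # effectively doubling the number of training examples.
    images = list(examples["image"])          # assumed column name
    labels = list(examples["label"])          # assumed column name
    examples["image"] = images + [augment(img) for img in images]
    examples["label"] = labels + labels
    return examples

# Hypothetical usage with a HuggingFace dataset:
# augmented_train = dataset["train"].map(double_with_augmentation, batched=True)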
Image Segmentation
Image segmentation involves dividing an image into segments or regions, where each segment represents a specific object or area of interest. Object detection and image segmentation might seem similar; however, there is a significant difference. The primary goal of object detection is to identify the presence of objects and provide a rough estimate of their location using bounding boxes. Image segmentation, on the other hand, offers a more detailed representation of objects by assigning a class label to every pixel in the image, resulting in pixel-wise classification. This allows for the identification of not only the presence of objects but also their precise shape and boundaries. Let’s consider some examples to understand when to use object detection and when to use image segmentation.
- Object detection in autonomous vehicles — Object detection is used to identify the presence of various objects, such as traffic lights, pedestrians and other vehicles. It can quickly determine the presence of these objects and their approximate locations, which is crucial for real-time decision making in autonomous driving.
- Image segmentation in medical imaging — In medical imaging, such as CT scans or X-rays, it is essential to identify the exact structure of organs or tumors. Image segmentation assigns a class to every pixel, resulting in a fine-grained representation of the object. For instance, using object detection on the liver would only produce a rectangular bounding box, which is not very helpful. In contrast, image segmentation provides the precise structure of the liver, which is crucial for accurate diagnosis and treatment planning.
Project — Image Segmentation For Our Diet Calculator
Problem statement — We are developing a calorie estimation app that
calculates the caloric content of food based on a picture. The first step in
this process is to capture an image of the meal and then identify the different
food categories through image segmentation. By analyzing the segmentation
results, we can estimate the quantities of various food items and ultimately
determine the total calorie count. In this project, we will create a machine
learning model capable of performing image segmentation on food items.
Approaches:
We can approach it in two steps.
- Use the FoodSeg103-BenchMark-V1 dataset, which contains 7,118 images of different food categories: https://github.com/LARC-CMU-SMU/FoodSeg103-Benchmark-v1
- Use nvidia/mit-b0 from HuggingFace as the pre-trained model.
Solution:
The end-to-end implementation is provided in the accompanying notebook.
Figure 1.1 shows the result of the inference.
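The notebook contains the full fine-tuning pipeline; the snippet below is only a minimal sketch of how a SegFormer-style segmentation model can be built on the nvidia/mit-b0 backbone and run on an image. The label count, image path and processor defaults are assumptions, and the decode head remains randomly initialized until it is fine-tuned on FoodSeg103.

from PIL import Image
import torch
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

# Default processor settings; adjust to match the training pipeline.
processor = SegformerImageProcessor()

# Assumed label count: 103 food classes plus background.
model = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/mit-b0",
    num_labels=104,
)

image = Image.open("meal.jpg").convert("RGB")   # hypothetical food image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits             # shape: (batch, num_labels, H/4, W/4)

# Upsample the logits to the original image size and take the per-pixel argmax.
upsampled = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
segmentation_map = upsampled.argmax(dim=1)[0]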
Diffusion Model — Unconditional Image Generation
Unconditional image generation is the process of generating realistic images
without providing any conditional information as input. Over the past few
years, numerous generative models have been proposed to address this
problem, with diffusion models demonstrating some of the most promising
results. In this section, we will explore the principles behind diffusion
models and present a project that showcases end-to-end techniques for
training a model specifically tailored for unconditional image generation.
Figure 1.2 depicts the diffusion model, and the code block below shows the
model definition. The general principle behind a diffusion model is to
simulate a process that gradually transforms an original image into random
noise and then reverses that process to reconstruct an image from the noise.
In the context of image generation, diffusion models consist of two main steps.
from diffusers import UNet2DModel

# config.image_size comes from the training configuration
# (for example, 128 for 128 x 128 images).
model = UNet2DModel(
    sample_size=config.image_size,                  # resolution of the generated images
    in_channels=3,                                  # RGB input
    out_channels=3,                                 # RGB output
    layers_per_block=3,                             # ResNet layers per UNet block
    block_out_channels=(64, 128, 256, 512, 1024),   # channels per block
    down_block_types=(
        "DownBlock2D",
        "AttnDownBlock2D",                          # downsampling block with self-attention
        "DownBlock2D",
        "DownBlock2D",
        "DownBlock2D",
    ),
    up_block_types=(
        "UpBlock2D",
        "UpBlock2D",
        "AttnUpBlock2D",                            # upsampling block with self-attention
        "UpBlock2D",
        "UpBlock2D",
    ),
)
Forward Diffusion
In this step, noise is introduced into the image in a controlled manner,
making it progressively more like random noise. At each iteration, a new
noisy image is generated from the previous one and a predefined amount of
noise. The process goes through a series of steps, ultimately transforming the
original image into pure noise. There are two major building blocks of
forward diffusion; a short sketch of the noising step follows the list.
- DownBlock2D — This block is responsible for downsampling the input feature maps while increasing the number of channels. In the preceding code example, we increased the channels in each subsequent block (64, 128, 256, 512, 1024). It typically consists of a series of convolutional layers followed by batch normalization and activation functions (for example, ReLU), and a downsampling operation such as max-pooling or strided convolution.
- AttnDownBlock2D — In addition to the functionality of DownBlock2D, it includes an attention mechanism, such as self-attention or spatial attention, within the block.
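To make the forward process concrete, here is a small sketch of how noise can be added to a batch of training images with the diffusers library's DDPMScheduler. The number of timesteps and the random image batch are assumptions for illustration.

import torch
from diffusers import DDPMScheduler

# Assumed schedule length of 1,000 diffusion steps.
noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

clean_images = torch.randn(4, 3, 128, 128)      # stand-in for a real image batch
noise = torch.randn_like(clean_images)          # Gaussian noise to be mixed in
timesteps = torch.randint(                      # a random diffusion step per image
    0, noise_scheduler.config.num_train_timesteps, (clean_images.shape[0],)
)

# Each image is pushed toward pure noise according to its sampled timestep.
noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)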
Inference Process
Figure 1.3 illustrates the inference process in a diffusion model. As depicted,
the model starts with pure noise as its input. In each subsequent step, the
model attempts to remove the noise, gradually reconstructing a new image.
The process follows these main steps.
- Input pure noise — At the beginning of the inference process, the model takes a pure noise image. This noise image serves as the starting point for the model to reconstruct the target image.
- Denoising steps — In each denoising step, the model tries to estimate the amount of noise that was added during the forward diffusion process. The model then subtracts the estimated noise from the current image, refining the image’s appearance. These denoising steps are performed for a predefined number of steps, with the model continually refining the image at each step.
- Final image reconstruction — At the end of the inference process, after going through all the denoising steps, the model generates a new image. This image is the result of the model’s attempt to reverse the forward diffusion process, transforming the pure noise input into a realistic image. As you can see in Figure 1.3, our diffusion model was able to create the picture of the baby at the final step.
The inference process in a diffusion model involves starting
with pure noise and, through a series of denoising steps, reconstructing a
new image. Figure 1.3 provides a visual representation of this process,
highlighting the gradual refinement of the image as noise is removed over
multiple steps.
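As a hedged sketch, the denoising loop described above can be run end to end with the diffusers library's DDPMPipeline, pairing the UNet2DModel defined earlier with a DDPM scheduler. The batch size, seed and step count are assumptions for illustration.

import torch
from diffusers import DDPMPipeline, DDPMScheduler

# Assumes model is the UNet2DModel defined earlier.
noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
pipeline = DDPMPipeline(unet=model, scheduler=noise_scheduler)

# Start from pure noise and iteratively denoise for a fixed number of steps.
generator = torch.manual_seed(42)
images = pipeline(batch_size=4, generator=generator, num_inference_steps=1000).images

images[0].save("generated_sample.png")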
Learnable Parameters
The main learnable parameters are found in the denoising model, which is
used in both the forward and backward passes. The denoising model
typically consists of a neural network such as a U-Net or a transformer. Its
learnable parameters include the weights and biases of its various layers,
such as convolutional layers, attention mechanisms and linear layers,
depending on the specific architecture.
During forward diffusion, noise is added according to a predefined schedule
and the denoising model learns to predict the noise that was introduced at
each step. During backward diffusion, the same model is applied in reverse:
at every step it predicts the amount of noise to remove from the current image.
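To connect these parameters to training, the sketch below shows one typical diffusion training step: noise an image batch, let the U-Net predict that noise, and backpropagate the mean squared error through the model's weights. The variable names, learning rate and random batch are assumptions, reusing model and noise_scheduler from the earlier sketches.

import torch
import torch.nn.functional as F

# Reuses model (UNet2DModel) and noise_scheduler (DDPMScheduler) from earlier sketches.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # the learnable weights and biases

clean_images = torch.randn(4, 3, 128, 128)      # stand-in for a real training batch
noise = torch.randn_like(clean_images)
timesteps = torch.randint(
    0, noise_scheduler.config.num_train_timesteps, (clean_images.shape[0],)
)

# Forward diffusion: create the noisy versions of the batch.
noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)

# The U-Net predicts the added noise; the loss updates its parameters.
noise_pred = model(noisy_images, timesteps).sample
loss = F.mse_loss(noise_pred, noise)

loss.backward()
optimizer.step()
optimizer.zero_grad()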
Project — DogGenDiffusion
You are part of a creative art company and want to develop unique dog
artwork for inspiration. Your task is to create a diffusion model that performs
unconditional image generation of dogs.
Project name — DogGenDiffusion
Dataset — We will utilize the BirdL/DALL-E-Dogs dataset from
HuggingFace, which contains 1,104 unique dog images.
Data transformation — All images will be resized to 128 x 128 pixels, and
we will experiment with various data transformation techniques.
Model — The UNet2DModel from HuggingFace’s Diffusers library will be
employed for this task.
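The full training loop lives in the accompanying notebook; below is only a small sketch of the data loading and transformation step for DogGenDiffusion. The split name, image column and normalization values are assumptions that may need adjusting to the actual dataset layout.

from datasets import load_dataset
from torchvision import transforms

# Assumed split name; check the dataset card for the actual splits.
dataset = load_dataset("BirdL/DALL-E-Dogs", split="train")

preprocess = transforms.Compose([
    transforms.Resize((128, 128)),              # resize to 128 x 128 pixels
    transforms.RandomHorizontalFlip(),          # one of several possible augmentations
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),         # scale pixel values to [-1, 1]
])

def transform(examples):
    # "image" is the assumed column name holding PIL images.
    examples["pixel_values"] = [
        preprocess(image.convert("RGB")) for image in examples["image"]
    ]
    return examples

dataset.set_transform(transform)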
Conclusion
As we wrap up our exploration of image segmentation and its application in projects like our diet calculator, it is evident that this technology is more than just a tool; it is a gateway to precision and innovation. The integration of diffusion models and learnable parameters exemplifies how advanced techniques can turn raw visual data into actionable insights with remarkable accuracy. This journey into the depths of visual processing not only highlights the transformative power of these technologies but also underscores their potential to revolutionize practical applications, from dietary planning to far beyond. The future of image segmentation is bright and brimming with possibilities, setting the stage for even more groundbreaking advancements in the realm of visual intelligence.