Transforming Vision
In the ever-evolving field of computer vision, transformers have emerged as a genuine game changer, transforming how machines see and understand the world. No longer confined to language processing, these powerful models now tackle vision tasks with remarkable accuracy and efficiency. From detecting objects in cluttered scenes to segmenting intricate details, transformers are setting new benchmarks and opening the door to groundbreaking innovations. Dive into the transformative power of computer vision with transformers and discover how these sophisticated models are reshaping the landscape of visual intelligence.
Table of Contents
- Introduction
- System requirements
- Computer vision tasks
- Image classification
- Conclusion
Introduction
In this section, we will delve into teaching machines to see and interpret the
world around us: recognizing images, deciphering emotions, and even generating
visual data. By the end of this section, you will understand the fundamental
computer vision tasks and know how to apply transformers to achieve these
objectives. We will also discuss the groundbreaking concept of stable
diffusion, which has taken the field of image generation by storm.
System Requirements
- Install Anaconda on the local machine.
- Create a virtual environment.
- Install necessary packages in the virtual environment.
- Configure and Start Jupyter Notebook.
- Connect Google Colab with your local runtime environment.
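For the last step, Colab's local-runtime feature expects a Jupyter server that allows Colab's origin. A sketch of the documented setup (port 8888 is an assumption; any free port works):
pip install jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws
jupyter notebook --NotebookApp.allow_origin='https://colab.research.google.com' --port=8888 --NotebookApp.port_retries=0
In Colab, choose Connect > Connect to a local runtime and paste the URL (including the token) that the notebook server prints.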
Installing Anaconda on the Local System
- Go to the Anaconda download page: https://www.anaconda.com/products/distribution
- Download the appropriate version for your computer.
- Follow the instructions provided by the installer.
- If the installer prompts you to add Anaconda to your system's PATH variable, please do so. This enables you to seamlessly use Anaconda's features from the command line.
- Verify that the installation was successful by typing the following command in the terminal.
conda --version
Create a Virtual Environment
1. Open the terminal on your local machine.
2. Type the following command and press Enter to create a new virtual environment. In the command below, the virtual environment name is torch_learn and the Python version is 3.11.
conda create --name torch_learn python=3.11
3. Once the environment has been created, activate it by typing the following command.
conda activate torch_learn
4. Install the necessary packages in your environment. The following are the requirements for Section 2; install what each section requires as you go.
pip3 install transformers
pip3 install datasets
pip3 install git+https://github.com/huggingface/diffusers
pip3 install accelerate
pip3 install ftfy
pip3 install tensorboard
pip3 install Jinja2
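After installing, you can optionally confirm that the key packages import cleanly. A minimal check, run inside the activated environment:
python -c "import transformers, datasets, diffusers, accelerate, ftfy; print(transformers.__version__)"
If this prints a version number without errors, the environment is ready.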
Activate Virtual Environment
conda activate torch_learn
To proceed with the coding tasks outlined in this section, install the
necessary packages detailed as follows.
pip install transformers
pip install datasets
pip install accelerate
pip install torch
pip install torchvision
pip install scikit-learn
pip install diffusers
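A quick, optional sanity check that PyTorch is installed and can see a GPU (the models in this section fall back to the CPU if not, only running more slowly):
python -c "import torch, torchvision; print(torch.__version__, torch.cuda.is_available())"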
Computer Vision Tasks
Table 1.1 illustrates the major tasks in computer vision. The models listed in
the table can be searched for and retrieved from https://huggingface.co/models.
In the following sections, we will delve deeper into the important computer
vision tasks listed in Table 1.1.
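As a quick preview, the Hugging Face pipeline API can run models for these tasks from the Hub in a few lines. A minimal sketch for image classification; the checkpoint name and image path are illustrative assumptions:
from transformers import pipeline

# Load a pre-trained image classification pipeline from the Hub.
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# The pipeline accepts a local path, URL, or PIL image and returns
# a list of labels with confidence scores.
for prediction in classifier("path/to/your/image.jpg"):
    print(prediction["label"], round(prediction["score"], 4))
The same pattern works for other tasks in Table 1.1 by changing the task string, for example "object-detection" or "image-segmentation".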
Image Classification
In Section 7, CV Model Anatomy: ViT, DETR, and DeiT, we conducted a
cataract image classification project using ViT. In the accompanying
notebook for this section (deit_and_resnet_comparison.ipynb), we performed
the same experiment with DeiT and ResNet50. Here are the accuracy results
after 5 epochs.
- ViT — 61.16%
- DeiT — 66.12%
- ResNet50 — 29.75%
This experiment demonstrates that DeiT outperforms both ResNet50 and
ViT. Prior research has also shown transformers to outperform ResNet50 in
many fine-tuning tasks. Here are a few benefits of using transformers for
image classification.
- Complexity and transfer learning: Both ViT and DeiT have higher complexity due to the self-attention mechanism. Pre-trained transformers have been shown to generalize better across a wide variety of tasks. If you are using transfer learning, then transformers could be more beneficial than ResNet50.
- Multi-modal tasks: Transformers can handle multi-modal data, such as images and text or images and audio, more naturally than CNNs. Thus, if your task involves multi-modal data, ViT and DeiT might be more suitable.
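For reference, here is a minimal sketch of how the DeiT and ResNet50 checkpoints used in the comparison above might be loaded with the transformers Auto classes. The Hub checkpoint names are assumptions, and num_labels should match the cataract label set:
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Assumed Hub checkpoints for the DeiT and ResNet50 baselines.
for checkpoint in ("facebook/deit-base-distilled-patch16-224", "microsoft/resnet-50"):
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModelForImageClassification.from_pretrained(
        checkpoint,
        num_labels=2,                  # assumption: binary cataract vs. normal
        ignore_mismatched_sizes=True,  # swap the pre-trained head for a new one
    )
    print(checkpoint, model.num_parameters())
From here, both models plug into the same fine-tuning loop used for the ViT experiment.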
Our implementation of the cataract image classifier is relatively basic. There
are a couple of issues listed below.
- Our fine-tuning dataset was quite small.
- We performed minimal image pre-processing.
To further improve performance on the cataract dataset, consider
experimenting with various data augmentation techniques. One option might
be to double the dataset size using data augmentation, as sketched below.
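As one possible starting point, here is a minimal augmentation sketch using torchvision. The specific transforms and the example file path are assumptions; tune them to the dataset:
from PIL import Image
from torchvision import transforms

# A light augmentation pipeline: flips, small rotations, and color jitter.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

# Hypothetical example image; in practice, loop over the training set.
image = Image.open("cataract_example.jpg")

# Keeping the original plus one augmented copy per image doubles the dataset.
pair = [image, augment(image)]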
Conclusion
In the realm of computer vision, the intersection of a properly configured environment and sophisticated image classification techniques marks a new era of technological advancement. From the introduction through the system requirements, we have explored how these elements converge to tackle complex computer vision tasks. Mastering image classification is not just about technological prowess; it is about harnessing the full potential of these advancements to drive innovation and achieve strong accuracy. The future of computer vision is here, and it promises to redefine our interaction with the visual world, turning what was once science fiction into tangible reality.