Transformers Unleashed
The groundbreaking AI architecture revolutionizing modern technology
Imagine a world where machines understand and generate human language with the finesse of a skilled writer, where images are analyzed and interpreted with unprecedented accuracy. This is the power of the transformer architecture, a breakthrough that has redefined the boundaries of artificial intelligence. Introduced in 2017, transformers have rapidly become the backbone of cutting-edge AI models, driving innovations from chatbots to advanced image recognition systems. Welcome to the era of transformers, where the impossible becomes reality.
Table of Contents
- Introduction to transformer
- Transformer architecture
- Embedding
- Positional encoding
- Model input
- Encoding layer
- Attention mechanism
- Self attention
- Multi headed attention
- Decoder layer
Introduction to Transformer
Transformers have revolutionized the field of artificial intelligence (AI), marking a significant leap forward from traditional models. Introduced by Vaswani et al. in their groundbreaking 2017 paper "Attention Is All You Need," transformers leverage a mechanism known as self-attention, allowing them to weigh the importance of different words in a sentence dynamically. This innovation enables them to process and generate human-like text with remarkable accuracy.
Transformers underpin several state of the art models, including OpenAI’s GPT-4 and Google’s BERT. GPT-4, for instance, can compose essays, answer complex questions, and even create poetry, demonstrating the transformative power of this architecture. BERT, on the other hand, excels in understanding the context of words in search queries, significantly improving the relevance of search engine results.
The impact of transformers extends beyond natural language processing. In computer vision, models like Vision Transformers have shown that self-attention can be effectively applied to image recognition tasks, rivaling traditional convolutional neural networks. This versatility underscores the transformative potential of transformers across various AI domains, making them a cornerstone of modern AI research and application.
Transformer Architecture
There are many variants of the transformer; however, in this section, we will discuss the original transformer architecture proposed by Vaswani et al. (2017). They proposed the architecture for machine translation, for example, translating English to French. Let us highlight the most important aspects of the transformer architecture before going into detail.
- The transformer uses an encoder-decoder architecture for machine translation.
- The encoder converts the input sequence into a sequence of vectors, with the number of vectors equal to the length of the input sequence. It consists of multiple encoder blocks.
- The decoder also consists of multiple decoder blocks, and the sequence of vectors output by the encoder is fed to all decoder blocks.
- Multi-head attention is a primary component of both the encoder and the decoder.
- Positional encoding is a new concept introduced in the transformer architecture that encodes the positional information of each input token, representing its position in the input sequence.
Figure 1.1: Architecture of the transformer
Embedding
As shown in Figure 1.1, the input sequence in the transformer is represented by an embedding vector. Embedding is the process of representing a word or token as a vector of fixed length.
Before we go in depth into embeddings, let us understand how text was traditionally represented in NLP; this will help us appreciate why we use embeddings. Traditionally, textual data in machine learning has been represented as n-gram words. Consider the 1-gram case: if the corpus has 50,000 unique words, each input sequence would be represented by a 50,000-dimensional vector. We would fill these dimensions with the number of times each word appears in the specific input sequence.
However, this approach has several problems:
- Even for small input sequences, for example those with only two tokens, we require a high-dimensional vector (50,000 dimensions), resulting in a highly sparse vector.
- There is no meaningful way to perform mathematical operations on these high-dimensional vector representations.
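To make the sparsity problem concrete, the following is a minimal sketch of the 1-gram (count-based) representation described above, using a hypothetical tiny vocabulary in place of the 50,000-word one.

```python
# Hypothetical tiny vocabulary standing in for a 50,000-word one
vocabulary = ["i", "live", "in", "new", "york", "love", "dogs"]
word_to_index = {word: idx for idx, word in enumerate(vocabulary)}

def count_vector(sentence: str) -> list[int]:
    """1-gram representation: one dimension per vocabulary word, filled with counts."""
    vec = [0] * len(vocabulary)
    for word in sentence.lower().split():
        if word in word_to_index:
            vec[word_to_index[word]] += 1
    return vec

# Even a two-token input produces a vector as long as the whole vocabulary,
# with almost every entry equal to zero (highly sparse).
print(count_vector("i live"))   # [1, 1, 0, 0, 0, 0, 0]
```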
Embedding overcomes these challenges. Embedding is a technique used to represent a word or sequence by a vector of real numbers that captures the meaning and context of the word or phrase.
A very simple example of embedding is taking a set of words, such as cabbage, rabbit, eggplant, elephant, dog, and cauliflower, and representing each word as a vector in a two-dimensional space capturing features such as whether the word refers to a vegetable or an animal. The embedding is shown in Figure 1.2.
We can see that the first dimension of cabbage and cauliflower is almost the same, as both represent vegetables; they are located nearby in the first dimension. Also, we can perform addition and subtraction on these embeddings, because each dimension represents a specific concept and tokens are near each other if they represent similar concepts.
Interestingly, in the real world, we mostly use a pre-trained model like BERT or word2vec, which has been trained on billions of examples and extracts a large number of feature dimensions (BERT uses 768 dimensions). These embeddings are far more expressive than n-gram representations and offer greater flexibility in NLP tasks.
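The toy two-dimensional embedding above can be made concrete with a few lines of code. The following is a minimal sketch; the vectors and the two feature dimensions are made up for illustration, and it simply shows how nearby vectors encode similar concepts.

```python
import numpy as np

# Hypothetical 2-D embeddings: [vegetable-ness, animal-ness]
embeddings = {
    "cabbage":     np.array([0.90, 0.05]),
    "cauliflower": np.array([0.88, 0.04]),
    "eggplant":    np.array([0.85, 0.06]),
    "rabbit":      np.array([0.10, 0.90]),
    "dog":         np.array([0.05, 0.95]),
    "elephant":    np.array([0.04, 0.97]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words with similar meanings end up close together...
print(cosine(embeddings["cabbage"], embeddings["cauliflower"]))  # close to 1.0
# ...while unrelated words are far apart.
print(cosine(embeddings["cabbage"], embeddings["dog"]))          # much smaller
```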
Positional Encoding
Positional encoding in a transformer is used to provide the model with information about the position of each word in the input sequence. Unlike previous architectures such as the LSTM, where tokens are processed sequentially one by one, the transformer processes the input tokens in parallel. This means each token must also carry positional information.
Let us understand how positional encoding is done. In the Attention Is All You Need paper, the authors use the following formula for positional encoding:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i + 1) = cos(pos / 10000^(2i/d))

PE(pos, 2i) and PE(pos, 2i + 1) are the 2i-th and (2i + 1)-th dimensions of the positional encoding vector for position pos in the input sequence.
pos is the position of the word in the input sequence, starting from 0.
i is the index of the sine/cosine pair in the positional encoding vector, starting from 0.
d is the dimensionality of the embedding (512 in the original architecture).
This formula generates a set of positional encodings that are unique for each
position in the input sequence and that change smoothly as the position
changes.
It is important to understand that there are 256 pairs (512/2) of sine and
cosine values. Thus, i goes from 0 to 255.
The encoding of the first word (position = 0) is obtained by substituting pos = 0 into the formula: every sine term becomes sin(0) = 0 and every cosine term becomes cos(0) = 1. Thus, the positional encoding of the first word looks like [0, 1, 0, 1, ..., 1]. The positional encoding for the second word (position = 1) looks like [0.8414, 0.5403, 0.8218, ...]. If the embedding has 512 dimensions, each positional encoding vector also has 512 dimensions.
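The sinusoidal encoding is straightforward to compute. The following is a minimal NumPy sketch, assuming d = 512 as in the original architecture; it reproduces the values quoted above.

```python
import numpy as np

def positional_encoding(seq_len: int, d: int = 512) -> np.ndarray:
    """Sinusoidal positional encoding as described in 'Attention Is All You Need'."""
    pe = np.zeros((seq_len, d))
    pos = np.arange(seq_len)[:, None]          # positions 0 .. seq_len - 1
    i = np.arange(d // 2)[None, :]             # pair index 0 .. d/2 - 1
    angle = pos / np.power(10000, (2 * i) / d)
    pe[:, 0::2] = np.sin(angle)                # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=5)
print(pe[0, :4])  # first word:  [0. 1. 0. 1.]
print(pe[1, :4])  # second word: approximately [0.8415 0.5403 0.8218 0.5697]
```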
Model Input
As depicted in Figure 1.3, the model input is the pointwise addition of the positional encoding and the embedding vector. Let us understand how we achieve this.
To represent "I Live In New York" with a tokenized length of 5, we add one <pad> token.
At first, each token is represented by an integer. Here, the word I is represented by 8667, Live by 1362, In by 1300, New York by 1301, and <pad> by 0. The resulting tokenized sequence will be:
[8667, 1362, 1300, 1301, 0]
We then pass this tokenized sequence to the embedding layer. The embedding of each token is represented by a vector of 512 dimensions. In the example below, the dimension of the vector [embedding_token_8667] is 512.
Finally, we perform the pointwise addition of the embedding and the positional encoding before feeding the result into the model:

PositionalEncodingVector = [[size = 512], [size = 512], [size = 512], [size = 512], [size = 512]]
+
Embedding = [[embedding_token_8667], [embedding_token_1362], [embedding_token_1300], [embedding_token_1301], [embedding_token_0]]
=
ModelInput = [[size = 512], [size = 512], [size = 512], [size = 512], [size = 512]]
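Putting the pieces together, the following is a minimal PyTorch-style sketch of this step; the vocabulary size and the token ids (such as 8667 for I) are the hypothetical values used above.

```python
import torch
import torch.nn as nn

d_model, vocab_size, seq_len = 512, 50000, 5

# Hypothetical token ids for "I Live In New York" plus one <pad> token
token_ids = torch.tensor([[8667, 1362, 1300, 1301, 0]])         # shape: (1, 5)

# Embedding layer: each id becomes a 512-dimensional vector
embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)
emb = embedding(token_ids)                                       # shape: (1, 5, 512)

# Sinusoidal positional encoding for positions 0 .. 4
pos = torch.arange(seq_len).unsqueeze(1).float()                 # (5, 1)
i = torch.arange(d_model // 2).float()                           # (256,)
angle = pos / torch.pow(torch.tensor(10000.0), 2 * i / d_model)  # (5, 256)
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(angle)
pe[:, 1::2] = torch.cos(angle)

# Model input is the pointwise addition of embedding and positional encoding
model_input = emb + pe                                           # broadcast over the batch dim
print(model_input.shape)                                         # torch.Size([1, 5, 512])
```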
Encoding Layer
The encoder layer is a crucial component in the transformer architecture,
responsible for processing and encoding input sequences into vector
representations.
Let us understand each subcomponent of the encoder layer in detail:
- Input to the encoder: The input to the first layer of the encoder is the pointwise summation of the embeddings and the positional encoding.
- Multi-head attention: A key component of the encoder block in a transformer is the multi-head self-attention mechanism. This mechanism allows the model to weigh the importance of different parts of the input when making a prediction. In a later section, we will discuss the details of multi-head attention.
- Add and norm layer: The add layer, also known as the residual connection, adds the input of a sublayer to its output before passing the result to the next layer. This allows the model to learn the residual function, the difference between the input and the output, rather than the full function, which can help improve the performance of the model, especially when the number of layers is large. The norm layer normalizes the activations of a layer across all of its hidden units. This helps stabilize training by preventing activations from getting too large or too small, which can cause issues such as vanishing or exploding gradients.
- Feed forward: The output of the multi-head self-attention mechanism is fed into the feed-forward layer, and a non-linear activation function is applied. The feed-forward layer is important for extracting higher-level features from the data. There is another add and norm layer after the feed-forward layer, and its output is fed to the next encoder block.
- Encoder output: The last block of the encoder produces a sequence of vectors, which is then sent to the decoder blocks as features.
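To make the data flow through an encoder block concrete, here is a minimal PyTorch sketch of one encoder block. It follows the structure described above but is only an illustrative approximation, not the original implementation: the 8 heads and feed-forward width of 2048 follow the paper, while the use of nn.MultiheadAttention and ReLU is an implementation convenience.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal sketch of one transformer encoder block."""
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),                      # non-linear activation
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multi-head self-attention followed by add (residual) and norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward sublayer followed by add (residual) and norm
        x = self.norm2(x + self.ff(x))
        return x

block = EncoderBlock()
x = torch.randn(1, 5, 512)       # (batch, sequence length, d_model)
print(block(x).shape)            # torch.Size([1, 5, 512])
```

In a full encoder, several such blocks are stacked, and the output of the last block is passed to the decoder.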
Attention Mechanism
The attention mechanism has emerged as a versatile and powerful neural network component that allows models to weigh and prioritize relevant information in a given context. Its core concepts, self-attention and multi-headed attention, are instrumental in enabling the transformer architecture to achieve remarkable results. Let us delve into these concepts in more detail.
Self Attention
The self-attention mechanism is the key to the performance of the transformer. Let us understand how it works. Consider the two example sentences illustrated in Figure 1.5. What does "it" refer to in each sentence? We cannot answer just by looking at the location and structure of the sentence. According to Vaswani et al. (2017), meaning is a result of relationships between things, and self-attention is a general way of learning relationships.
Self-attention calculates a relationship weight between each pair of tokens in the input sentence. Through this mechanism, the model understands the meaning of the input sentence.
Let us look at the attention calculation for "it" in both sentences. Figure 1.5 demonstrates the calculation of relationship weights in the self-attention mechanism. In the first sentence, when processing the word "it", the model gives more weight to "rabbit" than to other words, whereas in the second sentence the model gives more weight to "tasty".
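The following is a minimal sketch of how such relationship weights can be computed. It implements scaled dot-product self-attention for a single head; the small dimensions and random projection matrices are hypothetical and chosen only for readability.

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, w_q, w_k, w_v) -> torch.Tensor:
    """Scaled dot-product self-attention for a single head."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project input to queries, keys, values
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # relationship scores between all token pairs
    weights = F.softmax(scores, dim=-1)            # each row of weights sums to 1
    return weights @ v                             # weighted sum of values

# Hypothetical tiny example: 5 tokens, model dimension 8, head dimension 4
x = torch.randn(5, 8)
w_q, w_k, w_v = (torch.randn(8, 4) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)   # torch.Size([5, 4])
```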
Multi Headed Attention
Instead of using just one attention head, the self-attention block uses multiple heads. Each head uses different parameters and a different focus to extract different features from the input.
Figure 1.6 depicts the same example again with two attention heads. Head1
is represented in the diagram in red, whereas Head2 is represented in yellow.
We can see that different heads are capturing different contextual
relationships.
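Conceptually, multi-head attention runs several attention heads of the kind sketched above in parallel and combines their outputs. The following minimal sketch uses PyTorch's built-in nn.MultiheadAttention as a convenient stand-in (8 heads over a 512-dimensional model, as in the original paper); the input tensor is illustrative.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8                 # 8 heads, each attending over 64 dimensions
mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

x = torch.randn(1, 5, d_model)              # (batch, sequence length, d_model)
out, attn_weights = mha(x, x, x)            # self-attention: queries, keys, values are all x

print(out.shape)            # torch.Size([1, 5, 512])
print(attn_weights.shape)   # torch.Size([1, 5, 5]): token-to-token weights, averaged over heads
```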
Decoder Layer
The decoder has a similar structure to the encoder but with an additional component called the masked self-attention mechanism. Let us look at the decoder architecture in detail.
- Decoder input: During training, the input to the first layer of the decoder is the pointwise summation of:
  - the embeddings of the target sequence, and
  - the positional encoding of the target sequence.
- Masked multi-head attention: The key difference between masked multi-head attention and regular multi-head attention is that in masked multi-head attention, certain parts of the input sequence are masked, or blocked, so that the decoder cannot see them when generating the output sequence. The positions of the input sequence that correspond to future target tokens (the tokens that have not been generated yet) are masked.
  - The decoder generates one word at a time.
  - We only show the words up to the current position; thus, the decoder cannot see the future targets that still need to be generated.
  - As shown in the diagram, the input to the masked multi-head attention is the vector generated in step 1, the pointwise summation of the embeddings of the target sequence and the positional encoding of the target sequence.
- Multi-head attention: As you can see in the diagram, the input to the multi-head attention mechanism in the decoder is typically the output of the encoder together with the previously generated tokens in the output sequence. All decoder blocks, not just the first, receive the output of the encoder because:
  - it ensures that information from the input sequence is propagated through the entire decoding process, and
  - the model is effectively regularized, since each decoder block has access to the same information; this can help prevent overfitting and improve the generalization performance of the model.
- Feed forward: The output of the multi-head self-attention mechanism is fed into the feed-forward layer, and a non-linear activation function is applied. The feed-forward layer is important for extracting higher-level features from the data.
- Linear layer: In the transformer architecture, the linear layer in the decoder is the component used to produce the final output of the decoder. Its input is the output of the final decoder layer. Additionally, the softmax activation function is applied to generate the probabilities of the next word in the sequence.
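The masking described above can be implemented with a causal mask that blocks attention from each position to all future positions. Here is a minimal single-head sketch with hypothetical small dimensions; it is meant only to illustrate the masking step, not the full decoder.

```python
import torch
import torch.nn.functional as F

seq_len, d_k = 5, 4
q = torch.randn(seq_len, d_k)     # queries for the target sequence
k = torch.randn(seq_len, d_k)     # keys
v = torch.randn(seq_len, d_k)     # values

scores = q @ k.T / d_k ** 0.5                              # raw attention scores
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(causal_mask, float("-inf"))    # block attention to future positions
weights = F.softmax(scores, dim=-1)

print(weights[0])   # position 0 can only attend to itself
print(weights[2])   # position 2 attends to positions 0, 1, 2; future positions get weight 0
out = weights @ v   # masked self-attention output for each target position
```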
Conclusion
Finally, transformers have undeniably revolutionized the field of artificial intelligence, setting a new benchmark for efficiency and performance. By delving into the intricacies of the transformer architecture, we uncover the profound impact of embeddings and positional encoding, which enable the model to grasp the context and sequence of data. Central to this architecture is the attention mechanism, particularly self-attention, which allows transformers to dynamically prioritize different parts of the input data, enhancing their understanding and generating more accurate outputs. Multi-headed attention further amplifies this capability by allowing the model to focus on multiple aspects of the data simultaneously. The decoder layer, integral to tasks such as language translation and text generation, orchestrates the process of transforming encoded information into coherent, meaningful output. Together, these components illustrate the transformative power of transformers, heralding a new era in AI that promises to continually push the boundaries of what machines can achieve.