Mastering Transformers
Imagine unlocking the secrets of human-like language comprehension in machines: this is the power of training transformer models. The training process of transformers, a marvel of modern AI, involves intricate steps that breathe life into these models, enabling them to generate coherent text, translate languages, and even answer complex questions with uncanny precision.
It’s a journey through vast datasets, sophisticated algorithms and immense computational power, all working in harmony to create models that understand and respond to human language in ways we once thought impossible. Welcome to the heart of transformer training, where innovation meets intelligence.
Table of Contents
- Training process of transformers
- Inference process of transformers
- Types of transformers and their applications
- Encoder-only model
- Decoder-only model
- Encoder-decoder model
Training Process of Transformers
The training process of a transformer for machine translation typically
includes the steps below:
- Data pre-processing and generating the positional encoding of the input and target sequences.
- Passing the sequences through the encoder and decoder layers.
- Loss calculation: the generated output sequence is compared to the target output sequence, and a loss value is calculated using a loss function such as cross entropy.
- Backpropagation: the gradients of the loss with respect to the model's parameters are calculated using backpropagation.
- Optimization: the model's parameters are updated using an optimization algorithm, such as Adam, to minimize the loss value.
These steps are repeated for multiple epochs until the model's performance on a validation set stabilizes or reaches a satisfactory level; a minimal sketch of this training loop appears after the list.
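To make these steps concrete, here is a minimal sketch of the loop in PyTorch. The tiny `TinySeq2SeqTransformer` model, the learned positional embedding, and the random toy batch standing in for aligned parallel data are all illustrative assumptions, not the setup of any particular paper.

```python
import torch
import torch.nn as nn

# Illustrative constants: padding id, vocabulary size, and model width.
PAD_IDX, VOCAB_SIZE, D_MODEL = 0, 1000, 64

class TinySeq2SeqTransformer(nn.Module):
    """A deliberately small encoder-decoder transformer for demonstration."""
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.tgt_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos = nn.Embedding(512, D_MODEL)  # learned positional encoding
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, src, tgt):
        # Add positional information to the token embeddings (step 1).
        src_x = self.src_emb(src) + self.pos(torch.arange(src.size(1)).unsqueeze(0))
        tgt_x = self.tgt_emb(tgt) + self.pos(torch.arange(tgt.size(1)).unsqueeze(0))
        # Causal mask so the decoder cannot peek at future target tokens.
        causal = torch.triu(torch.full((tgt.size(1), tgt.size(1)), float("-inf")), diagonal=1)
        return self.out(self.transformer(src_x, tgt_x, tgt_mask=causal))

model = TinySeq2SeqTransformer()
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)      # loss function (cross entropy)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam optimizer

# Toy batch of aligned source/target token ids, standing in for real parallel data.
src = torch.randint(3, VOCAB_SIZE, (8, 10))
tgt = torch.randint(3, VOCAB_SIZE, (8, 12))

for epoch in range(3):                          # repeat for multiple epochs
    logits = model(src, tgt[:, :-1])            # forward pass through encoder and decoder
    loss = criterion(logits.reshape(-1, VOCAB_SIZE), tgt[:, 1:].reshape(-1))  # loss calculation
    optimizer.zero_grad()
    loss.backward()                             # backpropagation of gradients
    optimizer.step()                            # parameter update to minimize the loss
```

Note the teacher forcing in the loop: the decoder receives the target shifted right (`tgt[:, :-1]`) and is trained to predict the next token (`tgt[:, 1:]`).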
It is also worth noting that during the training process, the model is exposed to large amounts of parallel text data, where the input and output sequences are already aligned, and the model learns to map the input sequence to the output sequence through the attention mechanism and linear layers.
Inference Process of Transformers
The inference process of a transformer typically includes the steps below:
- Data pre-processing and generating the positional encoding of the input. Note that during inference we do not have a target sequence.
- Passing the input through the encoder and decoder blocks. Note that the decoder input differs slightly between training and inference: during training, we pass the actual target sequence to the first decoder block, whereas during inference we pass the tokens inferred up to the current step, because no target sequence is available. A minimal greedy-decoding sketch follows this list.
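Continuing with the toy model from the training sketch above, the following example shows one common way to run inference, greedy autoregressive decoding; the `BOS_IDX` and `EOS_IDX` special-token ids are illustrative placeholders.

```python
import torch

# Placeholder special-token ids and a cap on the generated length.
BOS_IDX, EOS_IDX, MAX_LEN = 1, 2, 20

@torch.no_grad()
def greedy_translate(model, src):
    model.eval()
    # Unlike training, there is no target sequence: the decoder input starts
    # with only a beginning-of-sequence token.
    generated = torch.full((src.size(0), 1), BOS_IDX, dtype=torch.long)
    for _ in range(MAX_LEN):
        logits = model(src, generated)                 # encoder + decoder forward pass
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        # Append the newly inferred token and feed the whole prefix back in.
        generated = torch.cat([generated, next_token], dim=1)
        if (next_token == EOS_IDX).all():              # stop once every sequence ends
            break
    return generated

# Example usage with the toy model and a random source sequence of token ids.
print(greedy_translate(model, torch.randint(3, 1000, (1, 10))))
```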
Types of Transformers and Their Applications
So far, we have explained the transformer architecture for machine translation. However, there are many variations of the transformer. Let us review them.
Encoder-Only Model
It uses only the encoder layers of a transformer model. The attention layers can access all the words in the input sentence. The encoder-only model often has bidirectional attention and is called an auto-encoding model. Let us look at examples and applications of the encoder-only model.
Examples:
- Bidirectional Encoder Representations from Transformers (BERT): BERT is a pre-trained, encoder-only transformer model that has been trained on a large corpus of text and has been shown to be effective on a wide range of natural language processing tasks, including sentiment analysis, text classification, and question answering.
- A Lite BERT (ALBERT): ALBERT is a lightweight version of BERT that has been shown to achieve similar performance to BERT while using fewer computational resources.
Applications:
- Sentiment analysis: the encoder can be trained to extract features from a given text and predict its sentiment (positive, negative, or neutral); see the short example after this list.
- Text classification: the encoder can be trained to classify a given text into different categories, such as news, sports, politics, and so on.
- Named entity recognition: the encoder can be trained to identify entities such as people, organizations, and locations in a given text.
- Language modeling: the encoder can be trained to predict missing (masked) tokens in a sequence of tokens.
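As an illustration of the sentiment-analysis application, the short sketch below assumes the Hugging Face `transformers` library is installed; the `pipeline` helper downloads a default encoder-based checkpoint fine-tuned for sentiment classification.

```python
from transformers import pipeline

# Sentiment analysis with an encoder-only model; the pipeline downloads a
# default checkpoint fine-tuned for this task.
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers make language tasks surprisingly approachable."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```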
Decoder-Only Model
It uses only the decoder layers of a transformer architecture. The attention layers have access only to the tokens up to and including the current token. This type of model is often called an autoregressive model because it is trained to predict the next token in a sequence based on the previous tokens in the same sequence.
Examples:
- GPT from OpenAI.
- CTRL (A Conditional Transformer Language Model for Controllable Generation).
Applications:
- The decoder-only model is widely used for text generation and other natural language generation tasks; a short generation example follows.
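As a quick illustration of text generation with a decoder-only model, the sketch below again assumes the Hugging Face `transformers` library and uses the small, publicly available `gpt2` checkpoint as an example.

```python
from transformers import pipeline

# Text generation with a decoder-only model; "gpt2" is a small, publicly
# available example checkpoint.
generator = pipeline("text-generation", model="gpt2")
result = generator("The transformer architecture", max_new_tokens=30)
print(result[0]["generated_text"])
```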
Encoder-Decoder Model
Also called a sequence-to-sequence model, it uses both an encoder and a decoder. The original paper proposing the transformer describes an encoder-decoder model. The attention layers of the encoder have access to all tokens in the input sequence, whereas the attention layers of the decoder can see only the current and past tokens; future tokens are masked from the decoder's attention layers.
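This masking of future tokens can be illustrated with a small PyTorch snippet: entries set to negative infinity are added to the attention scores, so the softmax gives them zero weight and each position can only attend to itself and earlier positions. This is a generic sketch of the standard look-ahead mask, not code from the original paper.

```python
import torch

# A 5-token look-ahead mask: -inf entries are added to the attention scores,
# so each position can attend only to itself and earlier positions.
seq_len = 5
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
print(mask)
# Row i has -inf in columns j > i, hiding future tokens from the decoder.
```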
Examples:
- BART (a denoising autoencoder pre-trained for sequence-generation tasks).
Applications:
- Machine translation
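As a concrete illustration of machine translation with an encoder-decoder model, the sketch below assumes the Hugging Face `transformers` library and uses the small `t5-small` checkpoint, which supports English-to-French translation; any sequence-to-sequence checkpoint could be substituted.

```python
from transformers import pipeline

# Machine translation with an encoder-decoder model; "t5-small" is a small
# example checkpoint that supports English-to-French translation.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("Transformers changed machine translation.")[0]["translation_text"])
```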
Conclusion
The transformer architecture represents a pinnacle of innovation in artificial intelligence, with a training process designed to handle vast amounts of data and extract meaningful patterns through self-attention mechanisms. Once trained, the inference process enables these models to perform tasks with remarkable accuracy and efficiency. The versatility of transformers is showcased through the various types, each tailored to specific applications: encoder-only models excel in tasks like text classification and sentiment analysis, decoder-only models are powerful for text generation and auto-completion, and encoder-decoder models shine in translation and summarization tasks. Together, these models highlight the transformative impact of the transformer architecture across a broad spectrum of AI applications, driving advancements in how machines understand and generate human language.