Mastering Transformers

Unveiling the secrets of their training process

A.I Hub
5 min read · Aug 5, 2024

Imagine unlocking the secrets of human-like language comprehension in machines. This is the power of training transformer models. The training process of transformers, a marvel of modern AI, involves intricate steps that breathe life into these models, enabling them to generate coherent text, translate languages and even answer complex questions with uncanny precision.

It’s a journey through vast datasets, sophisticated algorithms and immense computational power, all working in harmony to create models that understand and respond to human language in ways we once thought impossible. Welcome to the heart of transformer training, where innovation meets intelligence.

Table of Contents

  • Training process of transformers
  • Inference process of transformers
  • Types of transformers and their applications
  • Encoder-only model
  • Decoder-only model
  • Encoder-decoder model

Training Process of Transformers


The training process of a transformer for machine translation typically
includes the steps below:

  1. Data pre-processing and generating positional encodings for the input and target sequences (a minimal positional-encoding sketch follows this list).
  2. Passing the sequences through the encoder and decoder layers.
  3. Loss calculation — The generated output sequence is compared to the target output sequence, and a loss value is calculated using a loss function such as cross-entropy.
  4. Backpropagation — The gradients of the loss with respect to the model’s parameters are calculated using backpropagation.
  5. Optimization — The model’s parameters are updated using an optimization algorithm, such as Adam, to minimize the loss value.
  6. Steps 2-5 are repeated for multiple epochs until the model’s performance on a validation set stabilizes or reaches a satisfactory level.
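As a concrete illustration of step 1, here is a minimal sketch of the sinusoidal positional encoding described in the original transformer paper, written in PyTorch. The function name and tensor shapes are illustrative assumptions, not part of any particular library.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # One row per position, one column per embedding dimension (d_model assumed even).
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # sine on even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # cosine on odd dimensions
    return pe

# The encoding is simply added to the token embeddings before the first layer:
# embeddings = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```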

It is also worth noting that during the training process, the model is exposed to large amounts of parallel text data, where the input and output sequences are already aligned, and the model learns to map the input sequence to the output sequence through the attention mechanism and linear layers. Feeding the aligned target sequence to the decoder during training is commonly referred to as teacher forcing.
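To make steps 2-5 concrete, below is a minimal sketch of a single teacher-forced training step in PyTorch. It assumes a generic nn.Transformer encoder-decoder with toy dimensions; the names (embed, out_proj, src, tgt) and hyperparameters are illustrative, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

# Illustrative toy setup: an encoder-decoder transformer over a small vocabulary.
vocab_size, d_model, pad_id = 1000, 128, 0
embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(d_model=d_model, nhead=8, batch_first=True)
out_proj = nn.Linear(d_model, vocab_size)
criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
optimizer = torch.optim.Adam(
    list(embed.parameters()) + list(model.parameters()) + list(out_proj.parameters()),
    lr=1e-4,
)

# Dummy parallel batch: (batch, seq_len) token ids for aligned source and target sentences.
src = torch.randint(1, vocab_size, (32, 20))
tgt = torch.randint(1, vocab_size, (32, 22))

# Teacher forcing: the decoder receives the target shifted right and predicts the next token.
tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
causal_mask = model.generate_square_subsequent_mask(tgt_in.size(1))

hidden = model(embed(src), embed(tgt_in), tgt_mask=causal_mask)        # step 2: encoder + decoder
logits = out_proj(hidden)
loss = criterion(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))  # step 3: cross-entropy loss
optimizer.zero_grad()
loss.backward()                                                        # step 4: backpropagation
optimizer.step()                                                       # step 5: Adam update
```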

Inference Process of Transformers


The inference process of a transformer typically includes the steps below:

  1. Data pre-processing and generating positional encodings for the input. Note that during inference we do not have a target sequence.
  2. Passing the sequence through the encoder and decoder blocks. Note that the decoder input differs slightly between training and inference: during training, we pass the actual target sequence to the first decoder block, whereas during inference we pass the tokens that have been inferred up to the current state, precisely because no target sequence is available (see the decoding sketch after this list).
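The sketch below illustrates this loop as greedy autoregressive decoding, reusing the illustrative embed, model and out_proj modules from the training sketch above; the special-token ids and maximum length are assumptions.

```python
import torch

bos_id, eos_id, max_len = 1, 2, 50  # assumed start/end-of-sequence ids and length limit

@torch.no_grad()
def greedy_decode(src_ids: torch.Tensor) -> list:
    """Decode one source sentence of shape (1, src_len) by feeding back predicted tokens."""
    memory = model.encoder(embed(src_ids))      # encode the source sentence once
    generated = [bos_id]                        # decoder input starts from the start token only
    for _ in range(max_len):
        tgt_in = torch.tensor([generated])      # tokens inferred up to the current state
        causal_mask = model.generate_square_subsequent_mask(tgt_in.size(1))
        hidden = model.decoder(embed(tgt_in), memory, tgt_mask=causal_mask)
        next_id = out_proj(hidden[:, -1]).argmax(dim=-1).item()  # most likely next token
        generated.append(next_id)
        if next_id == eos_id:                   # stop once the end-of-sequence token is produced
            break
    return generated
```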

Types of Transformers and Their Applications


So far, we have explained the architecture of the transformer for machine translation. Nonetheless, there are many variations of the transformer. Let us review them.

Encoder-Only Model

This model uses only the encoder stack of a transformer. The attention layers can access all the words in the input sentence. Encoder-only models therefore have bi-directional attention and are often called auto-encoding models. Let us look at examples and applications of the encoder-only model.

Examples:

  • Bidirectional Encoder Representations from Transformers (BERT) — BERT is a pre-trained encoder-only transformer that has been trained on a large corpus of text and has been shown to be effective in a wide range of natural language processing tasks, including sentiment analysis, text classification and question answering.
  • A Lite BERT (ALBERT) — ALBERT is a lightweight version of BERT that has been shown to achieve performance similar to BERT while using fewer computational resources.

Applications:

  • Sentiment analysis — The encoder can be trained to extract features from a given text and predict its sentiment as positive, negative or neutral.
  • Text classification — The encoder can be trained to classify a given text
    into different categories, such as news, sports, politics and so on.
  • Named entity recognition — The encoder can be trained to identify
    entities such as people, organizations and locations in a given text.
  • Language modeling — The encoder can be trained to predict masked tokens in a sequence (masked language modeling).
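As a quick illustration of the sentiment-analysis use case, here is a short sketch using the Hugging Face transformers pipeline with a DistilBERT checkpoint (an encoder-only model) fine-tuned on sentiment data; any comparable encoder-only checkpoint would work.

```python
from transformers import pipeline

# Encoder-only (BERT-style) model fine-tuned for binary sentiment classification.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("Transformers make language understanding tasks far easier."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```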

Decoder-Only Model


This model uses only the decoder stack of a transformer. The attention layers only have access to the tokens up to and including the current one. Models of this type are often called autoregressive models because they are trained to predict the next token in a sequence based on the previous tokens in the same sequence.

Examples:

  • GPT (Generative Pre-trained Transformer) from OpenAI.
  • CTRL (A Conditional Transformer Language Model for Controllable Generation).

Applications:

  • The decoder-only model is widely used for text generation and other natural language generation tasks, as sketched below.
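A short sketch of the text-generation use case with a decoder-only model, using the publicly available GPT-2 checkpoint through the Hugging Face transformers pipeline:

```python
from transformers import pipeline

# Decoder-only (GPT-style) model: generates text autoregressively, one token at a time.
generator = pipeline("text-generation", model="gpt2")

result = generator("The training process of a transformer", max_new_tokens=30)
print(result[0]["generated_text"])
```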

Encoder-Decoder Model


Also called the sequence-to-sequence model, this architecture uses both an encoder and a decoder. The original paper proposing the transformer describes an encoder-decoder model. The attention layers of the encoder have access to all tokens in the input sequence, whereas the attention layers of the decoder see only the current and past tokens; the future tokens are masked from the decoder's attention.
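This masking of future tokens can be expressed as an upper-triangular matrix of -inf values that is added to the attention scores before the softmax; a minimal PyTorch sketch:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Positions above the diagonal (future tokens) get -inf, so softmax gives them zero weight.
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

print(causal_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
```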

Examples:

  • BART (a denoising autoencoder for pre-training sequence-to-sequence generation models).

Applications:

  • Machine translation, as sketched below.
  • Text summarization.
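A short sketch of the machine-translation use case with an encoder-decoder checkpoint, again via the Hugging Face transformers pipeline; T5 is used here purely as one example of a publicly available sequence-to-sequence model.

```python
from transformers import pipeline

# Encoder-decoder (sequence-to-sequence) model used for English-to-German translation.
translator = pipeline("translation_en_to_de", model="t5-small")

print(translator("The transformer architecture changed natural language processing."))
# e.g. [{'translation_text': '...'}]
```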

Conclusion

Finally, the transformer architecture represents a pinnacle of innovation in artificial intelligence, with its training process intricately designed to handle vast amounts of data and extract meaningful patterns through self-attention mechanisms. Once trained, the inference process enables these models to perform tasks with remarkable accuracy and efficiency. The versatility of transformers is showcased through their various types, each tailored to specific applications: encoder-only models excel in tasks like text classification and sentiment analysis, decoder-only models are powerful for text generation and auto-completion, and encoder-decoder models shine in translation and summarization tasks. Together, these models highlight the transformative impact of the transformer architecture across a broad spectrum of AI applications, driving advancements in how machines understand and generate human language.
