Decoding Positional Encoding

The key to enhanced transformer performance

7 min read · Aug 7, 2024

Imagine a world where AI models can not only understand individual words but also grasp their order and context with exceptional clarity. This is the power of positional encoding, a crucial component in transformer models that enables them to comprehend the intricate structure of language.

By embedding positional information directly into the model, positional encoding ensures that transformers can capture the sequence of words, enhancing their ability to generate coherent and contextually accurate text. Welcome to the fascinating realm of positional encoding, where the sequence matters just as much as the words themselves.

Table of Contents

Positional encoding
Masking
Encoder component of a transformer
Decoder component of a transformer

Positional Encoding

PyTorch does not have a built-in positional encoding module, so let us
write a class for it. The positional encoding module accepts the embedding
tensor and returns it with the positional encoding information added.
Importantly, the PyTorch encoder expects, by default, data in the form
[sequence length, batch size, embedding dimension]. Thus, the input and
output of the positional encoding (PE) module should adhere to that layout.
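
For reference, the class that follows appears to implement the standard sinusoidal scheme from the original transformer paper ("Attention Is All You Need"). The symbols here are notation chosen for this note, not names from the code: pos is the token position, i indexes a sine/cosine pair, and d is the embedding dimension (dim_embedding in the code).

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)

% The code computes the inverse frequency in log space, which is the same quantity:
\frac{1}{10000^{2i/d}} = \exp\!\left(-\frac{2i \,\ln 10000}{d}\right)
```

The second identity is why the code multiplies torch.arange(0, dim_embedding, 2) by -math.log(10000.0) / dim_embedding before exponentiating.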

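import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):

    def __init__(self, dim_embedding, dropout=0.1, max_seq_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Table of encodings, one row per position: [max_seq_len, dim_embedding]
        positional_encoding = torch.zeros(max_seq_len, dim_embedding)
        # Column vector of positions: [max_seq_len, 1]
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        # One inverse frequency per sin/cos pair: 10000^(-2i / dim_embedding)
        denom_term = torch.exp(torch.arange(0, dim_embedding, 2).float() * (-math.log(10000.0) / dim_embedding))

        # Even columns take the sine, odd columns the cosine (assumes an even dim_embedding)
        positional_encoding[:, 0::2] = torch.sin(position * denom_term)
        positional_encoding[:, 1::2] = torch.cos(position * denom_term)

        # Reshape to [max_seq_len, 1, dim_embedding] so it broadcasts over the batch
        positional_encoding = positional_encoding.unsqueeze(0).transpose(0, 1)
        # A buffer follows the module across devices but is not a trainable parameter
        self.register_buffer('positional_encoding', positional_encoding)

    def forward(self, x):
        # x: [seq_len, batch_size, dim_embedding]
        x = x + self.positional_encoding[:x.size(0), :]
        return self.dropout(x)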

Explanation:

The unsqueeze(1) call on position changes the tensor's shape from
[max_seq_len] to [max_seq_len, 1]. This is required so that position
broadcasts against denom_term in the element-wise multiplications inside
torch.sin and torch.cos, giving one angle per position and frequency.

The unsqueeze(0).transpose(0, 1) operation changes the shape of the
positional encoding tensor to match the expected input shape of the
transformer model.

unsqueeze(0): This operation adds an extra dimension at position 0. If
the original shape of the positional encoding tensor is [max_seq_len,
dim_embedding], after unsqueeze(0) the shape becomes [1, max_seq_len,
dim_embedding]. This turns the 2D tensor into a 3D tensor with a batch
dimension of size 1.

transpose(0, 1): This operation swaps the first two dimensions of the
tensor, so the shape [1, max_seq_len, dim_embedding] becomes
[max_seq_len, 1, dim_embedding]. The transposition makes the positional
encoding tensor compatible with the input shape that the transformer
expects, which is [sequence length, batch size, embedding dimension].
The size-1 batch dimension then broadcasts over the actual batch.
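
The shape changes described above can be traced step by step. The following is a minimal sketch, not the notebook's code: the sizes (6 positions, 4 embedding dimensions, a batch of 3) and the variable names angles, pe_3d and pe_seq_first are made up for illustration.

```python
import math

import torch

max_seq_len, dim_embedding = 6, 4  # small illustrative sizes (assumed)

# [max_seq_len] -> [max_seq_len, 1], so positions broadcast against frequencies
position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
# One inverse frequency per sin/cos pair: [dim_embedding / 2]
denom_term = torch.exp(
    torch.arange(0, dim_embedding, 2).float() * (-math.log(10000.0) / dim_embedding)
)
# Element-wise broadcast, not a matrix product: [max_seq_len, dim_embedding / 2]
angles = position * denom_term

pe = torch.zeros(max_seq_len, dim_embedding)
pe[:, 0::2] = torch.sin(angles)  # even columns
pe[:, 1::2] = torch.cos(angles)  # odd columns

pe_3d = pe.unsqueeze(0)               # [1, max_seq_len, dim_embedding]
pe_seq_first = pe_3d.transpose(0, 1)  # [max_seq_len, 1, dim_embedding]

# Stand-in for a batch of embeddings: seq_len=5, batch_size=3
x = torch.zeros(5, 3, dim_embedding)
out = x + pe_seq_first[: x.size(0), :]  # size-1 batch dim broadcasts over 3

for name, t in [("position", position), ("angles", angles), ("pe_3d", pe_3d),
                ("pe_seq_first", pe_seq_first), ("out", out)]:
    print(name, tuple(t.shape))
```

Because the input here is all zeros, every sequence in the batch ends up holding exactly the same positional values, which makes the broadcast over the batch dimension easy to see.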

Examples of using the positional encoder are provided in the
accompanying notebook.

Masking

Masking is a crucial concept in the transformer architecture: it hides
specific tokens from the attention mechanism during processing. A thorough
understanding of masking is essential to create an accurate transformer
model. The masking parameters described here appear across PyTorch's
transformer modules, so it is important to have a good grasp of them before
delving into actual model development.
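
To make "hiding" concrete, here is a minimal sketch of what an additive attention mask does. It assumes the usual formulation in which the mask is added to the raw attention scores before the softmax; the scores are random stand-ins, not the output of a real model.

```python
import torch

torch.manual_seed(0)

# Made-up raw attention scores for a sequence of 3 tokens (rows attend to columns)
scores = torch.randn(3, 3)

# Additive mask: 0 keeps a position, -inf hides it (here: everything after the current token)
mask = torch.triu(torch.full((3, 3), float("-inf")), diagonal=1)

# exp(-inf) is 0, so hidden positions receive exactly zero attention weight
weights = torch.softmax(scores + mask, dim=-1)
print(weights)
```

Each row still sums to 1, but all of the weight is spread over the positions that were left visible.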

tgt_mask: An optional tensor of shape (seq_len, seq_len), where seq_len is
the target sequence length, representing the mask for the target sequence.
It is used to prevent the decoder from attending to future tokens. The
format should be:

tensor([[0., -inf, -inf], [0., 0., -inf], [0., 0., 0.]], device='mps:0')

In the above example, seq_len=3, and -inf signifies the positions that
need to be masked.

memory_mask: An optional tensor of shape (seq_len, src_seq_len)
representing the mask for the encoder output sequence (the memory). It
controls which encoder output positions each target position may attend
to, and uses the same format:

tensor([[0., -inf, -inf], [0., 0., -inf], [0., 0., 0.]], device='mps:0')

In the above example, seq_len=3 and src_seq_len=3, and -inf again
signifies the positions that need to be masked. Usually, you will not mask
the memory. In that case, you pass a mask of all zeros (or leave
memory_mask as None):

tensor([[0., 0., 0.], [0., 0., 0.], [0., 0., 0.]], device='mps:0')

tgt_key_padding_mask: An optional tensor of shape (batch_size, seq_len)
representing the mask for padding tokens in the target sequence.

tensor([[False, False, False],
        [False, False, False],
        [False, True, False],
        [True, True, False]], device='mps:0')

In the above example, batch_size=4 and seq_len=3. True signifies that the
token is a padding token and is masked. False signifies that the token is
not a padding token and is not masked.
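
The masks shown above can be built with a few tensor operations. This is a minimal sketch rather than the notebook's implementation: the helper names causal_mask and padding_mask, the padding id 0, and the token ids are all assumptions made for illustration.

```python
import torch


def causal_mask(size: int) -> torch.Tensor:
    """Float mask with 0 on and below the diagonal and -inf above it."""
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)


def padding_mask(token_ids: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Boolean mask of shape (batch_size, seq_len); True marks padding tokens."""
    return token_ids == pad_id


PAD_ID = 0  # hypothetical padding id; use whatever your vocabulary defines

# Made-up batch of 4 sequences of length 3, laid out as (batch_size, seq_len)
token_ids = torch.tensor([
    [5, 6, 7],
    [8, 9, 4],
    [5, PAD_ID, 7],
    [PAD_ID, PAD_ID, 7],
])

tgt_mask = causal_mask(token_ids.size(1))                 # (seq_len, seq_len)
tgt_key_padding_mask = padding_mask(token_ids, PAD_ID)    # (batch_size, seq_len)
memory_mask = torch.zeros(token_ids.size(1), token_ids.size(1))  # nothing hidden

print(tgt_mask)
print(tgt_key_padding_mask)
```

Note that the padding mask is laid out batch-first, (batch_size, seq_len), even when the token tensor passed to the model is sequence-first.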

The accompanying notebook illustrates how to implement masking while you
create your PyTorch model.

Encoder Component of a Transformer

There are many use cases where you need only the encoder of the
transformer. Examples include sentiment analysis, text classification,
named entity recognition (NER), and the like. PyTorch therefore gives us
the flexibility of using just the encoder. Below is an example of a simple
classification model using TransformerEncoder and TransformerEncoderLayer.

import math

import torch
import torch.nn as nn


class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, nhead, num_layers, num_classes):
        super(TextClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # PositionalEncoding is the class defined in the previous section
        self.positional_encoding = PositionalEncoding(embedding_dim)
        self.encoder_layer = nn.TransformerEncoderLayer(embedding_dim, nhead)
        self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers)
        self.fc = nn.Linear(embedding_dim, num_classes)

        self.embedding_dim = embedding_dim
        self.init_weights()

    def init_weights(self) -> None:
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)
        for layer in self.encoder.layers:
            nn.init.xavier_uniform_(layer.self_attn.out_proj.weight)
            nn.init.zeros_(layer.self_attn.out_proj.bias)
            nn.init.xavier_uniform_(layer.linear1.weight)
            nn.init.zeros_(layer.linear1.bias)
            nn.init.xavier_uniform_(layer.linear2.weight)
            nn.init.zeros_(layer.linear2.bias)
        self.fc.bias.data.zero_()
        self.fc.weight.data.uniform_(-initrange, initrange)

    def forward(self, x, key_padding_mask=None):
        # x: token ids of shape (sequence_length, batch_size)
        x = self.embedding(x) * math.sqrt(self.embedding_dim)
        x = self.positional_encoding(x)
        x = self.encoder(x, src_key_padding_mask=key_padding_mask)

        # Mean pooling over the sequence dimension: (seq, batch, emb) -> (batch, emb)
        x = x.mean(dim=0)

        # Fully connected layer for classification
        x = self.fc(x)
        # Sigmoid maps each class score into the range (0, 1)
        x = torch.sigmoid(x)
        return x

Analyzing:

The two statements that assign self.encoder_layer and self.encoder
construct a TransformerEncoder in two steps.

First, we define a single encoder block using
TransformerEncoderLayer(embedding_dim, nhead). An important consideration
while choosing nhead: embedding_dim must be evenly divisible by nhead,
that is, embedding_dim % nhead should be zero.

Second, we create the entire encoder by instantiating TransformerEncoder
and passing the TransformerEncoderLayer along with the number of encoder
blocks to be used.

The init_weights method handles weight initialization, which supports
efficient learning and improved convergence. This is an essential
component; without it, you may notice exploding gradients or slow
convergence.

In the last steps of forward, the pooled output of the last encoder block
is passed to the fully connected layer for classification.

It is interesting to understand what x.mean(dim=0) does. After passing
the input through the embedding, positional encoding and transformer
encoder layers, the tensor x has a shape of (sequence_length, batch_size,
embedding_dim). We want to create a fixed-size representation of the
entire sequence to feed into the fully connected (FC) layer for
classification. One simple way to do this is to average the embeddings of
all tokens in the sequence, which is called mean pooling. To perform mean
pooling, we use the mean() function with the argument dim=0, which
calculates the mean along the sequence dimension. This reduces the tensor
shape from (sequence_length, batch_size, embedding_dim) to (batch_size,
embedding_dim).
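
The shape bookkeeping above can be checked with a small standalone sketch. It skips the embedding and positional encoding and feeds random vectors straight into the encoder; all sizes are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumed); embedding_dim must be divisible by nhead
seq_len, batch_size, embedding_dim, nhead = 5, 2, 16, 4
heads_divide_evenly = embedding_dim % nhead == 0

encoder_layer = nn.TransformerEncoderLayer(embedding_dim, nhead)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# Stand-in for embedded tokens: (sequence_length, batch_size, embedding_dim)
x = torch.randn(seq_len, batch_size, embedding_dim)

encoded = encoder(x)          # shape is unchanged by the encoder
pooled = encoded.mean(dim=0)  # average over the sequence dimension

print(tuple(encoded.shape), "->", tuple(pooled.shape))
```

One design note: a plain mean also averages over any padded positions. Whether to exclude them from the pooling is a separate choice that the model above does not make.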

The end-to-end implementation of text classification with the IMDB dataset
is provided in the accompanying notebook.

Decoder Component of a Transformer

There are also many use cases where you need only the decoder of the
transformer. Some examples are text generation, code generation, and music
generation. PyTorch therefore provides the functionality of using just the
transformer's decoder. Below is an example of a simple text generation
model using just TransformerDecoderLayer and TransformerDecoder.

import torch.nn as nn


class TransformerDecoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_layers, dropout):
        super().__init__()
        # PositionalEncoding is the class defined in the first section
        self.memory_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.memory_pos_encoder = PositionalEncoding(embedding_dim, dropout)
        self.tgt_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.tgt_pos_encoder = PositionalEncoding(embedding_dim, dropout)
        # nhead is fixed at 8 here, so embedding_dim must be divisible by 8
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=embedding_dim, nhead=8, dim_feedforward=2048, dropout=dropout),
            num_layers=num_layers,
        )
        self.fc = nn.Linear(embedding_dim, vocab_size)
        self.d_model = embedding_dim

    def forward(self, tgt, memory=None, tgt_mask=None, memory_mask=None,
                memory_key_padding_mask=None, tgt_key_padding_mask=None):
        # Embed, scale by sqrt(d_model), then add positional information
        tgt = self.tgt_embedding(tgt) * self.d_model ** 0.5
        tgt = self.tgt_pos_encoder(tgt)
        # memory defaults to None in the signature but must be provided: it is embedded here
        memory = self.memory_embedding(memory) * self.d_model ** 0.5
        memory = self.memory_pos_encoder(memory)
        output = self.decoder(tgt=tgt, memory=memory, tgt_mask=tgt_mask, memory_mask=memory_mask,
                              memory_key_padding_mask=memory_key_padding_mask,
                              tgt_key_padding_mask=tgt_key_padding_mask)
        # Project to vocabulary size: one score (logit) per token at each position
        output = self.fc(output)
        return output

Let us now delve into a discussion of the code snippet provided above and
examine its functionality.

This model is a transformer-based, decoder-only language model. It takes
as input a target sequence (tgt) and a memory sequence (memory) and
generates an output sequence of the same length as the target sequence.

The input target sequence is first passed through an embedding layer and a
positional encoding layer. Similarly, the input memory sequence is passed
through its own embedding layer and positional encoding layer.

During training:
memory: the training data, of shape (seq_len, batch_size).
tgt: the target sequence, which is the input sequence shifted by one
position.

These processed sequences are then fed into the transformer decoder, which
consists of multiple transformer decoder layers. Each decoder layer applies
multi-head self-attention over the target sequence, multi-head attention
over the memory, and a feed-forward neural network.

Finally, the output of the transformer decoder is passed through a fully
connected linear layer to generate the final output sequence. Each element
of the sequence holds one score (logit) per vocabulary token; applying a
softmax turns these scores into a probability distribution over the
vocabulary of the target language.
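
The flow described above, from decoder to linear layer to probabilities, can be sketched without the embedding layers. This is an illustrative example under stated assumptions, not the notebook's code: the sizes are made up, random vectors stand in for the embedded tgt and memory, and dropout is set to 0 so the causality check at the end is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumed)
vocab_size, d_model, nhead = 50, 16, 4
tgt_len, src_len, batch_size = 4, 6, 2

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, dim_feedforward=32, dropout=0.0),
    num_layers=2,
)
fc = nn.Linear(d_model, vocab_size)
decoder.eval()

# Stand-ins for embedded, position-encoded sequences: (seq_len, batch_size, d_model)
tgt = torch.randn(tgt_len, batch_size, d_model)
memory = torch.randn(src_len, batch_size, d_model)

# Causal mask: each target position may attend only to itself and earlier positions
tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

with torch.no_grad():
    logits = fc(decoder(tgt=tgt, memory=memory, tgt_mask=tgt_mask))
    probs = torch.softmax(logits, dim=-1)  # distribution over the vocabulary

    # Causality check: replace every target position after the first one
    tgt_changed = tgt.clone()
    tgt_changed[1:] = torch.randn(tgt_len - 1, batch_size, d_model)
    logits_changed = fc(decoder(tgt=tgt_changed, memory=memory, tgt_mask=tgt_mask))

print(tuple(logits.shape))
print(torch.allclose(logits[0], logits_changed[0], atol=1e-5))
```

With the causal mask in place, the output at the first position should not change when later target tokens change, which is the property tgt_mask exists to guarantee.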

Conclusion

In summary, positional encoding is a foundational part of transformer models: it gives them the ability to understand the order and structure of language, which is crucial for generating contextually accurate outputs. Coupled with masking, transformers can handle sequences of varying lengths and keep the attention mechanism from looking at padding or future tokens. The encoder component of a transformer encodes input data into meaningful representations, while the decoder component uses such representations to generate coherent and relevant output. Together, positional encoding, masking and the encoder-decoder architecture underpin the advances transformers have driven in natural language processing and in applications across diverse fields.