Unleashing the Power of Tabular Data
In the dynamic landscape of artificial intelligence, where innovation is the norm, the Tab Transformer architecture is a game-changer for tabular data processing. While traditional methods have long dominated this field, Tab Transformers are redefining what’s possible by harnessing the power of transformers to unlock deeper insights, higher accuracy, and new levels of flexibility. This architecture is not just an incremental improvement but a bold leap forward, transforming how we approach one of the most structured and essential forms of data. Welcome to the future of tabular data processing, where Tab Transformers are rewriting the rules.
Table of Contents
- Tab Transformer architecture
- FT Transformer architecture
- Feature tokenizer
- Concatenation of numerical and categorical features
- Transformer
Tab Transformer Architecture
The fundamental concept anchoring the Tab Transformer is the generation of
contextual embeddings for categorical variables. Let us delve into the details
of this architecture.
- Categorical embeddings — Each categorical feature, denoted as x_i, is transformed into a parametric embedding of dimension d using a process known as column embedding.
- Transformer encoder — These embeddings of the categorical features are then passed to a transformer encoder, which treats each categorical feature as a token, or “word”, in a sequence. This enables the model to learn complex interactions between the different categorical features.
- Contextual embeddings — Inside the transformer encoder, a self-attention mechanism develops contextual embeddings for the categorical variables. Self-attention lets the model weigh the importance and interaction of each categorical feature with every other feature within a given instance (row). This is pivotal because it allows the model to capture complex interdependencies among the categorical features.
- Concatenation of contextual embeddings and normalized numerical variables — Once the transformer has created contextual embeddings for the categorical variables, these are concatenated with the normalized numerical variables. The result is a comprehensive feature set in which both categorical and numerical variables are taken into account, with the former enriched by the contextual information captured by the transformer.
- Multilayer perceptron (MLP) — The concatenated data is then passed to an MLP for the final prediction. The MLP serves as the final classifier or regressor, depending on the specific task.
- Pretraining and fine-tuning — Like many successful transformer-based models, TabTransformer employs a two-step process of pretraining and fine-tuning. During pretraining, the model is trained on a large dataset with a reconstruction objective, learning to predict masked (hidden) columns. Once this pretraining step is complete, the model is fine-tuned on the specific task, optimizing for the target objective, for example classification or regression.
By leveraging the strengths of transformer architectures for handling categorical features in tabular data, the Tab Transformer can effectively model intricate feature relationships, leading to high-performance predictions. See Figure 1.1 for a visual representation of the TabTransformer’s architecture; a minimal code sketch of the same flow follows.
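To make the flow above concrete, here is a minimal PyTorch sketch of the TabTransformer forward pass, assuming per-column embedding tables, a standard transformer encoder, layer-normalized numerical inputs, and an MLP head. The class name, layer sizes, and feature counts are illustrative assumptions, not the reference implementation.

```python
# Minimal sketch of the TabTransformer forward pass described above (PyTorch).
import torch
import torch.nn as nn

class TabTransformerSketch(nn.Module):
    def __init__(self, cat_cardinalities, num_numerical, d=32, n_heads=4, n_layers=2):
        super().__init__()
        # Column embedding: one lookup table per categorical feature.
        self.cat_embeddings = nn.ModuleList(
            [nn.Embedding(cardinality, d) for cardinality in cat_cardinalities]
        )
        encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.num_norm = nn.LayerNorm(num_numerical)  # normalization of numerical variables
        # MLP head over [contextual categorical embeddings, normalized numerics].
        self.mlp = nn.Sequential(
            nn.Linear(len(cat_cardinalities) * d + num_numerical, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x_cat, x_num):
        # x_cat: (batch, n_cat) integer category ids; x_num: (batch, n_num) floats.
        tokens = torch.stack(
            [emb(x_cat[:, j]) for j, emb in enumerate(self.cat_embeddings)], dim=1
        )                                   # (batch, n_cat, d)
        contextual = self.encoder(tokens)   # contextual embeddings via self-attention
        flat = contextual.flatten(1)        # (batch, n_cat * d)
        features = torch.cat([flat, self.num_norm(x_num)], dim=1)
        return self.mlp(features)           # final classifier/regressor output

model = TabTransformerSketch(cat_cardinalities=[5, 10, 3], num_numerical=4)
out = model(torch.randint(0, 3, (8, 3)), torch.randn(8, 4))
print(out.shape)  # torch.Size([8, 1])
```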
FT Transformer Architecture
The main idea is to create embeddings for both numerical and categorical features and pass them to the transformer encoder. This approach yields a more contextually rich representation of the input data than the Tab Transformer, because self-attention is computed across both numerical and categorical features, whereas the Tab Transformer applies self-attention only to the categorical features. Let us now go over each component in detail.
Feature Tokenizer
The feature tokenizer module is the component of the FT-Transformer model that is responsible for converting input features into embeddings.
As shown in Figure 1.2, the conversion into embeddings happens differently for numerical and categorical data.
- Numerical features — For each numerical feature x_j, the transformation involves an element-wise multiplication of the feature value x_j by a learned weight vector W_j, followed by the addition of a bias term b_j. This is represented as T_j = b_j + x_j · W_j. The multiplication by W_j allows the model to scale and adjust the influence of the numerical feature, while the bias term b_j gives the model a base representation of the feature from which adjustments can be made. For numerical features, W_j is a weight vector whose dimensionality equals the desired dimensionality d of the feature embeddings, that is, W_j^(num) ∈ R^d.
- Categorical features — For each categorical feature x_j, the transformation involves a lookup in an embedding table W_j for the category in x_j, followed by the addition of a bias term b_j. A one-hot vector e_j is used to perform the lookup, retrieving the embedding of the specific category. This is represented as T_j = b_j + e_j^T · W_j. This method effectively gives each category of a feature its own embedding in the d-dimensional space. For categorical features, W_j is an embedding lookup table: if S_j denotes the number of unique categories of the j-th categorical feature, then the lookup table W_j has dimensions S_j × d, that is, W_j^(cat) ∈ R^(S_j×d).
Therefore, in the resulting embeddings, each feature, whether numerical or
categorical, is represented in the same d-dimensional space, which makes it possible to process them uniformly in the subsequent Transformer stages of
the model.
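The two transformations can be written down compactly in code. Below is a minimal PyTorch sketch of the feature tokenizer, assuming a learned weight vector and bias per numerical feature and an embedding table plus bias per categorical feature; the class and parameter names are illustrative assumptions, not the library's API.

```python
# Minimal sketch of the feature tokenizer: T_j = b_j + x_j * W_j for numerical
# features and T_j = b_j + e_j^T W_j (an embedding lookup) for categorical ones.
import torch
import torch.nn as nn

class FeatureTokenizerSketch(nn.Module):
    def __init__(self, num_numerical, cat_cardinalities, d=32):
        super().__init__()
        # Numerical: one weight vector W_j in R^d and bias b_j in R^d per feature.
        self.W_num = nn.Parameter(torch.randn(num_numerical, d) * 0.01)
        self.b_num = nn.Parameter(torch.zeros(num_numerical, d))
        # Categorical: one lookup table W_j in R^{S_j x d} and bias b_j per feature.
        self.W_cat = nn.ModuleList([nn.Embedding(S_j, d) for S_j in cat_cardinalities])
        self.b_cat = nn.Parameter(torch.zeros(len(cat_cardinalities), d))

    def forward(self, x_num, x_cat):
        # x_num: (batch, n_num) floats; x_cat: (batch, n_cat) integer ids.
        # Scale each learned vector W_j by the feature value and add the bias.
        num_tokens = self.b_num + x_num.unsqueeze(-1) * self.W_num   # (batch, n_num, d)
        # Look up the category's row in W_j and add the bias.
        cat_tokens = torch.stack(
            [emb(x_cat[:, j]) for j, emb in enumerate(self.W_cat)], dim=1
        ) + self.b_cat                                               # (batch, n_cat, d)
        return torch.cat([num_tokens, cat_tokens], dim=1)            # (batch, n_num + n_cat, d)

tokens = FeatureTokenizerSketch(num_numerical=4, cat_cardinalities=[5, 3])(
    torch.randn(8, 4), torch.randint(0, 3, (8, 2))
)
print(tokens.shape)  # torch.Size([8, 6, 32])
```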
Concatenation of Numerical and Categorical Features
The numerical and categorical feature embeddings are concatenated into a single sequence, denoted T. A [CLS] token is then added at the beginning of the sequence, so the input to the transformer is T = stack([CLS], T).
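Here is a minimal sketch of this stacking step in PyTorch, assuming the tokenizer output shape from the previous sketch; torch.cat along the token dimension plays the role of stack, with the learned [CLS] embedding placed first.

```python
# Minimal sketch: prepend a learned [CLS] token to the tokenized feature sequence.
import torch
import torch.nn as nn

d = 32
T = torch.randn(8, 6, d)                               # (batch, n_features, d) from the feature tokenizer
cls_token = nn.Parameter(torch.zeros(1, 1, d))         # learned [CLS] embedding
T_input = torch.cat([cls_token.expand(T.shape[0], -1, -1), T], dim=1)
print(T_input.shape)                                   # torch.Size([8, 7, 32])
```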
Transformer
The input sequence is processed by the transformer encoder, which mirrors the original transformer design proposed by Vaswani and colleagues. A classification or regression head, depending on the task at hand, is attached to the first token emanating from the final layer of the transformer encoder. Figure 1.2 depicts the architecture of the FT Transformer.
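The encoder-plus-head stage can be sketched as follows in PyTorch, assuming the stacked sequence from the previous step; the layer sizes are illustrative assumptions, and only the first ([CLS]) token of the final layer feeds the head.

```python
# Minimal sketch of the transformer encoder stage and the prediction head on the [CLS] token.
import torch
import torch.nn as nn

d, n_heads, n_layers, out_dim = 32, 4, 2, 1
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True),
    num_layers=n_layers,
)
head = nn.Linear(d, out_dim)          # classification or regression head, depending on the task

T_input = torch.randn(8, 7, d)        # (batch, 1 + n_features, d): [CLS] plus feature tokens
encoded = encoder(T_input)            # self-attention across all tokens, numerical and categorical
prediction = head(encoded[:, 0])      # read only the first ([CLS]) token of the final layer
print(prediction.shape)               # torch.Size([8, 1])
```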
Conclusion
As we navigate the cutting-edge innovations in the Tab Transformer and FT Transformer architectures, it’s clear that the future of tabular data processing is being rewritten. The feature tokenizer and the seamless concatenation of numerical and categorical features have changed the way we handle structured data, breaking free from the constraints of traditional methods. With transformers at the core, these architectures merge the strengths of both worlds, delivering strong performance and insight. This convergence not only enhances data representation but also sets a new standard for accuracy and efficiency in machine learning tasks. The fusion of these technologies marks a pivotal moment in AI, where the power of transformers is fully unleashed, transforming how we interpret and utilize tabular data in ways that are both groundbreaking and game-changing.