Mastering Text Classification

Choosing the ultimate architecture for superior performance

A.I Hub
5 min read · Aug 10, 2024

In the fast-paced world of artificial intelligence, choosing the right architecture for text classification can be the difference between average results and groundbreaking performance. With the explosion of data and the increasing complexity of language tasks, finding the most appropriate model architecture is crucial for turning raw text into actionable insights. Whether you are aiming for precision in sentiment analysis or speed in spam detection, the architecture you choose will define the success of your project. Get ready to dive into the cutting-edge frameworks that are not just meeting, but exceeding, the demands of modern text classification.

Table of Contents

  • Most appropriate architecture for text classification
  • Text classification via fine-tuning a transformer
  • Handling long sequences

Most Appropriate Architecture For Text Classification


Generally, encoder-only models, i.e. BERT and its variations, are most appropriate for text classification, for the following reasons.

  1. Focus on understanding input text — BERT variations are designed to understand input tokens by creating contextualized embeddings for each token.
  2. Bidirectional context — BERT and its variations are pre-trained to understand bidirectional context, as opposed to autoregressive models such as GPT. This helps the model understand language from both directions.
  3. Efficiency — Encoder-only models can be used for text classification by simply adding a fully connected layer (see the sketch after this list), whereas decoder-only models may require more complex adaptations.
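
To make the third point concrete, here is a minimal sketch, assuming PyTorch and the Hugging Face transformers library, of an encoder-only model with a single fully connected layer on top of the [CLS] embedding. The model name, label count, and example texts are illustrative choices, not part of the original project.

```python
# A minimal sketch (assuming PyTorch + Hugging Face transformers) of an
# encoder-only model with a single fully connected layer on top of the
# [CLS] embedding. Model name, label count, and example texts are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EncoderClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0]   # contextual [CLS] embedding
        return self.classifier(cls_embedding)             # raw class logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = EncoderClassifier()
batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
print(model(batch["input_ids"], batch["attention_mask"]).shape)   # (2, num_labels)
```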

Text Classification via Fine-Tuning a Transformer


Figure 1.1 shows the outline of text classification by fine-tuning an existing language model.

Figure 1.1 - Text classification process
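
As a rough illustration of the outline in Figure 1.1, the following sketch fine-tunes bert-base-uncased for sentiment classification with the Hugging Face Trainer API. The dataset choice, subset sizes, hyperparameters, and output directory are assumptions made for this example, not the exact settings of any accompanying notebook.

```python
# A rough sketch of the Figure 1.1 outline using the Hugging Face Trainer API.
# Dataset, subset sizes, and hyperparameters are assumptions for illustration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")                      # labeled movie reviews
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)              # pre-trained encoder + new classification head

args = TrainingArguments(
    output_dir="bert-imdb",
    per_device_train_batch_size=16,
    num_train_epochs=2,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),  # small subset to keep the sketch quick
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(1000)),
    tokenizer=tokenizer,                            # enables dynamic padding via the default collator
)
trainer.train()
```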

Handling Long Sequences


The majority of transformer architectures have a maximum limit on the sequence length they can handle. For example, the maximum sequence length for BERT is 512 tokens. Transformers have a low maximum sequence length because of the quadratic complexity of the self-attention computation. Nevertheless, the real-world text data you find in your company can often be longer than the maximum sequence length a transformer model can handle. Thus, we need effective strategies for dealing with the maximum sequence length. In this section, we list a few of them.

  1. Truncate — If a sequence is longer than the model's maximum sequence length, you can simply truncate it. This is the easiest approach, but it may result in a loss of information and poor performance on tasks that require understanding the entire context.
  2. Chunking — Divide the long sequence into non-overlapping chunks and run self-attention on each chunk individually. You can combine the chunk outputs using various strategies, such as mean pooling, max pooling, or concatenation. This approach may lose contextual information that spans chunks. (A tokenizer-level sketch of the first two strategies follows this list.)
  3. Hierarchical approach — Create a hierarchical structure by dividing long sequences into sentences or paragraphs. Then encode each sentence or paragraph into a fixed-size representation. Afterward, perform attention over the sentence or paragraph representations. This allows the model to capture both local and global context.
  4. Custom architecture — Some transformers, such as Longformer (maximum sequence length of 4,096 tokens) and BigBird (also 4,096), are specifically designed to handle long sequences. These architectures combine local and global attention so that the overall complexity of the attention computation is not quadratic.
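
Here is a minimal sketch of the first two strategies, assuming the Hugging Face tokenizer for bert-base-uncased; the example text and chunk size are illustrative.

```python
# A minimal sketch of the first two strategies (truncation and chunking),
# assuming the Hugging Face tokenizer for bert-base-uncased. The example
# text and chunk size are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.model_max_length)                    # 512 for BERT

long_text = "An unusually long movie review. " * 300  # stand-in for real long text

# 1. Truncate: keep only the first 512 tokens.
truncated = tokenizer(long_text, truncation=True, max_length=512)
print(len(truncated["input_ids"]))                   # 512

# 2. Chunking: split the token ids into non-overlapping windows the model accepts.
token_ids = tokenizer(long_text, add_special_tokens=False)["input_ids"]
chunk_size = 510                                     # leave room for [CLS] and [SEP]
chunks = [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]
chunk_inputs = [tokenizer.build_inputs_with_special_tokens(c) for c in chunks]
print(len(chunk_inputs), [len(c) for c in chunk_inputs])
```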

In the real world, you will experiment with various approaches and weigh the importance of capturing the entire context against resource availability to choose the appropriate one. Here, we will do two projects that explore mechanisms for handling long sequences: document chunking and hierarchical attention.

Document Chunking — Project

In this project, we will fine-tune the bert-base-uncased model to predict sentiment on the IMDB dataset. The following steps provide an overview of the model architecture; a condensed sketch of the pipeline appears after the list.

  1. Divide long text into smaller chunks — The code splits extensive text
    data into smaller, more manageable chunks or sentences. This step is
    crucial for handling long sequences effectively.
  2. Process each chunk with BERT — Each of these smaller chunks is
    then individually processed through the BERT model. This processing
    generates a vector representation for each sentence, capturing its
    essential features and meaning.
  3. Create a composite representation — Finally, the code averages these
    vector representations from all chunks to form a single,
    comprehensive representation of the entire long text. This average
    representation encapsulates the overall context or sentiment of the
    text.
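
Here is a condensed sketch of the three steps above, not the notebook's full implementation; using each chunk's [CLS] vector as its representation and a chunk size of 510 tokens are assumptions made for illustration.

```python
# A condensed sketch of the three steps above (not the notebook's full
# implementation). Using each chunk's [CLS] vector and a chunk size of 510
# tokens are assumptions made for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_long_text(text, chunk_size=510):
    # Step 1: divide the token ids into non-overlapping chunks.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

    chunk_vectors = []
    with torch.no_grad():
        for chunk in chunks:
            # Step 2: process each chunk with BERT and keep its [CLS] vector.
            with_special = tokenizer.build_inputs_with_special_tokens(chunk)
            input_ids = torch.tensor([with_special])
            attention_mask = torch.ones_like(input_ids)
            outputs = encoder(input_ids=input_ids, attention_mask=attention_mask)
            chunk_vectors.append(outputs.last_hidden_state[:, 0])

    # Step 3: average the chunk vectors into one representation of the whole text.
    return torch.stack(chunk_vectors).mean(dim=0)

doc_vector = encode_long_text("A very long review about the film. " * 400)
print(doc_vector.shape)   # (1, hidden_size), ready for a classification head
```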

The complete end-to-end project implementation is provided in the accompanying notebook.

Hierarchical Attention — Project

Similar to the previous project, we will calculate the sentiment score on the IMDB dataset. However, instead of document chunking, we will use the hierarchical attention mechanism.

  1. Hierarchical attention — This model uses a two-level hierarchical attention mechanism:
  • Local attention
  • Global attention

Local Attention

Applied to individual sentences in a document to create sentence representations.

Global Attention

Applied to sentence representations to create a document representation.

2. Sentence Representation

  • Reshape the data to have dimensions (batch_size * num_sentences, seq_len) and pass input_ids and attention_mask to the ALBERT model to get the hidden states (outputs.last_hidden_state).
  • Apply the attention layer (self.attention) to the hidden states, followed by a softmax function, to compute the attention weights (attention_weights). These weights give the importance of each token.
  • Calculate the sentence representation by multiplying the hidden states with their corresponding attention weights and summing along the sequence dimension (torch.sum(attention_weights * hidden_states, dim=1)).
  • Reshape the sentence_representation tensor to have dimensions (batch_size, num_sentences, hidden_size).

3. Document Representation

  • Apply the attention layer (self.attention) to the sentence representations, followed by a softmax function, to compute the document-level attention weights (doc_attention_weights).
  • Follow the same method to create the document representation (a compact sketch of steps 2 and 3 is shown below).
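
Below is a compact sketch of steps 2 and 3, assuming an albert-base-v2 encoder, a single shared additive attention layer, and two output classes. Variable names mirror the description above; the notebook's implementation may differ in its details.

```python
# A compact sketch of steps 2 and 3, assuming an albert-base-v2 encoder, a
# single shared additive attention layer, and two output classes. Variable
# names mirror the description above; the notebook's implementation may differ.
import torch
import torch.nn as nn
from transformers import AutoModel

class HierarchicalAttentionClassifier(nn.Module):
    def __init__(self, model_name="albert-base-v2", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden_size = self.encoder.config.hidden_size
        self.attention = nn.Linear(hidden_size, 1)      # scores tokens and sentence vectors
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # input_ids, attention_mask: (batch_size, num_sentences, seq_len)
        batch_size, num_sentences, seq_len = input_ids.shape
        input_ids = input_ids.reshape(batch_size * num_sentences, seq_len)
        attention_mask = attention_mask.reshape(batch_size * num_sentences, seq_len)

        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

        # Local attention: weight tokens and sum them into one vector per sentence.
        attention_weights = torch.softmax(self.attention(hidden_states), dim=1)
        sentence_representation = torch.sum(attention_weights * hidden_states, dim=1)
        sentence_representation = sentence_representation.reshape(
            batch_size, num_sentences, -1)

        # Global attention: weight sentences and sum them into a document vector.
        doc_attention_weights = torch.softmax(self.attention(sentence_representation), dim=1)
        document_representation = torch.sum(
            doc_attention_weights * sentence_representation, dim=1)

        return self.classifier(document_representation)  # (batch_size, num_labels)

# Illustrative shapes: 2 documents, 4 sentences each, 64 tokens per sentence.
model = HierarchicalAttentionClassifier()
ids = torch.randint(0, 30000, (2, 4, 64))
mask = torch.ones_like(ids)
print(model(ids, mask).shape)   # (2, 2)
```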

The complete end-to-end project implementation is provided in the accompanying notebook.

Conclusion

As we wrap up our deep dive into the most appropriate architectures for text classification, it’s clear that the right choice can dramatically enhance your model’s performance. Fine-tuning transformers has emerged as a powerful strategy, enabling models to adapt to specific tasks with unparalleled precision. However, handling long sequences remains a challenge, which is why innovative approaches like document chunking and hierarchical attention are crucial.

These techniques not only allow us to manage extensive text inputs but also ensure that critical context is preserved, leading to more accurate and meaningful classifications. By integrating these advanced methods into your projects, you are not just building effective text classifiers; you are pushing the boundaries of what AI can achieve in understanding and processing language at scale.


Written by A.I Hub

We write about Data Science | Software Development | Machine Learning | Artificial Intelligence | Ethical Hacking and much more. Unleash your potential with us.
