Mastering Text Classification

Choosing the ultimate architecture for superior performance

A.I Hub
5 min read · Aug 10, 2024

In the fast-paced world of artificial intelligence, choosing the right architecture for text classification can be the difference between average results and groundbreaking performance. With the explosion of data and the increasing complexity of language tasks, finding the most appropriate model architecture is crucial for turning raw text into actionable insights. Whether you are aiming for precision in sentiment analysis or speed in spam detection, the architecture you choose will define the success of your project. Get ready to dive into the cutting-edge frameworks that are not just meeting, but exceeding, the demands of modern text classification.

Table of Contents

  • Most appropriate architecture for text classification
  • Text classification via fine-tuning a transformer
  • Handling long sequences

Most Appropriate Architecture For Text Classification


Generally, encoder-only models, i.e. BERT and its variations, are most appropriate for text classification, for the following reasons.

  1. Focus on understanding input text — BERT variations are designed to understand input tokens by creating contextualized embeddings for each token.
  2. Bidirectional context — BERT and its variations are pre-trained to understand bidirectional context, as opposed to autoregressive models such as GPT. This helps the model understand language from both directions.
  3. Efficiency — Encoder-only models can be used for text classification by simply adding a fully connected layer (see the sketch after this list), whereas decoder-only models may require more complex adaptations.
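
To make the third point concrete, here is a minimal sketch, assuming PyTorch and the Hugging Face transformers library, of an encoder-only model with a single fully connected layer on top of the [CLS] embedding. The model name, label count, and example texts are illustrative choices, not part of the original project.

```python
# A minimal sketch (assuming PyTorch + Hugging Face transformers) of an
# encoder-only model with a single fully connected layer on top of the
# [CLS] embedding. Model name, label count, and example texts are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EncoderClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0]   # contextual [CLS] embedding
        return self.classifier(cls_embedding)             # raw class logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = EncoderClassifier()
batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
print(model(batch["input_ids"], batch["attention_mask"]).shape)   # (2, num_labels)
```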

Text Classification via Fine-Tuning a Transformer


Figure 1.1 shows the outline of text classification by fine-tuning an existing language model.

Figure 1.1 - Text classification process
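
As a rough illustration of the outline in Figure 1.1, the following sketch fine-tunes bert-base-uncased for sentiment classification with the Hugging Face Trainer API. The dataset choice, subset sizes, hyperparameters, and output directory are assumptions made for this example, not the exact settings of any accompanying notebook.

```python
# A rough sketch of the Figure 1.1 outline using the Hugging Face Trainer API.
# Dataset, subset sizes, and hyperparameters are assumptions for illustration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")                      # labeled movie reviews
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)              # pre-trained encoder + new classification head

args = TrainingArguments(
    output_dir="bert-imdb",
    per_device_train_batch_size=16,
    num_train_epochs=2,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),  # small subset to keep the sketch quick
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(1000)),
    tokenizer=tokenizer,                            # enables dynamic padding via the default collator
)
trainer.train()
```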

Handling Long Sequences


The majority of transformer architectures have a maximum limit on the sequence length they can handle. For example, the maximum sequence length for BERT is 512 tokens. Transformers have a low maximum sequence length because of the quadratic complexity of the self-attention computation. Nevertheless, the real-world text data you find in your company can often be longer than the maximum sequence length a transformer model can handle. Thus, we need effective strategies for dealing with the maximum sequence length. In this section, we list a few of them.

  1. Truncate — If a sequence is longer than the model's maximum sequence length, you can simply truncate it. This is the easiest approach, but it may result in a loss of information and poor performance on tasks that require understanding the entire context.
  2. Chunking — Divide the long sequence into non-overlapping chunks and run self-attention on each chunk individually. You can combine the chunk outputs using various strategies, such as mean pooling, max pooling, or concatenation. This approach may lose contextual information that spans chunks. (A tokenizer-level sketch of the first two strategies follows this list.)
  3. Hierarchical approach — Create a hierarchical structure by dividing long sequences into sentences or paragraphs. Then encode each sentence or paragraph into a fixed-size representation. Afterward, perform attention over the sentence or paragraph representations. This allows the model to capture both local and global context.
  4. Custom architecture — Some transformers, such as Longformer (maximum sequence length of 4,096 tokens) and BigBird (also 4,096), are specifically designed to handle long sequences. These architectures combine local and global attention so that the overall complexity of the attention computation is not quadratic.
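
Here is a minimal sketch of the first two strategies, assuming the Hugging Face tokenizer for bert-base-uncased; the example text and chunk size are illustrative.

```python
# A minimal sketch of the first two strategies (truncation and chunking),
# assuming the Hugging Face tokenizer for bert-base-uncased. The example
# text and chunk size are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.model_max_length)                    # 512 for BERT

long_text = "An unusually long movie review. " * 300  # stand-in for real long text

# 1. Truncate: keep only the first 512 tokens.
truncated = tokenizer(long_text, truncation=True, max_length=512)
print(len(truncated["input_ids"]))                   # 512

# 2. Chunking: split the token ids into non-overlapping windows the model accepts.
token_ids = tokenizer(long_text, add_special_tokens=False)["input_ids"]
chunk_size = 510                                     # leave room for [CLS] and [SEP]
chunks = [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]
chunk_inputs = [tokenizer.build_inputs_with_special_tokens(c) for c in chunks]
print(len(chunk_inputs), [len(c) for c in chunk_inputs])
```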

In the real world, you will experiment with various approaches and weigh the importance of capturing the entire context against resource availability to choose the appropriate one. Here, we will do two projects that explore mechanisms for handling long sequences: document chunking and hierarchical attention.

Document Chunking — Project

In this project, we will fine-tune the bert-base-uncased model to predict sentiment on the IMDB dataset. The following steps provide an overview of the model architecture; a condensed sketch of the pipeline appears after the list.

  1. Divide long text into smaller chunks — The code splits extensive text
    data into smaller, more manageable chunks or sentences. This step is
    crucial for handling long sequences effectively.
  2. Process each chunk with BERT — Each of these smaller chunks is
    then individually processed through the BERT model. This processing
    generates a vector representation for each sentence, capturing its
    essential features and meaning.
  3. Create a composite representation — Finally, the code averages these
    vector representations from all chunks to form a single,
    comprehensive representation of the entire long text. This average
    representation encapsulates the overall context or sentiment of the
    text.
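
Here is a condensed sketch of the three steps above, not the notebook's full implementation; using each chunk's [CLS] vector as its representation and a chunk size of 510 tokens are assumptions made for illustration.

```python
# A condensed sketch of the three steps above (not the notebook's full
# implementation). Using each chunk's [CLS] vector and a chunk size of 510
# tokens are assumptions made for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_long_text(text, chunk_size=510):
    # Step 1: divide the token ids into non-overlapping chunks.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

    chunk_vectors = []
    with torch.no_grad():
        for chunk in chunks:
            # Step 2: process each chunk with BERT and keep its [CLS] vector.
            with_special = tokenizer.build_inputs_with_special_tokens(chunk)
            input_ids = torch.tensor([with_special])
            attention_mask = torch.ones_like(input_ids)
            outputs = encoder(input_ids=input_ids, attention_mask=attention_mask)
            chunk_vectors.append(outputs.last_hidden_state[:, 0])

    # Step 3: average the chunk vectors into one representation of the whole text.
    return torch.stack(chunk_vectors).mean(dim=0)

doc_vector = encode_long_text("A very long review about the film. " * 400)
print(doc_vector.shape)   # (1, hidden_size), ready for a classification head
```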

The complete end-to-end project implementation is provided in the accompanying notebook.

Hierarchical Attention — Project

Similar to the previous project, we will calculate the sentiment score on the IMDB dataset. However, instead of document chunking, we will use the hierarchical attention mechanism.

  1. Hierarchical attention — This model uses a two-level hierarchical attention mechanism:
  • Local attention
  • Global attention

Local Attention

Applied to individual sentences in a document to create sentence representations.

Global Attention

Applied to sentence representations to create a document representation.

2. Sentence Representation

  • Reshape the data to have dimensions (batch_size * num_sentences, seq_len) and pass input_ids and attention_mask to the ALBERT model to get the hidden states (outputs.last_hidden_state).
  • Apply the attention layer (self.attention) to the hidden states, followed by a softmax function, to compute the attention weights (attention_weights). These weights give the importance of each token.
  • Calculate the sentence representation by multiplying the hidden states with their corresponding attention weights and summing along the sequence dimension (torch.sum(attention_weights * hidden_states, dim=1)).
  • Reshape the sentence_representation tensor to have dimensions (batch_size, num_sentences, hidden_size).

3. Document Representation

  • Apply the attention layer (self.attention) to the sentence representations, followed by a softmax function, to compute the document-level attention weights (doc_attention_weights).
  • Follow the same method to create the document representation (a compact sketch of steps 2 and 3 is shown below).
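
Below is a compact sketch of steps 2 and 3, assuming an albert-base-v2 encoder, a single shared additive attention layer, and two output classes. Variable names mirror the description above; the notebook's implementation may differ in its details.

```python
# A compact sketch of steps 2 and 3, assuming an albert-base-v2 encoder, a
# single shared additive attention layer, and two output classes. Variable
# names mirror the description above; the notebook's implementation may differ.
import torch
import torch.nn as nn
from transformers import AutoModel

class HierarchicalAttentionClassifier(nn.Module):
    def __init__(self, model_name="albert-base-v2", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden_size = self.encoder.config.hidden_size
        self.attention = nn.Linear(hidden_size, 1)      # scores tokens and sentence vectors
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # input_ids, attention_mask: (batch_size, num_sentences, seq_len)
        batch_size, num_sentences, seq_len = input_ids.shape
        input_ids = input_ids.reshape(batch_size * num_sentences, seq_len)
        attention_mask = attention_mask.reshape(batch_size * num_sentences, seq_len)

        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

        # Local attention: weight tokens and sum them into one vector per sentence.
        attention_weights = torch.softmax(self.attention(hidden_states), dim=1)
        sentence_representation = torch.sum(attention_weights * hidden_states, dim=1)
        sentence_representation = sentence_representation.reshape(
            batch_size, num_sentences, -1)

        # Global attention: weight sentences and sum them into a document vector.
        doc_attention_weights = torch.softmax(self.attention(sentence_representation), dim=1)
        document_representation = torch.sum(
            doc_attention_weights * sentence_representation, dim=1)

        return self.classifier(document_representation)  # (batch_size, num_labels)

# Illustrative shapes: 2 documents, 4 sentences each, 64 tokens per sentence.
model = HierarchicalAttentionClassifier()
ids = torch.randint(0, 30000, (2, 4, 64))
mask = torch.ones_like(ids)
print(model(ids, mask).shape)   # (2, 2)
```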

The complete end-to-end project implementation is provided in the accompanying notebook.

Conclusion

As we wrap up our deep dive into the most appropriate architectures for text classification, it’s clear that the right choice can dramatically enhance your model’s performance. Fine-tuning transformers has emerged as a powerful strategy, enabling models to adapt to specific tasks with unparalleled precision. However, handling long sequences remains a challenge, which is why innovative approaches like document chunking and hierarchical attention are crucial.

These techniques not only allow us to manage extensive text inputs but also ensure that critical context is preserved, leading to more accurate and meaningful classifications. By integrating these advanced methods into your projects, you are not just building effective text classifiers; you are pushing the boundaries of what AI can achieve in understanding and processing language at scale.


Written by A.I Hub

We write about Data Science | Software Development | Machine Learning | Artificial Intelligence | Ethical Hacking and much more. Unleash your potential with us.
