Mastering Data Preparation
Imagine the thrill of transforming raw and unstructured data into a polished, insightful treasure trove ready for analysis. Data preparation is the unsung hero of the data science world, where the real magic happens behind the scenes. It’s the meticulous process of cleaning, organizing, and structuring data to ensure that every bit of information is primed for powerful insights and accurate predictions. As we dive into the art and science of data preparation, get ready to unlock the secrets to turning data chaos into crystal-clear clarity, setting the stage for transformative discoveries and impactful decisions.
Table of Contents
- Data Preparation
- Training
- Chatbot with Transformer
Data Preparation
The dataset printout in the code block below shows that the dataset is a single long sequence of text (num_rows: 1). At the end of data preparation, each item of the dataset class contains the following fields.
- input_ids — The tokenized input text chunk. We divide the whole text into chunks of, for example, 100 tokens.
- attention_mask — A binary mask indicating which tokens the model should attend to during the forward pass.
- labels — The input_ids shifted by one position (see the small sketch below). This is crucial for text-generation fine-tuning, as the training objective is to predict the next token given the tokens at the current position.
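As a purely illustrative sketch of how labels relate to input_ids (the token IDs below are made up; 50256 is GPT-2's eos_token_id):
# Illustrative only: how labels relate to input_ids for next-token prediction
input_ids = [101, 202, 303, 404]   # made-up token IDs for one chunk
labels = input_ids[1:] + [50256]   # shifted left by one; eos_token_id appended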
Let us examine the most important aspect of this code: the split_text function splits the entire text into chunks of max_length=100 characters each, and the tokenizer call then converts those chunks into padded, truncated batches of token IDs.
import torch
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset
from transformers import GPT2Tokenizer
# Load the GPT-2 tokenizer and set the padding token
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
# Load the dataset
dataset = load_dataset("tiny_shakespeare")
'''
DatasetDict({
    train: Dataset({features: ['text'], num_rows: 1})
    validation: Dataset({features: ['text'], num_rows: 1})
    test: Dataset({features: ['text'], num_rows: 1})
})
'''
# Split the continuous text into smaller chunks
def split_text(text, max_length=100):
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]
# Apply the split_text function to the dataset
split_texts = split_text(dataset["train"]["text"][0])
# Tokenize the split_texts
tokenized_texts = tokenizer(split_texts, return_tensors="pt", padding=True, truncation=True)
Let us take a closer look at the key aspect of the next code block. The ShiftedDataset class demonstrates the custom dataset preparation process. Our primary objective in fine-tuning is to feed in text and predict the next token. As a result, the input_ids consist of tokenized text chunks, and the labels are the input_ids shifted by one position. Additionally, we append an eos_token_id at the end of the labels.
class ShiftedDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        input_ids = self.encodings["input_ids"][idx]
        attention_mask = self.encodings["attention_mask"][idx]
        # Shift input_ids by one position and append the eos token as the last label
        labels = input_ids[1:].tolist() + [tokenizer.eos_token_id]
        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": torch.tensor(labels)}

    def __len__(self):
        return len(self.encodings["input_ids"])
# Create a DataLoader
train_dataset = ShiftedDataset(tokenized_texts)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=4)
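To sanity-check the pipeline, it helps to peek at a single batch; the snippet below is a minimal check (the exact sequence length depends on the padding applied during tokenization above).
# Inspect one batch from the DataLoader
batch = next(iter(train_dataloader))
print(batch["input_ids"].shape)   # e.g. torch.Size([4, seq_len]) with batch_size=4
print(batch["labels"].shape)      # same shape; labels are input_ids shifted by one position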
Training
In this code block, we are just preparing the data loader, model, and optimizer for the accelerator. Another important point is that we use the LMHeadModel variant, in this case GPT2LMHeadModel, when fine-tuning GPT-2 for text generation tasks, for the following reasons.
- The LMHeadModel is designed explicitly for language modeling tasks, which involve predicting the next token in a sequence of tokens. In the case of GPT-2, GPT2LMHeadModel is tailored for such tasks, making it suitable for text generation, where the model needs to generate coherent sequences of text.
- The GPT2LMHeadModel adds a linear layer on top of the transformer for next-word prediction.
from accelerate import Accelerator
from transformers import GPT2LMHeadModel
# Initialize the Accelerator
accelerator = Accelerator()
# Configure the training arguments
num_epochs = 20
learning_rate = 5e-5
# Initialize the GPT-2 model and optimizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# Prepare the model and optimizer for training with Accelerator
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
The important aspect of the next code block is that we save the model every five epochs, for the following reasons.
- Checkpointing — Saving the model periodically creates checkpoints, allowing you to resume training from the latest saved epoch.
- Early stopping — If performance on the validation set starts degrading, we can implement early stopping (a small illustrative helper is sketched below).
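The training loop that follows only performs checkpointing; early stopping is not implemented there. As a minimal illustrative sketch (the EarlyStopper class and its patience value are made up for this example, and it assumes you compute a validation loss at the end of each epoch):
# Illustrative early-stopping helper (not used by the training loop below)
class EarlyStopper:
    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        # Reset the counter when validation loss improves; otherwise count bad epochs
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience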
from tqdm import tqdm
# Fine-tuning loop
for epoch in range(num_epochs):
    epoch_iterator = tqdm(train_dataloader, desc=f"Epoch {epoch + 1}")
    for step, batch in enumerate(epoch_iterator):
        optimizer.zero_grad()
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["labels"]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        if step % 500 == 0:
            epoch_iterator.set_postfix({"Loss": loss.item()}, refresh=True)
    # Save the model every 5 epochs
    if (epoch + 1) % 5 == 0:
        # Provide your own file path where you want to save the model
        model_save_path = f"<your_save_path>/tiny_shakespeare/model_checkpoint_epoch_{epoch + 1}"
        model.save_pretrained(model_save_path)
        print(f"Model saved at epoch {epoch + 1}")
The model is ready now; you can use it to write a poem as if it were written by Shakespeare. The end-to-end implementation of the model, together with the inference pipeline, is included in the accompanying notebook.
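As a rough sketch of what such an inference step could look like (the prompt, sampling settings, and checkpoint path placeholder below are illustrative; the accompanying notebook remains the reference implementation):
# Reload a saved checkpoint (replace the placeholder with your actual save path)
model = GPT2LMHeadModel.from_pretrained("<your_save_path>/tiny_shakespeare/model_checkpoint_epoch_20")
model.eval()

prompt = "Shall I compare thee"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation in the style learned from tiny_shakespeare
output_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=100,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))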
Chatbot with Transformer
In this section, we will develop a tool similar to ChatGPT for your organization. This type of model is known as an instruction-following model, and we will delve into the reasons why it is essential for your organization. An instruction-following model is designed to comprehend and carry out tasks based on natural language instructions. These models often form the foundation for chatbots, as they allow the systems to understand and respond to user instructions in a human-like manner. Let us explore why incorporating an instruction-following model is crucial for your organization’s competitive advantage.
- Customized chatbot — You can create a transformer model tailored to your organization’s data. Systems like ChatGPT do not possess your organizational data and context.
- Security and privacy — Your organization’s chatbot will remain within your firewall, ensuring that no data leaves your organization’s network.
Under the hood, instruction-following models encompass various types of transformer models, such as QA, TAPAS, summarization, and more. Nevertheless, we will implement the instruction-following model using only a transformer fine-tuned for QA tasks. You can build upon this concept to include other types of transformers.
Clinical Question Answering Transformer — Project
1. Setup and Installation
First, make sure you have the required libraries installed. You can use pip to install them.
pip install transformers
pip install datasets
pip install torch
2. Prepare the Data
For this project, we will use the SQuAD dataset from Hugging Face’s datasets library, which contains question-answer pairs. Although it is not specifically clinical, it serves as a good starting point. You can replace it with a clinical QA dataset if one is available.
from datasets import load_dataset
# Load the SQuAD dataset
dataset = load_dataset('squad')
print(dataset['train'][0])
3. Tokenization and Dataset Preparation
We will use the BERT tokenizer to prepare our data for the model.
from transformers import BertTokenizer
# Load the pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
def preprocess_function(examples):
    return tokenizer(
        examples['question'],
        examples['context'],
        truncation=True,
        padding='max_length',
        max_length=384,
        return_tensors='pt'
    )
# Tokenize the dataset
tokenized_datasets = dataset.map(preprocess_function, batched=True)
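One caveat before training: for QA fine-tuning, the Trainer also needs answer-span labels (start_positions and end_positions), which the simple preprocessing above does not produce. The sketch below shows one way they could be derived from the SQuAD answers field; it assumes a fast tokenizer (for example, BertTokenizerFast), since offset mappings require one.
from transformers import BertTokenizerFast

# A fast tokenizer is required for offset mappings and sequence_ids()
fast_tokenizer = BertTokenizerFast.from_pretrained('bert-large-uncased')

def preprocess_with_positions(examples):
    tokenized = fast_tokenizer(
        examples['question'],
        examples['context'],
        truncation='only_second',
        padding='max_length',
        max_length=384,
        return_offsets_mapping=True,
    )
    start_positions, end_positions = [], []
    for i, offsets in enumerate(tokenized['offset_mapping']):
        answer = examples['answers'][i]
        start_char = answer['answer_start'][0]
        end_char = start_char + len(answer['text'][0])
        sequence_ids = tokenized.sequence_ids(i)
        # Token range covered by the context (sequence id 1)
        context_start = sequence_ids.index(1)
        context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)
        if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
            # Answer not fully inside the (possibly truncated) context: label the [CLS] token
            start_positions.append(0)
            end_positions.append(0)
        else:
            idx = context_start
            while idx <= context_end and offsets[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = context_end
            while idx >= context_start and offsets[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)
    tokenized['start_positions'] = start_positions
    tokenized['end_positions'] = end_positions
    tokenized.pop('offset_mapping')
    return tokenized

# This would replace the simpler mapping above when training with span labels
tokenized_datasets = dataset.map(preprocess_with_positions, batched=True)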
4. Model Preparation
We will use a pre-trained BERT model with a question-answering head.
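A minimal loading snippet, assuming the same bert-large-uncased checkpoint used for the tokenizer:
from transformers import BertForQuestionAnswering

# Load a pre-trained BERT model with a span-prediction (question answering) head
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased')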
5. Training the Model
Set up the training arguments and train the model using the Trainer API.
from transformers import Trainer, TrainingArguments
# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    evaluation_strategy="epoch",     # evaluation strategy to use
    learning_rate=2e-5,              # learning rate
    per_device_train_batch_size=4,   # batch size for training
    per_device_eval_batch_size=4,    # batch size for evaluation
    num_train_epochs=3,              # number of training epochs
    weight_decay=0.01,               # strength of weight decay
)
# Initialize Trainer
trainer = Trainer(
    model=model,                                   # the instantiated Transformers model to be trained
    args=training_args,                            # training arguments, defined above
    train_dataset=tokenized_datasets['train'],     # training dataset
    eval_dataset=tokenized_datasets['validation']  # evaluation dataset
)
# Train the model
trainer.train()
6. Evaluate the Model
Evaluate the performance of your model on the validation set.
# Evaluate the model
eval_results = trainer.evaluate()
print(eval_results)
7. Make Predictions
Use the fine-tuned model to answer questions based on provided contexts.
# Example context and question
context = "The immune system is a network of cells, tissues, and organs that work together to defend the body against pathogens."
question = "What is the immune system?"
# Tokenize the input
inputs = tokenizer(question, context, return_tensors='pt')
# Generate the answer
outputs = model(**inputs)
start_position = outputs.start_logits.argmax()
end_position = outputs.end_logits.argmax()
# Convert token IDs to text
answer_tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][start_position:end_position + 1])
answer = tokenizer.convert_tokens_to_string(answer_tokens)
print(f"Question: {question}")
print(f"Answer: {answer}")
8. Save and Load the Model
Save the fine-tuned model for future use.
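A minimal sketch for this step (the ./clinical_qa_model directory name is only an example):
# Save the fine-tuned model and tokenizer for future use
model.save_pretrained('./clinical_qa_model')
tokenizer.save_pretrained('./clinical_qa_model')

# Load them back later
from transformers import BertForQuestionAnswering, BertTokenizer
model = BertForQuestionAnswering.from_pretrained('./clinical_qa_model')
tokenizer = BertTokenizer.from_pretrained('./clinical_qa_model')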
Conclusion
In this chapter, we walked through data preparation, revisited how transformer models work, and saw in practice how a fine-tuned model produces better results for our prompts. We also created a clinical question answering system using a pre-trained BERT model. This project demonstrates the power of transformers in understanding and responding to complex questions based on given contexts. You can further enhance this system by incorporating domain-specific clinical datasets and fine-tuning the model for more accurate and relevant responses in medical contexts.