Revolutionizing Communication with Cutting-Edge Speech Processing
Imagine a world where machines understand and respond to human speech as effortlessly as a conversation between friends. Speech processing has transcended the realm of science fiction to become a transformative force in technology. From virtual assistants that anticipate your needs to real-time language translation that bridges communication gaps, speech processing is at the forefront of innovation, revolutionizing how we interact with machines. Buckle up as we explore how this groundbreaking technology is reshaping our digital landscape and opening new frontiers in human-computer interaction.
Table of Contents
- Speech processing
- Develop classifier by fine-tuning BERT-base-uncased
- Custom dataset class
- Data loader
- Inference
Speech Processing
Here is a list of some widely used pre-trained speech processing models:
- Wav2Vec 2.0 — A transformer-based model for self-supervised speech recognition that learns speech representations directly from raw audio data (a minimal usage sketch follows this list).
- Conformer — A hybrid model that combines convolutional, recurrent and self-attention mechanisms, used for various speech processing tasks such as automatic speech recognition and keyword spotting.
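To make the Wav2Vec 2.0 entry concrete, here is a minimal transcription sketch using the Hugging Face transformers library. It is illustrative only: the facebook/wav2vec2-base-960h checkpoint and the 16 kHz mono waveform input are assumptions, not part of the classifier we build later in this chapter.
# minimal sketch: transcribing a 16 kHz mono waveform with Wav2Vec 2.0
# (checkpoint name is an illustrative assumption)
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def transcribe(waveform, sampling_rate=16000):
    # waveform: 1-D float array of audio samples
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = asr_model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    # decode the predicted token IDs back into text
    return processor.batch_decode(predicted_ids)[0]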
Develop Classifier by Fine-Tuning BERT-base-uncased
We have covered the basics of transfer learning. Now, let us create a classifier by fine-tuning the BERT-base-uncased model. We will build a real news vs. fake news detection engine. We want to demonstrate how this pipeline can be adapted to your organization's specific needs. Instead of using a pre-built dataset, we will download a dataset from Kaggle and use it in our fine-tuning process. This approach will help illustrate how the pipeline can be tailored to work with custom datasets in real-world applications. Figure 1.1 shows an outline of the fine-tuning process.
The steps below guide you through the fine-tuning process:
1. Import required libraries and packages — This code snippet imports what is needed to train a sequence classification model with the Hugging Face transformers library and PyTorch.
# importing necessary packages/libraries
import pandas as pd
from sklearn.model_selection import train_test_split
# metrics used later during evaluation
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from accelerate import Accelerator
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from tqdm import tqdm
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
# note: newer transformers versions deprecate this import; torch.optim.AdamW is the drop-in replacement
from transformers import AdamW
from transformers import get_scheduler
Now, let us set up the device. The code below defines a function get_device() that checks the available hardware (CUDA, Apple Metal Performance Shaders, or CPU) and returns the appropriate device for PyTorch tensor operations.
def get_device():
    # prefer CUDA, then Apple MPS, otherwise fall back to the CPU
    if torch.cuda.is_available():
        device = "cuda"
    elif torch.backends.mps.is_available():
        device = "mps"
    else:
        device = "cpu"
    return device

device = get_device()
print(device)
2. Load dataset — First, download the dataset from Kaggle using the link below, then perform the data cleaning. In this code, we carry out the operations listed after the link.
https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset
- Reading data from two CSV files, True.csv (real news) and Fake.csv (fake news).
- Cleaning and preprocessing the data in each CSV file.
- Concatenating both data frames into a single data frame.
- The resulting data frame contains two columns: text for the news content and label for its corresponding category (real or fake).
# replace the paths below with wherever you saved True.csv and Fake.csv
real = pd.read_csv('path/to/True.csv')
fake = pd.read_csv('path/to/Fake.csv')

# keep only the article text and add a numeric label (1.0 = real, 0.0 = fake)
real = real.drop(['title', 'subject', 'date'], axis=1)
real['label'] = 1.0
fake = fake.drop(['title', 'subject', 'date'], axis=1)
fake['label'] = 0.0

# combine both frames and keep a 10% random sample
dataframe = pd.concat([real, fake], axis=0, ignore_index=True)
df = dataframe.sample(frac=0.1).reset_index(drop=True)
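As an optional sanity check on the resulting data frame (not part of the original pipeline), you can confirm that only the text and label columns remain and inspect the class balance of the 10% sample:
# optional sanity check on the combined, subsampled data frame
print(df.columns.tolist())         # expected: ['text', 'label']
print(df['label'].value_counts())  # counts of real (1.0) vs. fake (0.0) samples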
3. Load pre-trained tokenizer — We will use bert-base-uncased as our pre-trained model for fine-tuning. It is therefore essential to use the corresponding tokenizer to ensure that the input data is properly processed and compatible with the model. If an incorrect tokenizer is used, the data fed into the model will be inadequate or incorrect, negatively affecting the training process and resulting in suboptimal performance.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
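Before preparing the full dataset, it can help to see what the tokenizer returns. This optional check uses a made-up sentence and a short max_length purely for illustration:
# optional: inspect the tokenizer output for a short illustrative sentence
sample = tokenizer.encode_plus(
    "Breaking news: this is a sample headline.",
    max_length=16,
    padding="max_length",
    truncation=True,
)
print(list(sample.keys()))   # ['input_ids', 'token_type_ids', 'attention_mask']
print(sample["input_ids"])   # token IDs padded with 0s up to max_length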
4. Prepare dataset — Data preparation for the BERT-base-uncased model involves tokenizing the text, mapping tokens to input_ids, creating attention masks, and preparing the label tensors. Each item of the dataset class should be a dictionary with the structure below.
{'input_ids': torch.Tensor(), 'attention_mask': torch.Tensor(), 'labels': torch.Tensor()}
Let us discuss the components of the above dictionary:
- input_ids — Each token from the tokenized text is mapped to an ID using BERT's vocabulary. The resulting input IDs should be a tensor or array of shape (batch_size, max_sequence_length).
- attention_mask — The attention mask differentiates between actual tokens and padding tokens. It has the same shape as the input_ids tensor, that is, (batch_size, max_sequence_length). The mask has 1s for actual tokens and 0s for padding tokens.
- labels — The labels tensor contains the true class for each example in the dataset. It usually has shape (batch_size,); in this pipeline, because we one-hot encode the classes, it has shape (batch_size, num_classes).
This code illustrates the data processing. The output of this code is three tensors, input_ids, attention_mask and labels, for both the training and the validation datasets.
# create a list of tuples; each tuple is (text, label)
data = list(zip(df['text'].tolist(), df['label'].tolist()))

# this function takes a list of texts and a list of labels
# and returns input_ids, attention_masks and labels_out as tensors
def tokenize_and_encode(texts, labels):
    input_ids, attention_masks, labels_out = [], [], []
    for text, label in zip(texts, labels):
        encoded = tokenizer.encode_plus(text, max_length=512, padding='max_length', truncation=True)
        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])
        labels_out.append(label)
    return torch.tensor(input_ids), torch.tensor(attention_masks), torch.tensor(labels_out)

# separate the tuples into two lists: a) texts, b) labels
texts, labels = zip(*data)

# train/validation split
train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2)

# tokenization
train_input_ids, train_attention_masks, train_labels = tokenize_and_encode(train_texts, train_labels)
val_input_ids, val_attention_masks, val_labels = tokenize_and_encode(val_texts, val_labels)
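If you want to confirm the shapes before wrapping these tensors in a dataset class (an optional check), you can print them:
# optional shape check: (num_examples, 512) for IDs and masks, (num_examples,) for labels
print(train_input_ids.shape, train_attention_masks.shape, train_labels.shape)
print(val_input_ids.shape, val_attention_masks.shape, val_labels.shape)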
Custom Dataset
Let us write a custom dataset class:
class TextClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, input_ids, attention_masks, labels, num_classes=2):
        self.input_ids = input_ids
        self.attention_masks = attention_masks
        self.labels = labels
        self.num_classes = num_classes
        self.one_hot_labels = self.one_hot_encode(labels, num_classes)

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {'input_ids': self.input_ids[idx],
                'attention_mask': self.attention_masks[idx],
                'labels': self.one_hot_labels[idx]}

    @staticmethod
    def one_hot_encode(targets, num_classes):
        # convert integer/float class labels into one-hot float tensors
        targets = targets.long()
        one_hot_targets = torch.zeros(targets.size(0), num_classes)
        one_hot_targets.scatter_(1, targets.unsqueeze(1), 1.0)
        return one_hot_targets

train_dataset = TextClassificationDataset(train_input_ids, train_attention_masks, train_labels)
val_dataset = TextClassificationDataset(val_input_ids, val_attention_masks, val_labels)
Let us discuss what the above code is doing:
- For fine-tuning BERT-base-uncased, each item of the dataset must be a dictionary with at least the keys below:
- input_ids
- attention_mask
- labels
The __getitem__ method should return a dictionary of this structure:
{'input_ids': self.input_ids[idx], 'attention_mask': self.attention_masks[idx], 'labels': self.one_hot_labels[idx]}
- one_hot_encode method — A static method that takes target labels and num_classes as arguments and converts the given targets into one-hot encoded tensors. The method first converts the targets to long tensors and then initializes a zero tensor of shape (number_of_samples, num_classes). The scatter_ function places 1.0 in the appropriate position for each sample's label, resulting in a one-hot encoded tensor; the short standalone example below illustrates this.
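Here is a tiny standalone example (with made-up labels, separate from the pipeline) that shows exactly what the scatter_ call produces:
# standalone illustration of the one-hot encoding performed by one_hot_encode
import torch

targets = torch.tensor([1.0, 0.0, 1.0]).long()   # three example labels
one_hot = torch.zeros(targets.size(0), 2)        # shape (3, 2), all zeros
one_hot.scatter_(1, targets.unsqueeze(1), 1.0)   # write 1.0 into column `label` of each row
print(one_hot)
# tensor([[0., 1.],
#         [1., 0.],
#         [0., 1.]])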
Data Loader
Now, let us create the data loaders that we can feed into our fine-tuning task:
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
eval_dataloader = DataLoader(val_dataset, batch_size=8)
The encoder expects data with dimensions (seq_len, batch_size). However, Hugging Face's BERT-base-uncased model requires data with dimensions (batch_size, seq_len). As a result, the output from the train_dataloader has dimensions of (batch_size, seq_len). We can execute the code below to review the dimensions of the data loader batches.
item = next(iter(train_dataloader))
item_ids, item_mask, item_labels = item['input_ids'], item['attention_mask'], item['labels']
print('item_ids, ', item_ids.shape, '\n', 'item_mask, ', item_mask.shape, '\n', 'item_labels, ', item_labels.shape)
Output:
item_ids, torch.Size([8, 512])
item_mask, torch.Size([8, 512])
item_labels, torch.Size([8, 2])
This is aligned with the shape requirements for fine-tuning BERT-base-uncased.
1. Load pre-trained BERT-base-uncased — There are two important concepts in the code below:
- We load the BERT-base-uncased model using the AutoModelForSequenceClassification class, which is a convenient way to add a final fully connected layer on top of the Transformer architecture for the classification task. By doing so, we adapt the pre-trained model to handle our specific classification problem.
- We also initialize the AdamW optimizer, a popular optimization algorithm for training deep learning models that is commonly used for training Transformer models.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr = 5e-5)
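If you are curious about the classification head that AutoModelForSequenceClassification attaches on top of BERT, this optional check prints it; for bert-base-uncased it is a linear layer mapping the 768-dimensional pooled output to two scores:
# optional: inspect the newly added classification head
print(model.config.num_labels)  # 2
print(model.classifier)         # Linear(in_features=768, out_features=2, bias=True)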
2. Prepare accelerator — Let us take a moment to discuss the accelerator and the benefits it offers when training deep learning models. The accelerator delivers a user-friendly API for training various deep learning models with ease. It offers two main advantages that make it a valuable tool for the training process:
- Flexibility to conduct training on various hardware accelerators, such as GPUs, TPUs and Apple's Metal Performance Shaders. In our example, we do not explicitly select the 'mps' device during training; the accelerator detects it automatically and uses it for training.
- The accelerate library is particularly useful for distributed training and mixed-precision training (a variation enabling mixed precision is sketched after the prepare() call below).
This code is a general syntax for preparing the accelerator:
# Declare accelerator
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(model, optimizer, train_dataloader, eval_dataloader)
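As a variation (not used in this chapter's run), the same prepare() call can enable mixed-precision training by configuring the Accelerator; "fp16" here is an illustrative choice and depends on your hardware:
# variation: enable mixed-precision training via the Accelerator
accelerator = Accelerator(mixed_precision="fp16")
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)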
3. Fine-tune the model — This code describes the fine-tuning process.
num_epochs = 1
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_epochs):
    # training loop
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # evaluation loop at the end of each epoch
    model.eval()
    device = accelerator.device  # use the device the accelerator selected
    preds = []
    out_label_ids = []
    for batch in eval_dataloader:
        with torch.no_grad():
            inputs = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**inputs)
            logits = outputs.logits
        preds.extend(torch.argmax(logits.detach().cpu(), dim=1).numpy())
        out_label_ids.extend(torch.argmax(inputs["labels"].detach().cpu(), dim=1).numpy())

    accuracy = accuracy_score(out_label_ids, preds)
    f1 = f1_score(out_label_ids, preds, average='weighted')
    recall = recall_score(out_label_ids, preds, average='weighted')
    precision = precision_score(out_label_ids, preds, average='weighted')
    print(f"Epoch {epoch + 1}/{num_epochs} Evaluation Results:")
    print(f"Accuracy: {accuracy}")
    print(f"F1 Score: {f1}")
    print(f"Recall: {recall}")
    print(f"Precision: {precision}")
Now, let us discuss what we are doing in the above code:
- lr_scheduler is an instance of a learning rate scheduler, which is responsible for adjusting the learning rate during the training process. The learning rate scheduler helps improve training by dynamically adjusting the learning rate based on the number of training steps. In this code, the learning rate starts at the initial value set in the optimizer and decreases linearly to 0 as training progresses. Some benefits of lr_scheduler over the optimizer alone are listed below; a small standalone sketch after this list shows the linear decay in action.
- Avoid overshooting — When using a fixed learning rate, the optimizer might overshoot the optimal solution, especially in the later stages of training. By decreasing the learning rate over time, the model can make smaller updates and fine-tune its weights.
- progress_bar is just a utility to show the progress of training.
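To see the linear decay concretely, here is a small standalone sketch (with a dummy optimizer and a made-up 10-step horizon, separate from the training run) that prints the learning rate at each step:
# standalone sketch: how the linear schedule decays the learning rate from 5e-5 toward 0
import torch
from transformers import get_scheduler

dummy_opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=5e-5)
sched = get_scheduler("linear", optimizer=dummy_opt, num_warmup_steps=0, num_training_steps=10)
for step in range(10):
    dummy_opt.step()   # optimizer step first, then scheduler step
    sched.step()
    print(step, sched.get_last_lr()[0])  # decreases linearly toward 0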
This code block is the standard syntax for fine-tuning:
outputs = model(**batch)
loss = outputs.loss
accelerator.backward(loss)
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
progress_bar.update(1)
You may notice that during training we do not explicitly move the tensors to the device; the accelerator automatically identifies the device and moves the tensors into the appropriate format.
After each epoch, we also print the evaluation metrics over the evaluation dataset.
The printed metrics demonstrate the results of fine-tuning our classifier. We have successfully created a capable classifier for distinguishing between real and fake news. However, it is worth noting that the dataset contains news provider names (for example, ABC, CBS) in the real news articles. This might lead the model to rely on such information, resulting in exceptionally high performance.
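If you want to check for this kind of leakage yourself, here is an optional diagnostic (using the provider names mentioned above purely as examples) that counts how often each name appears in the two classes:
# optional leakage check: how often do provider names appear in each class?
for name in ["ABC", "CBS"]:
    counts = df[df['text'].str.contains(name, na=False)]['label'].value_counts()
    print(name, counts.to_dict())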
Inference
We have developed a machine learning model to distinguish between real
and fake news. Now it is time to create an inference pipeline that allows us
to input any text passage, and the model will return a result indicating
whether the given text block belongs to real news or fake news.
Let us discuss some crucial points in the code below:
- tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') — You need to use the same tokenizer that was used for fine-tuning.
- logits.detach().cpu()
- detach prevents unintentional backpropagation.
- cpu moves the output so that it is compatible with scikit-learn libraries for further computation.
from transformers import BertTokenizer
import torch

# load the same tokenizer that was used for fine-tuning
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def inference(text, model, label, device='mps'):
    # tokenize the input text
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    # move input tensors to the specified device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    # set the model to evaluation mode and perform inference
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
    # get the index of the predicted label
    pred_label_idx = torch.argmax(logits.detach().cpu(), dim=1).item()
    print(f"Predicted label index: {pred_label_idx}, actual label {label}")
    return pred_label_idx
Now let us use the inference pipeline.
# Example usage
text = ("CNN (Washington) General Motors plans to phase out widely used "
        "Apple (AAPL) CarPlay and Android Auto technologies that allow "
        "drivers to bypass a vehicle's infotainment system, shifting instead "
        "to built-in infotainment systems developed with Google (GOOG) for "
        "future electric vehicles.")
pred_label_idx = inference(text, model, 1.0)
Output:
Predicted label index: 1, actual label 1.0
This is the correct output, as the news article was retrieved from CNN.
Conclusion
As we conclude our exploration into the realm of speech processing, it is clear that this technology is not just a trend but a profound shift in how we interact with machines. By embarking on a project to develop a classifier through fine-tuning BERT-base-uncased, you have harnessed the power of cutting-edge models to refine your understanding and application of natural language processing. The creation of a custom dataset class and data loader has equipped you with the tools to handle and preprocess data with precision, ensuring that your model operates at peak performance. With these components in place, you are now ready to push the boundaries of inference, transforming raw text into actionable insights. This journey not only showcases the power of modern AI techniques but also positions you at the forefront of technological innovation, ready to leverage these advancements for groundbreaking applications.