Decoding Speech Processing

Inside the minds of Whisper, SpeechT5 and Wav2Vec

A.I Hub
6 min read · Aug 12, 2024

In a world where voice commands and virtual assistants have become the norm, understanding the anatomy of speech processing models is no longer just for AI experts; it's for everyone. Enter Whisper, SpeechT5 and Wav2Vec, three groundbreaking models that are not only pushing the boundaries of what machines can hear but also transforming how they comprehend and interact with the human voice. In this deep dive, we will unravel the intricacies of these models, revealing the technology that is redefining speech recognition and synthesis in the digital era. Get ready to explore the cutting edge of AI, where the future of communication is being forged.

Table of Contents

  • Introduction
  • System requirements
  • Speech processing
  • Examples of speech pre-processing

Introduction


Welcome to this exploration of speech processing with transformers. It is one of the less mature yet rapidly growing fields of artificial intelligence, boasting a wide range of applications, including automated transcription, automated voice translation, speaker identification and audio generation. Recently, transformer architectures such as Whisper have outperformed traditional speech processing techniques. In this section, we will delve into the three most important speech processing transformer architectures and illustrate them with practical examples.

System Requirements

Setting Up Environment:

  1. Install Anaconda on the local machine.
  2. Create a virtual environment.
  3. Install necessary packages in the virtual environment.
  4. Configure and start Jupyter Notebook.
  5. Connect Google Colab with your local runtime environment (see the sketch at the end of this section).

Installing Anaconda On Local System

  1. Go to the Anaconda download page. https://www.anaconda.com/products/distribution
  2. Download the appropriate version for your computer.
  3. Follow the instructions provided by the installer.
  4. If the installer prompts you to add Anaconda to the system’s PATH variable, do so. This enables you to seamlessly use Anaconda’s features from the command line.
  5. Check that the installation was successful by typing the following command in the terminal.
conda --version

Creating a Virtual Environment:

To create a virtual environment in Anaconda via the terminal, follow these steps.

  1. Open the terminal on your local machine.
  2. Type the following command and press Enter to create a new virtual environment. In the command below, the virtual environment name is torch_learn and the Python version is 3.11.
conda create --name torch_learn python=3.11

3. Once the environment has been created, activate it by typing the following command.

conda activate torch_learn

4. Install the necessary packages in your environment. The following are the requirements for this section; install the packages needed for each section as you go.

pip install transformers
pip install datasets
pip install git+https://github.com/huggingface/diffusers
pip install accelerate
pip install ftfy
pip install tensorboard
pip install Jinja2
pip install scikit-learn
pip install torch
pip install torchaudio
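
Steps 4 and 5 of the setup list, configuring Jupyter Notebook and connecting Google Colab to the local runtime, can be completed from the same terminal. The commands below are a minimal sketch based on Colab's published local-runtime instructions; they assume Jupyter and the jupyter_http_over_ws extension are installed in the active environment, flags may differ slightly across Jupyter versions, and port 8888 is just an example.

# Install Jupyter and the extension that lets Colab talk to a local server (example setup)
pip install jupyter jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws

# Start Jupyter so that Colab is allowed to connect to it on port 8888
jupyter notebook --NotebookApp.allow_origin='https://colab.research.google.com' --port=8888 --NotebookApp.port_retries=0

Copy the URL with the token that Jupyter prints, then in Colab choose “Connect to a local runtime” and paste it.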

Speech Processing


The preparation of raw audio signals for machine learning tasks involves several critical pre-processing steps. These steps convert the raw audio data into a format suitable for training and inference with transformers. Here are the pre-processing steps for audio signals.

  1. Pre-processing — A crucial step for most transformer models is resampling. Transformers require an audio signal at a predefined sample rate; for instance, Whisper needs a sampling rate of 16 kHz. Additional pre-processing, such as normalization or noise reduction, can be applied to ensure consistency and faster convergence.
  2. Frame extraction — The audio signal is split into overlapping frames of a fixed duration, typically between 20 and 40 milliseconds. Each frame corresponds to a short segment of the audio waveform. A standard choice is a 50% overlap, meaning that adjacent frames share half of their samples. Overlapping ensures a smoother transition between adjacent frames and reduces the impact of frame boundaries on the extracted features.
  3. Windowing — A windowing function, such as a Hamming window, is applied to mitigate artifacts at the start and end of each frame by reducing the amplitude of the signal at these points (steps 1 to 3 are sketched in code right after this list).
  4. Feature extraction — A feature extraction technique is applied to each frame. Common techniques include log-mel spectrograms, Mel-frequency cepstral coefficients (MFCCs) or other time-frequency representations.
  5. Sequence generation — The extracted frames are arranged into sequences, each sequence representing a series of consecutive frames.
  6. Padding — Padding is applied to the sequences to ensure that all sequences have the same length.
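
To make steps 1 to 3 concrete, here is a minimal sketch using torchaudio. The input file name speech.wav is hypothetical, and the 16 kHz target rate, 25 ms frame length and 50% overlap are example values chosen to match the description above, not settings required by any particular model.

import torch
import torchaudio

# Step 1: load an audio file and resample it to a target rate (example values)
waveform, sample_rate = torchaudio.load("speech.wav")  # hypothetical input file
target_rate = 16000
waveform = torchaudio.transforms.Resample(sample_rate, target_rate)(waveform)

# Step 2: split the signal into overlapping frames (25 ms frames, 50% overlap)
frame_size = int(0.025 * target_rate)   # 400 samples per frame
hop_size = frame_size // 2              # 200 samples, i.e. 50% overlap
frames = waveform.unfold(dimension=1, size=frame_size, step=hop_size)
# frames has shape (channels, num_frames, frame_size)

# Step 3: apply a Hamming window to each frame to reduce edge artifacts
window = torch.hamming_window(frame_size)
windowed_frames = frames * window

print(windowed_frames.shape)

In practice, higher-level transforms such as torchaudio.transforms.MFCC, used in the next section, perform the framing and windowing internally.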

Examples of Speech Pre-Processing


In this section, we will demonstrate speech pre-processing through practical examples.

import torch
import torchaudio
from torchaudio.transforms import MFCC, Resample
from torchaudio.utils import download_asset

# Load the audio waveform
SAMPLE_SPEECH = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")
waveform, sample_rate = torchaudio.load(SAMPLE_SPEECH)

# Get the duration of the waveform in seconds
waveform_duration = waveform.numel() / sample_rate
print("Waveform duration:", waveform_duration, "seconds")

# Define the frame length and frame shift in seconds
frame_length = 0.025  # 25 milliseconds
frame_shift = 0.01    # 10 milliseconds

# Define the desired sequence length and number of MFCC coefficients
sequence_length = 40
num_mfcc = 40

# Resample the waveform to the 16 kHz rate expected by models such as Whisper
target_sample_rate = 16000
frames = Resample(sample_rate, target_sample_rate)(waveform)
print(frames.shape)

# Initialize the MFCC transform using the frame length and frame shift
mfcc_transform = MFCC(
    sample_rate=target_sample_rate,
    n_mfcc=num_mfcc,
    melkwargs={
        "n_fft": int(frame_length * target_sample_rate),
        "hop_length": int(frame_shift * target_sample_rate),
    },
)

# Perform feature extraction: (batch, n_mfcc, num_frames)
mfcc = mfcc_transform(frames)

# Arrange the frames into a sequence tensor: (batch, 1, num_frames, n_mfcc)
sequences = mfcc.transpose(1, 2).unsqueeze(1)

# Perform padding along the frame axis if necessary
num_frames = sequences.shape[2]
if num_frames < sequence_length:
    pad_frames = torch.zeros(
        sequences.shape[0], sequences.shape[1], sequence_length - num_frames, num_mfcc
    )
    sequences = torch.cat([sequences, pad_frames], dim=2)

# Print the shapes of the extracted features and sequences
print("MFCC shape:", mfcc.shape)
print("Sequences shape:", sequences.shape)

Output:

Waveform duration: 3.4 seconds
torch.Size([1, 54400])
MFCC shape: torch.Size([1, 40, 341])
Sequences shape: torch.Size([1, 1, 341, 40])

Analysis:

Let us delve into the details of the aforementioned code snippet to understand its operation.

  1. torch.Size([1, 54400]) — This is the shape of the waveform tensor: a size of 1 along the first dimension (the batch dimension) and 54,400 along the second dimension (the number of samples in the waveform).
  2. MFCC shape: torch.Size([1, 40, 341]) — This is the shape of the MFCC tensor obtained from the feature extraction step: a size of 1 along the first (batch) dimension, 40 along the second dimension (the number of MFCC coefficients) and 341 along the third dimension (the number of frames).
  3. Sequences shape: torch.Size([1, 1, 341, 40]) — This is the shape of the sequence tensor after rearranging the MFCC features so that the frames form a sequence of 40-dimensional feature vectors.

The output shows that the waveform has a duration of 3.4 seconds, the MFCC features have a shape of (1, 40, 341) and the sequences have a shape of (1, 1, 341, 40).
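
These pre-processed features are exactly the kind of input the models named in the title expect. As a bridge to the model-specific chapters, here is a minimal sketch, assuming the Hugging Face transformers package installed earlier, of how Whisper's own feature extractor turns the same 16 kHz waveform into log-mel features. The checkpoint name openai/whisper-tiny is only an example.

import torchaudio
from torchaudio.utils import download_asset
from transformers import WhisperFeatureExtractor

# Load the same 16 kHz sample used above
SAMPLE_SPEECH = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")
waveform, sample_rate = torchaudio.load(SAMPLE_SPEECH)

# Whisper's feature extractor assumes 16 kHz input and produces
# log-mel spectrogram features padded to 30 seconds of audio
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
inputs = feature_extractor(
    waveform.squeeze().numpy(),
    sampling_rate=sample_rate,
    return_tensors="pt",
)
print(inputs.input_features.shape)  # e.g. torch.Size([1, 80, 3000])

Wav2Vec2 and SpeechT5 ship analogous feature extractor and processor classes with the same calling pattern, so the pre-processing described in this section carries over to all three models.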

Conclusion

Finally, as we navigate the rapidly advancing landscape of speech processing, it is clear that the potential of this technology is only just beginning to be realized. From understanding the foundational system requirements to delving into the intricacies of speech processing, we see a future where machines not only comprehend our words but also grasp the nuances of our speech with unparalleled accuracy. The examples of speech pre-processing we have explored highlight the transformative power of this technology, enabling more intuitive and human-like interactions. As we continue to refine these processes, the boundary between human and machine communication will blur even further, ushering in an era where our voices become the ultimate interface with technology. This journey into the heart of speech processing is not just about understanding; it is about shaping the future of how we connect with the digital world.
