Revolutionizing Speech Recognition
Imagine transforming raw, unstructured audio into precise, actionable insights with just a few lines of code. This is the shift that Wav2Vec brings to the world of speech processing. It is not just a leap forward but a redefinition of what is possible in understanding and analyzing human language, making the once-complex task of speech recognition both accessible and remarkably powerful.
Table of Contents
- Wav2Vec
- Applications of Wav2Vec
- SpeechT5
- Input/Output Representation
Wav2Vec
Wav2Vec is a self-supervised learning framework for speech-processing tasks. The model undergoes two main stages: pre-training and fine-tuning. In pre-training, the model is trained on a large amount of unlabeled audio data. Figure 1.1 illustrates the architecture of Wav2Vec. Let us explore its components.
Pre-processing of Raw Audio
- The raw audio is divided into short segments called context windows, typically spanning a few seconds.
- Within each context window, the audio is further divided into smaller chunks known as input sequences, each covering roughly 25 ms of audio.
- A feature extractor is applied to each input sequence, transforming the audio into a fixed-dimensional representation that captures important spectral and temporal information (see the sketch after this list).
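To make the windowing concrete, here is a minimal NumPy sketch of splitting a recording into context windows and ~25 ms frames. The window length and the frame/stride sizes are illustrative assumptions; in the actual model, the convolutional feature encoder performs this framing implicitly.

```python
import numpy as np

SAMPLE_RATE = 16_000           # Wav2Vec operates on 16 kHz mono audio
WINDOW_SECONDS = 5.0           # assumed context-window length for illustration
FRAME_MS, STRIDE_MS = 25, 20   # per-frame coverage / hop of the feature encoder

def split_into_context_windows(waveform: np.ndarray) -> list:
    """Chop a long recording into fixed-length context windows."""
    step = int(WINDOW_SECONDS * SAMPLE_RATE)
    return [waveform[i:i + step] for i in range(0, len(waveform), step)]

def frame_window(window: np.ndarray) -> np.ndarray:
    """Slice one context window into overlapping ~25 ms input sequences."""
    frame = int(FRAME_MS / 1000 * SAMPLE_RATE)    # 400 samples
    stride = int(STRIDE_MS / 1000 * SAMPLE_RATE)  # 320 samples
    starts = range(0, len(window) - frame + 1, stride)
    return np.stack([window[s:s + frame] for s in starts])

audio = np.random.randn(10 * SAMPLE_RATE).astype(np.float32)  # stand-in recording
windows = split_into_context_windows(audio)
frames = frame_window(windows[0])
print(len(windows), frames.shape)   # 2 windows; (249, 400) frames per window
```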
Encoder
- The encoder comprises multiple blocks, each consisting of a convolutional neural network (CNN) followed by layer normalization and the GELU activation function (a sketch of one such block follows this list).
- The GELU activation function smooths the transition for negative values, addressing the dying-ReLU problem and ensuring better gradient flow during training.
- The CNN processes the input sequences, extracting low-level acoustic features.
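Here is a minimal PyTorch sketch of one encoder block (Conv1d, then layer normalization, then GELU). The channel counts, kernel sizes, and strides are illustrative, not the published configuration.

```python
import torch
import torch.nn as nn

class FeatureEncoderBlock(nn.Module):
    """One encoder block: Conv1d -> LayerNorm -> GELU, as described above."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int, stride: int):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride=stride)
        self.norm = nn.LayerNorm(out_ch)  # normalizes across channels
        self.act = nn.GELU()              # smooth for negatives; avoids dying ReLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)                  # (batch, channels, time)
        x = self.norm(x.transpose(1, 2))  # LayerNorm expects channels last
        return self.act(x).transpose(1, 2)

# Two illustrative blocks downsampling raw 16 kHz audio into latent frames;
# the real Wav2Vec 2.0 feature encoder stacks seven such convolutions.
encoder = nn.Sequential(
    FeatureEncoderBlock(1, 512, kernel=10, stride=5),
    FeatureEncoderBlock(512, 512, kernel=3, stride=2),
)
latents = encoder(torch.randn(1, 1, 16_000))  # one second of audio
print(latents.shape)                          # torch.Size([1, 512, 1599])
```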
Quantization Module
- For self-supervised pre-training, the output of the feature encoder is discretized into a finite set of speech representations using product quantization (a simplified sketch follows this list).
- Relative positional encoding information is added to the feature representation.
- The latent features are then passed through a transformer, which generates contextualized representations; the quantized features serve as the prediction targets during pre-training.
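The following is a simplified product-quantization sketch: each latent vector is split into groups, and every group is snapped to its nearest codebook entry. The group and codebook sizes mirror the paper's G = 2 and V = 320, but the hard nearest-neighbour lookup is a simplification; the actual model learns the codebooks end-to-end with a Gumbel-softmax.

```python
import torch

def product_quantize(latents: torch.Tensor, codebooks: torch.Tensor):
    """Split each latent vector into G groups and snap every group to its
    nearest codebook entry; the concatenated codewords form the discrete
    speech representation.
    latents:   (time, dim)
    codebooks: (G, V, dim // G)
    """
    G, V, d = codebooks.shape
    groups = latents.view(latents.size(0), G, d)            # (time, G, d)
    # squared distance of every group vector to every codebook entry
    dists = ((groups.unsqueeze(2) - codebooks.unsqueeze(0)) ** 2).sum(-1)
    idx = dists.argmin(-1)                                  # (time, G)
    quantized = codebooks[torch.arange(G), idx]             # (time, G, d)
    return quantized.reshape(latents.size(0), G * d), idx

codebooks = torch.randn(2, 320, 128)  # G=2 groups of V=320 entries, as in the paper
z = torch.randn(50, 256)              # 50 latent frames
q, codes = product_quantize(z, codebooks)
print(q.shape, codes.shape)           # torch.Size([50, 256]) torch.Size([50, 2])
```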
Pre-training
- During pre-training, Wav2Vec employs self-supervised learning: the model is trained to predict masked or corrupted speech representations within each context window, much like BERT pre-training.
- As shown in Figure 1.1, roughly 50% of the latent representations are masked before being fed to the transformer (a simplified masking sketch follows this list).
- By reconstructing the masked or corrupted parts, the model learns to capture important speech features without explicit labels.
- The pre-training loss compares the predicted representations against the original unmasked representations; in practice, this is a contrastive objective in which the model must identify the true quantized representation of each masked time step among distractors.
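Below is a toy sketch of the span masking described above: contiguous spans of latent frames are masked until roughly half of the frames are covered. The span length, the zero fill-in (the real model substitutes a learned mask embedding), and the stopping rule are all simplifications.

```python
import torch

def mask_latent_frames(latents: torch.Tensor, mask_prob: float = 0.5, span: int = 10):
    """Mask contiguous spans of latent frames until ~mask_prob of all
    frames are covered, then zero them out. The real model replaces
    masked frames with a learned mask embedding rather than zeros."""
    time = latents.size(0)
    mask = torch.zeros(time, dtype=torch.bool)
    while mask.float().mean() < mask_prob:
        start = int(torch.randint(0, max(time - span, 1), (1,)))
        mask[start:start + span] = True
    masked = latents.clone()
    masked[mask] = 0.0
    return masked, mask

z = torch.randn(100, 256)                 # latent frames from the feature encoder
z_masked, mask = mask_latent_frames(z)
print(mask.float().mean())                # fraction masked, roughly 0.5
```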
Fine-Tuning
- After pre-training, Wav2Vec can be fine-tuned on specific downstream tasks, such as speech recognition or speaker identification.
- Fine-tuning involves training the model on labeled data specific to the target task, enabling it to adapt to the task's requirements (see the inference sketch after this list).
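As a usage sketch, the snippet below runs inference with a publicly available checkpoint fine-tuned for English ASR with a CTC head. It assumes the Hugging Face transformers library and the facebook/wav2vec2-base-960h checkpoint; the silent waveform is a stand-in for real audio.

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform = np.zeros(16_000, dtype=np.float32)  # stand-in for 1 s of 16 kHz speech
inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, time, vocab)

predicted_ids = torch.argmax(logits, dim=-1)    # greedy CTC decoding
print(processor.batch_decode(predicted_ids))
```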
Applications of Wav2Vec
Wav2Vec has found successful applications in various speech processing
tasks, including speech recognition, speaker identification, speech synthesis,
and keyword spotting.
SpeechT5
SpeechT5 is an adaptation of the T5 architecture for speech-focused tasks, encompassing ASR, text-to-speech synthesis, and language comprehension, among others. The architecture of SpeechT5 is elucidated in Figure 1.3 and Figure 1.4.
Depicted in Figure 1.3 is the encoder-decoder structure of the model: a shared encoder-decoder backbone surrounded by six modal-specific pre/post-nets. Let us dig deeper into these components, starting with a short ASR sketch.
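The sketch below uses the Hugging Face implementation, which pairs the shared encoder-decoder with the speech-encoder and text-decoder pre/post-nets for ASR. It assumes the microsoft/speecht5_asr checkpoint; the silent audio is a stand-in for real speech.

```python
import numpy as np
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

audio = np.zeros(16_000, dtype=np.float32)  # stand-in for 1 s of 16 kHz speech
inputs = processor(audio=audio, sampling_rate=16_000, return_tensors="pt")

# The speech-encoder pre-net consumes the waveform; the text-decoder
# post-net turns decoder states into token probabilities.
ids = model.generate(**inputs, max_length=100)
print(processor.batch_decode(ids, skip_special_tokens=True))
```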
Input/Output Representation
In SpeechT5, the problem is framed as converting speech/text into speech/text.
- Text Pre/Post-net: Here, we divide the text into units known as tokens, which are typically characters. When the tokens enter the system through the pre-net, they are transformed into embedding vectors. Later, the post-net takes the decoder's output vectors and calculates the probability of each token being the right output, based on the learned information.
- Speech Pre/Post-net: For handling speech data, the system uses a component from Wav2Vec 2.0, the CNN feature extractor, as the encoder pre-net. This breaks the speech down into a format the system can work with. The decoder pre-net operates on a feature of the audio known as the log-Mel filterbank and comprises three fully connected layers followed by the ReLU activation function. It also incorporates a speaker embedding, a way of differentiating between different speakers' voices.
- Finally, the decoder post-net does two things:
  - It predicts the log-Mel filterbank features of the output speech.
  - It projects the decoder output into a single number, known as a scalar.
This scalar helps determine when to conclude generation, often referred to as predicting the stop token. The sketch below traces this full text-to-speech path, from text tokens to filterbank frames to waveform.
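Putting the pieces together: the text pre-net embeds the tokens, the decoder emits log-Mel filterbank frames until the stop-token scalar fires, and a HiFi-GAN vocoder renders them as audio. This assumes the microsoft/speecht5_tts and microsoft/speecht5_hifigan checkpoints; the zero speaker embedding is a placeholder for a real 512-dimensional x-vector.

```python
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Speech processing made simple.", return_tensors="pt")
# Placeholder speaker embedding; real use loads a 512-dim x-vector for the voice.
speaker_embedding = torch.zeros(1, 512)

# The decoder emits log-Mel filterbank frames until the predicted stop-token
# scalar crosses its threshold; the vocoder then renders them as a waveform.
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
print(speech.shape)  # 1-D waveform tensor at 16 kHz
```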
Conclusion
Wav2Vec has changed the way we approach speech recognition, turning raw audio into meaningful data with impressive accuracy. Its applications stretch across industries, powering everything from voice-activated assistants to real-time translation services. Coupled with the flexibility of SpeechT5 and its powerful input/output representations, we are witnessing a new era in speech processing. This convergence of technologies is not just advancing how machines understand us; it is redefining our interaction with technology itself, paving the way for more intuitive, human-like communication across every facet of our lives.