Revolutionizing Speech Recognition
Imagine transforming raw, unstructured audio into precise, actionable insights with just a few lines of code. This is the shift that Wav2Vec brings to the world of speech processing. It is not just a leap forward but a redefinition of what is possible in understanding and analyzing human language, making the once-complex task of speech recognition both accessible and remarkably powerful.
Table of Contents
- Wav2Vec
- Applications of Wav2Vec
- SpeechT5
- Input/Output Representation
Wav2Vec
Wav2Vec is a self-supervised learning framework for speech-processing tasks. The model undergoes two main stages: pre-training and fine-tuning. In pre-training, the model is trained on a large amount of unlabeled audio data. Figure 1.1 illustrates the architecture of Wav2Vec. Let us explore its components.
Pre-processing of Raw Audio
- The raw audio is divided into short segments called context windows, typically spanning a few seconds.
- Within each context window, the audio is further divided into smaller chunks known as input sequences, each covering roughly 25 ms of audio.
- A feature extractor is applied to each input sequence, transforming the audio into a fixed-dimensional representation that captures important spectral and temporal information (see the sketch after this list).
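To make the windowing concrete, here is a minimal NumPy sketch of splitting a recording into context windows and ~25 ms frames. The window length and the frame/stride sizes are illustrative assumptions; in the actual model, the convolutional feature encoder performs this framing implicitly.

```python
import numpy as np

SAMPLE_RATE = 16_000           # Wav2Vec operates on 16 kHz mono audio
WINDOW_SECONDS = 5.0           # assumed context-window length for illustration
FRAME_MS, STRIDE_MS = 25, 20   # per-frame coverage / hop of the feature encoder

def split_into_context_windows(waveform: np.ndarray) -> list:
    """Chop a long recording into fixed-length context windows."""
    step = int(WINDOW_SECONDS * SAMPLE_RATE)
    return [waveform[i:i + step] for i in range(0, len(waveform), step)]

def frame_window(window: np.ndarray) -> np.ndarray:
    """Slice one context window into overlapping ~25 ms input sequences."""
    frame = int(FRAME_MS / 1000 * SAMPLE_RATE)    # 400 samples
    stride = int(STRIDE_MS / 1000 * SAMPLE_RATE)  # 320 samples
    starts = range(0, len(window) - frame + 1, stride)
    return np.stack([window[s:s + frame] for s in starts])

audio = np.random.randn(10 * SAMPLE_RATE).astype(np.float32)  # stand-in recording
windows = split_into_context_windows(audio)
frames = frame_window(windows[0])
print(len(windows), frames.shape)   # 2 windows; (249, 400) frames per window
```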
Encoder
- The encoder comprises multiple blocks, each consisting of a convolutional neural network (CNN) followed by layer normalization and the GELU activation function (a sketch of one such block follows this list).
- The GELU activation function smooths the transition for negative values, addressing the dying-ReLU problem and ensuring better gradient flow during training.
- The CNN processes the input sequences, extracting low-level acoustic features.
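Here is a minimal PyTorch sketch of one encoder block (Conv1d, then layer normalization, then GELU). The channel counts, kernel sizes, and strides are illustrative, not the published configuration.

```python
import torch
import torch.nn as nn

class FeatureEncoderBlock(nn.Module):
    """One encoder block: Conv1d -> LayerNorm -> GELU, as described above."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int, stride: int):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride=stride)
        self.norm = nn.LayerNorm(out_ch)  # normalizes across channels
        self.act = nn.GELU()              # smooth for negatives; avoids dying ReLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)                  # (batch, channels, time)
        x = self.norm(x.transpose(1, 2))  # LayerNorm expects channels last
        return self.act(x).transpose(1, 2)

# Two illustrative blocks downsampling raw 16 kHz audio into latent frames;
# the real Wav2Vec 2.0 feature encoder stacks seven such convolutions.
encoder = nn.Sequential(
    FeatureEncoderBlock(1, 512, kernel=10, stride=5),
    FeatureEncoderBlock(512, 512, kernel=3, stride=2),
)
latents = encoder(torch.randn(1, 1, 16_000))  # one second of audio
print(latents.shape)                          # torch.Size([1, 512, 1599])
```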
Quantization Module
- For self-supervised pre-training, the output of the feature encoder is discretized into a finite set of speech representations using product quantization (a simplified sketch follows this list).
- Relative positional encoding information is added to the feature representation.
- The latent features are then passed through a transformer, which generates contextualized representations; the quantized features serve as the prediction targets during pre-training.
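The following is a simplified product-quantization sketch: each latent vector is split into groups, and every group is snapped to its nearest codebook entry. The group and codebook sizes mirror the paper's G = 2 and V = 320, but the hard nearest-neighbour lookup is a simplification; the actual model learns the codebooks end-to-end with a Gumbel-softmax.

```python
import torch

def product_quantize(latents: torch.Tensor, codebooks: torch.Tensor):
    """Split each latent vector into G groups and snap every group to its
    nearest codebook entry; the concatenated codewords form the discrete
    speech representation.
    latents:   (time, dim)
    codebooks: (G, V, dim // G)
    """
    G, V, d = codebooks.shape
    groups = latents.view(latents.size(0), G, d)            # (time, G, d)
    # squared distance of every group vector to every codebook entry
    dists = ((groups.unsqueeze(2) - codebooks.unsqueeze(0)) ** 2).sum(-1)
    idx = dists.argmin(-1)                                  # (time, G)
    quantized = codebooks[torch.arange(G), idx]             # (time, G, d)
    return quantized.reshape(latents.size(0), G * d), idx

codebooks = torch.randn(2, 320, 128)  # G=2 groups of V=320 entries, as in the paper
z = torch.randn(50, 256)              # 50 latent frames
q, codes = product_quantize(z, codebooks)
print(q.shape, codes.shape)           # torch.Size([50, 256]) torch.Size([50, 2])
```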
Pre-training
- During pre-training, Wav2Vec employs self-supervised learning: the model is trained to predict masked or corrupted speech representations within each context window, much like BERT pre-training.
- As shown in Figure 1.1, roughly 50% of the latent representations are masked before being fed to the transformer (a simplified masking sketch follows this list).
- By reconstructing the masked or corrupted parts, the model learns to capture important speech features without explicit labels.
- The pre-training loss compares the predicted representations against the original unmasked representations; in practice, this is a contrastive objective in which the model must identify the true quantized representation of each masked time step among distractors.
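Below is a toy sketch of the span masking described above: contiguous spans of latent frames are masked until roughly half of the frames are covered. The span length, the zero fill-in (the real model substitutes a learned mask embedding), and the stopping rule are all simplifications.

```python
import torch

def mask_latent_frames(latents: torch.Tensor, mask_prob: float = 0.5, span: int = 10):
    """Mask contiguous spans of latent frames until ~mask_prob of all
    frames are covered, then zero them out. The real model replaces
    masked frames with a learned mask embedding rather than zeros."""
    time = latents.size(0)
    mask = torch.zeros(time, dtype=torch.bool)
    while mask.float().mean() < mask_prob:
        start = int(torch.randint(0, max(time - span, 1), (1,)))
        mask[start:start + span] = True
    masked = latents.clone()
    masked[mask] = 0.0
    return masked, mask

z = torch.randn(100, 256)                 # latent frames from the feature encoder
z_masked, mask = mask_latent_frames(z)
print(mask.float().mean())                # fraction masked, roughly 0.5
```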
Fine-Tuning
- After pre-training, Wav2Vec can be fine-tuned on specific downstream tasks, such as speech recognition or speaker identification.
- Fine-tuning involves training the model on labeled data specific to the target task, enabling it to adapt to the task's requirements (see the inference sketch after this list).
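As a usage sketch, the snippet below runs inference with a publicly available checkpoint fine-tuned for English ASR with a CTC head. It assumes the Hugging Face transformers library and the facebook/wav2vec2-base-960h checkpoint; the silent waveform is a stand-in for real audio.

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform = np.zeros(16_000, dtype=np.float32)  # stand-in for 1 s of 16 kHz speech
inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, time, vocab)

predicted_ids = torch.argmax(logits, dim=-1)    # greedy CTC decoding
print(processor.batch_decode(predicted_ids))
```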
Applications of Wav2Vec
Wav2Vec has found successful applications in various speech processing
tasks, including speech recognition, speaker identification, speech synthesis,
and keyword spotting.
SpeechT5
SpeechT5 is an adaptation of the T5 architecture for speech-focused tasks, encompassing ASR, text-to-speech synthesis, and language comprehension, among others. The architecture of SpeechT5 is elucidated in Figure 1.3 and Figure 1.4.
Depicted in Figure 1.3 is the encoder-decoder structure of the model: a shared encoder-decoder backbone surrounded by six modal-specific pre/post-nets. Let us dig deeper into these components, starting with a short ASR sketch.
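The sketch below uses the Hugging Face implementation, which pairs the shared encoder-decoder with the speech-encoder and text-decoder pre/post-nets for ASR. It assumes the microsoft/speecht5_asr checkpoint; the silent audio is a stand-in for real speech.

```python
import numpy as np
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

audio = np.zeros(16_000, dtype=np.float32)  # stand-in for 1 s of 16 kHz speech
inputs = processor(audio=audio, sampling_rate=16_000, return_tensors="pt")

# The speech-encoder pre-net consumes the waveform; the text-decoder
# post-net turns decoder states into token probabilities.
ids = model.generate(**inputs, max_length=100)
print(processor.batch_decode(ids, skip_special_tokens=True))
```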
Input/Output Representation
In SpeechT5, the problem is framed as converting speech/text into speech/text.
- Text Pre/Post-net: Here, we divide the text into units known as tokens, which are typically characters. When the tokens enter the system through the pre-net, they are transformed into embedding vectors. Later, the post-net takes the decoder's output vectors and calculates the probability of each token being the right output, based on the learned information.
- Speech Pre/Post-net: For handling speech data, the system uses a component from Wav2Vec 2.0, the CNN feature extractor, as the encoder pre-net. This breaks the speech down into a format the system can work with. The decoder pre-net operates on a feature of the audio known as the log-Mel filterbank and comprises three fully connected layers followed by the ReLU activation function. It also incorporates a speaker embedding, a way of differentiating between different speakers' voices.
- Finally, the decoder post-net does two things:
  - It predicts the log-Mel filterbank features of the output speech.
  - It projects the decoder output into a single number, known as a scalar.
This scalar helps determine when to conclude generation, often referred to as predicting the stop token. The sketch below traces this full text-to-speech path, from text tokens to filterbank frames to waveform.
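Putting the pieces together: the text pre-net embeds the tokens, the decoder emits log-Mel filterbank frames until the stop-token scalar fires, and a HiFi-GAN vocoder renders them as audio. This assumes the microsoft/speecht5_tts and microsoft/speecht5_hifigan checkpoints; the zero speaker embedding is a placeholder for a real 512-dimensional x-vector.

```python
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Speech processing made simple.", return_tensors="pt")
# Placeholder speaker embedding; real use loads a 512-dim x-vector for the voice.
speaker_embedding = torch.zeros(1, 512)

# The decoder emits log-Mel filterbank frames until the predicted stop-token
# scalar crosses its threshold; the vocoder then renders them as a waveform.
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
print(speech.shape)  # 1-D waveform tensor at 16 kHz
```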
Conclusion
Wav2Vec has changed the way we approach speech recognition, turning raw audio into meaningful data with impressive accuracy. Its applications stretch across industries, powering everything from voice-activated assistants to real-time translation services. Coupled with the flexibility of SpeechT5 and its powerful input/output representations, we are witnessing a new era in speech processing. This convergence of technologies is not just advancing how machines understand us; it is redefining our interaction with technology itself, paving the way for more intuitive, human-like communication across every facet of our lives.