Power of Speech
In a world driven by voice, the ability to convert speech into text with precision has become the linchpin of modern communication. Speech-to-text technology is no longer just a convenience; it is a revolution that is transforming how we interact with our devices, conduct business and navigate the digital world. From real-time transcription to unlocking accessibility for millions, speech-to-text is breaking down barriers and creating new possibilities. Welcome to the frontier where spoken words effortlessly transform into written language, powering the future of communication.
Table of Contents
- Speech To Text
- Text To Speech
- Audio To Audio
Speech To Text
Automatic speech recognition (ASR) is a crucial task in the realm of speech processing. A
comprehensive ASR system entails several components, including voice
activity detection, speaker diarization and inverse text normalization.
Historically, these tasks relied on an array of complex components, each
carrying out a specific function. However, the advent of Transformer
models such as Whisper has revolutionized this field. Whisper operates
directly on raw audio signals, effectively delivering high-performing ASR
outputs. In the subsequent section, we will embark on a project showcasing
this technology. We will record our own voice and utilize Whisper for
transcription, thereby demonstrating its practical application and
effectiveness.
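As a quick preview before the project, the sketch below runs Whisper through the Hugging Face pipeline on a small public sample. The dataset name, config and index used here are illustrative assumptions, not requirements; any short 16kHz recording works the same way.
# Minimal sketch: transcribe one short public sample with Whisper.
# The dataset, config and index below are illustrative assumptions.
import torch
from transformers import pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-large-v2", device=device)

sample = load_dataset("hf-internal-testing/librispeech_asr_dummy",
                      "clean", split="validation")[0]["audio"]
print(asr({"raw": sample["array"],
           "sampling_rate": sample["sampling_rate"]})["text"])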
Project — Custom Audio Transcription with ASR using Whisper
In this demonstration, we will illustrate how Whisper can be utilized to
transcribe any audio regardless of its source. We will specifically record
our own voices and then employ Whisper to carry out the transcription
process. This will give us an opportunity to see how this powerful
transformer model operates in a real-world application.
# importing necessary packages
import torch
from transformers import pipeline
from datasets import load_dataset
import torchaudio
Record Audio
We will use the ipywebrtc library to record the audio. You could use any
library, or an external dedicated audio tool such as Mac’s QuickTime
Player, to record high-quality audio.
from ipywebrtc import CameraStream, AudioRecorder
from IPython.display import display

# Create a camera stream with audio only (no video)
camera = CameraStream(constraints={'audio': True, 'video': False})
# Create an audio recorder attached to the stream
recorder = AudioRecorder(stream=camera)
# Display the recorder widget
display(recorder)
Save the Audio to Disk
TorchAudio works with a finite set of audio file formats, such as WAV,
MP3 and others. In this project, we will be converting the audio files into
the WAV format. However, if your audio is already in a format supported by
TorchAudio, you will not need to perform this step.
import ffmpeg
# Save the recording to a file
recorder.save('output.webm')
# Convert webm to wav
ffmpeg.input('output.webm').output('output.wav').run()
Pre-process the Audio
Whisper requires the audio signal to be mono (single-channel) and sampled
at 16kHz. Additionally, Hugging Face’s ASR pipeline expects the audio
signal to be in the form of a NumPy array. This preprocessing step is crucial
to ensure accurate transcription.
import torchaudio
import torchaudio.transforms as T

waveform, sample_rate = torchaudio.load('output.wav')
# If audio is stereo, convert to mono
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0)
# Resample the waveform to 16kHz
resampler = T.Resample(orig_freq=sample_rate, new_freq=16000)
waveform = resampler(waveform)
# Squeeze the tensor to remove the channel dimension
waveform = waveform.squeeze()
# Convert tensor to numpy array
waveform_numpy = waveform.numpy()
Make the Prediction
In this project, we will utilize the Hugging Face pipeline to make
predictions using a pre-trained model. The pipeline feature in Hugging Face
provides a user-friendly interface for working with pre-trained models. It
simplifies the process, especially for tasks that involve complex steps. Detailed information about pipelines is available here: https://huggingface.co/docs/transformers/main/en/quicktour#pipeline.
# Select GPU if available, otherwise fall back to CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline("automatic-speech-recognition",
                model="openai/whisper-large-v2",
                chunk_length_s=30,
                device=device)

prediction = pipe(waveform_numpy, batch_size=8)["text"]
print(prediction)
The code above provides an end-to-end implementation example.
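For convenience, the same preprocessing and prediction steps can be folded into a single helper. Below is a minimal sketch under the same model and preprocessing assumptions; transcribe_file is a hypothetical name, not a library function.
# Hypothetical helper that bundles the steps above: load a local audio
# file, convert it to mono 16kHz, and run it through the Whisper pipeline.
import torch
import torchaudio
import torchaudio.transforms as T
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline("automatic-speech-recognition",
                model="openai/whisper-large-v2",
                chunk_length_s=30, device=device)

def transcribe_file(path: str) -> str:
    waveform, sample_rate = torchaudio.load(path)
    # Stereo to mono
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    # Resample to 16kHz if needed
    if sample_rate != 16000:
        waveform = T.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)
    return pipe(waveform.squeeze().numpy(), batch_size=8)["text"]

# Example usage with the recording saved earlier
print(transcribe_file("output.wav"))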
Text To Speech
Let us now go over a text-to-speech project and learn how to convert text into speech.
Project — Implementing Text-to-Speech
In this project, we introduce a personal touch to a text-to-speech system
using speaker embeddings. These embeddings act like voice fingerprints, capturing the unique aspects of our voice. Instead of utilizing a pre-existing
one, we record our own voice to create a custom voice fingerprint. This
personalized voice print is subsequently incorporated into our speech
generation system, influencing the manner in which it converts written
words into spoken ones.
In the end, we subject our system to a test. We provide it with a piece of text
and allow it to perform its conversion magic, transforming that text into
speech.
# importing necessary packages
import os
import torch
from speechbrain.pretrained import EncoderClassifier
import torchaudio
import torchaudio.transforms as T
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech
from transformers import SpeechT5HifiGan
Declare Function For Creating Speaker Embedding
The microsoft/speecht5_tts model requires both text input and a speaker
embedding. The speaker embedding captures unique characteristics of
individual speakers, allowing downstream applications to recognize and
differentiate between speakers in different audio contexts. If you prefer to
use pre-built speaker embeddings based on various characteristics, you can
obtain them from the Matthijs/cmu-arctic-xvectors dataset.
However, in this example, we will record our own audio and
create our own speaker embedding using the speechbrain/spkrec-xvect-voxceleb model. The subsequent section presents a function to extract
the speaker embedding from the raw audio waveform.
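If you would rather not record anything, a ready-made x-vector can be pulled from the Matthijs/cmu-arctic-xvectors dataset instead. Here is a minimal sketch; the index 7306 is just one arbitrary speaker entry commonly used in Hugging Face examples.
# Sketch: load a pre-computed speaker x-vector instead of computing one.
import torch
from datasets import load_dataset

xvectors = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_emb = torch.tensor(xvectors[7306]["xvector"]).unsqueeze(0)
print(speaker_emb.shape)  # expected: torch.Size([1, 512])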
# Select GPU if available, otherwise fall back to CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

model_name = "speechbrain/spkrec-xvect-voxceleb"
speaker_classifier = EncoderClassifier.from_hparams(
    source=model_name,
    run_opts={"device": device},
    savedir=os.path.join("/tmp", model_name))

def compute_speaker_embedding(audio_data):
    with torch.no_grad():
        embeddings = speaker_classifier.encode_batch(torch.tensor(audio_data))
        embeddings = torch.nn.functional.normalize(embeddings, dim=2)
        embeddings = embeddings.squeeze().cpu().numpy()
    return embeddings
Perform Speaker Embedding
The file audio_sample2.wav contains a recording of my voice. You have the
option to record your own voice, which can be a few seconds long. The
subsequent code will pre-process the raw audio data and extract the speaker
embedding based on the provided audio data.
waveform, sample_rate = torchaudio.load('provide your audio file path here')
# If audio is stereo, convert to mono
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0)
# Resample the waveform to 16kHz
resampler = T.Resample(orig_freq=sample_rate, new_freq=16000)
waveform = resampler(waveform)

speaker_emb = compute_speaker_embedding(waveform)
speaker_emb = torch.tensor(speaker_emb).reshape(-1, 512)
print(speaker_emb.shape)
Declare Model
Let us describe what these models do for us.
- Processor — The SpeechT5Processor is responsible for processing the input text for the TTS system. It handles tasks such as tokenization, encoding and preparing the input data for the TTS model.
- Vocoder — Vocoder models are utilized to convert the synthesized speech into the final waveform or audio signal. The SpeechT5HifiGan model specifically employs the HiFi-GAN architecture, a high-fidelity generative adversarial network. This model enhances the quality of the generated speech waveform, ensuring that the output is clear, natural and pleasant to listen to.
- Model — The SpeechT5ForTextToSpeech model is the core component of the TTS system. It takes the processed input from the processor, the speaker embedding and the vocoder, and performs the text-to-speech conversion.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
Perform TTS
Lastly, we will carry out the TTS task and listen to the audio generated
based on the provided text and speaker embedding. As you listen, you will
observe that the speaker style closely resembles the characteristics of the
raw audio you previously recorded to create the speaker embedding. This
demonstrates the ability of the TTS system to replicate the desired speaker’s
voice and produce synthesized speech that aligns with the provided input.
inputs = processor(text="This is Harry. I live in New York City", return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"], speaker_emb, vocoder=vocoder)
from IPython.display import Audio
Audio(speech, rate=16000)
The above code provides an end-to-end implementation example.
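If you also want to keep the synthesized audio, the tensor returned by generate_speech can be written to a WAV file. Here is a small sketch, assuming the soundfile package is installed; the filename is arbitrary.
# Sketch: save the synthesized speech to disk (assumes soundfile is installed).
import soundfile as sf

sf.write("tts_output.wav", speech.numpy(), samplerate=16000)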
Audio To Audio
Audio-to-audio processing with transformers is an innovative approach to
handle various audio tasks like speech enhancement, source separation, music translation and even voice transformation. Audio-to-audio
processing can be thought of as a transformation function where the input
and output both are audio signals but with different characteristics. For
example, a noisy audio signal can be the input and the output would be a
denoised version of the same audio. Some applications of audio-to-audio
transformers include:
- Speech enhancement — In this application, the transformer model learns to filter out the noise and enhance the speech quality.
- Source separation — Transformers can be used to separate different audio sources in a mixed signal (a brief sketch follows this list).
- Music translation — Transformers can convert music from one style to another, essentially learning the characteristics of different music styles and applying them to input audio.
- Voice transformation — In voice transformation, the transformer model learns the unique features of a source and target voice. It then takes an audio input in the source voice and transforms it to sound like the target voice.
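The project below focuses on speech enhancement. As a taste of source separation, here is a brief sketch using SpeechBrain's pre-trained SepFormer model; the input file path is a placeholder, and this particular model expects an 8kHz two-speaker mixture.
# Sketch: separate two overlapping speakers with a pre-trained SepFormer model.
# The input path is a placeholder; the model expects an 8kHz two-speaker mixture.
import torchaudio
from speechbrain.pretrained import SepformerSeparation

sep_model = SepformerSeparation.from_hparams(
    source="speechbrain/sepformer-wsj02mix",
    savedir="pretrained_models/sepformer-wsj02mix")

# Returns a tensor of shape (batch, time, num_sources)
est_sources = sep_model.separate_file(path="mixed_speakers.wav")
torchaudio.save("speaker1.wav", est_sources[:, :, 0].detach().cpu(), 8000)
torchaudio.save("speaker2.wav", est_sources[:, :, 1].detach().cpu(), 8000)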
Project — Audio Quality Improvement Through Noise Reduction
In this code snippet, we leverage the SpeechBrain library to enhance audio
quality. SpeechBrain, a versatile Python library built on PyTorch, provides
an array of pre-trained models catering to a multitude of audio-related
tasks, encompassing speech recognition, speaker diarization and speech
enhancement, among others. Specifically, for our audio enhancement
objective, we will employ the pre-trained speechbrain/metricgan-plus-voicebank model.
Download the Noisy Audio
The audio clip has background noise that sounds like a busy city. In this bit of code, we will be getting this audio from GitHub and saving
it onto our computer.
import urllib.request

# URL of the audio file
url = "provide your audio file path here"
filename = "audio_noisy.wav"

# Download the file from `url` and save it locally under `filename`
urllib.request.urlretrieve(url, filename)
Load the Model and Pre-process the Audio Signals
In this code, we load the audio from a file, pre-process it to make it
single-channel and 16kHz, and finally normalize the audio.
import torch
import torchaudio
from speechbrain.pretrained import SpectralMaskEnhancement
from IPython.display import Audio, display

# Load the model
enhance_model = SpectralMaskEnhancement.from_hparams(
    source="speechbrain/metricgan-plus-voicebank",
    savedir="pretrained_models/metricgan-plus-voicebank",
)

# Load the audio
waveform, sample_rate = torchaudio.load(filename)

# If your waveform is stereo (2 channels), convert it to mono (1 channel) like this:
waveform = torch.mean(waveform, dim=0, keepdim=True)

# SpeechBrain's pre-trained models usually expect audio at 16kHz,
# so you might need to resample your audio if it's not at 16kHz:
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Now the waveform tensor is ready to be used with the enhancement model,
# but remember to normalize the audio data before using it:
noisy = waveform / torch.max(torch.abs(waveform))

# Listen to the noisy audio
print("Noisy audio:")
display(Audio(noisy.squeeze().detach().numpy(), rate=16000))
Perform Voice Enhancement by Removing the Noise
In the next bit of code, we are going to use our pre-trained model to
improve the sound of the audio. After we do that, we will save the cleaned-up audio and listen to it to see how well our model did.
# Add relative length tensor
enhanced = enhance_model.enhance_batch(noisy, lengths=torch.tensor([1.]))

# Save the enhanced signal to disk
torchaudio.save('enhanced.wav', enhanced.cpu(), 16000)

# Load and listen to the enhanced audio
print("Enhanced audio:")
enhanced_audio = torchaudio.load('enhanced.wav')[0]
display(Audio(enhanced_audio.squeeze().detach().numpy(), rate=16000))
Conclusion
This section examined key speech processing tasks and how transformer
models handle them. We dissected ASR, TTS and audio-to-audio conversion,
focusing particularly on speech enhancement. These topics were clarified
through detailed examples.
We started by exploring the basics of speech processing and transformers.
Next, we delved into ASR, underlining its role in converting speech into
written text. This was followed by a deep dive into TTS, detailing its
functioning and applications. Finally, we ventured into the realm of
audio-to-audio conversion and speech enhancement, revealing how they
improve audio signal quality.
This section demonstrated the versatility of transformers in managing
complex speech tasks, providing valuable knowledge for implementing
these models effectively. It showcased the promising future of transformers
in the sphere of audio and speech technology.