Power of Speech

The ultimate guide to speech-to-text transformation


In a world driven by voice, the ability to convert speech into text with precision has become the linchpin of modern communication. Speech-to-text technology is no longer just a convenience; it is a revolution transforming how we interact with our devices, conduct business and navigate the digital world. From real-time transcription to unlocking accessibility for millions, speech-to-text is breaking down barriers and creating new possibilities. Welcome to the frontier where spoken words effortlessly transform into written language, powering the future of communication.

Table of Contents

  • Speech to text
  • Text to speech
  • Audio to audio

Speech To Text


Automatic speech recognition (ASR) stands as a crucial process in the realm of speech processing. A comprehensive ASR system entails several components, including voice activity detection, speaker diarization and inverse text normalization. Historically, these tasks relied on an array of complex components, each carrying out a specific function. However, the advent of Transformer models such as Whisper has revolutionized this field. Whisper operates directly on raw audio signals, effectively delivering high-performing ASR outputs. In the subsequent section, we will embark on a project showcasing this technology. We will record our own voice and utilize Whisper for transcription, thereby demonstrating its practical application and effectiveness.

Project — Custom Audio Transcription with ASR using Whisper

In this demonstration, we will illustrate how Whisper can be utilized to transcribe any audio, regardless of its source. We will specifically record our own voices and then employ Whisper to carry out the transcription process. This will give us an opportunity to see how this powerful transformer model operates in a real-world application.

# importing necessary packages
import torch
from transformers import pipeline
import torchaudio

# select a GPU if one is available, otherwise fall back to CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

Record Audio

We will use the ipywebrtc library to record the audio. You could use any library, or an external dedicated recording tool such as macOS's QuickTime Player, to record high-quality audio.

from ipywebrtc import CameraStream, AudioRecorder
from IPython.display import display

# Create camera stream (audio only, no video)
camera = CameraStream(constraints={'audio': True, 'video': False})
# Create audio recorder
recorder = AudioRecorder(stream=camera)
# Display recorder widget
display(recorder)

Save the Audio to Disk

TorchAudio works with a finite set of audio file formats, such as WAV, MP3 and others. In this project, we will be converting the audio files into the WAV format. However, if your audio is already in a format supported by TorchAudio, you will not need to perform this step.

import ffmpeg
# Save the recording to a file
recorder.save('output.webm')
# Convert webm to wav
ffmpeg.input('output.webm').output('output.wav').run()
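
If you prefer to call the ffmpeg command-line tool directly instead of the ffmpeg-python binding, a minimal sketch (assuming the ffmpeg binary is installed and on your PATH) looks like this:

import subprocess

# Convert the recorded WebM file to WAV by invoking the ffmpeg CLI directly.
# The -y flag overwrites output.wav if it already exists.
subprocess.run(["ffmpeg", "-y", "-i", "output.webm", "output.wav"], check=True)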

Pre-process the Audio

Whisper requires the audio signal to be mono (single channel) and sampled at 16kHz. Additionally, Hugging Face's ASR pipeline expects the audio signal to be in the form of a NumPy array. This preprocessing step is crucial to ensure accurate transcription.

import torchaudio
import torchaudio.transforms as T

waveform, sample_rate = torchaudio.load('output.wav')

# If audio is stereo, convert to mono
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0)

# Resample the waveform to 16kHz
resampler = T.Resample(orig_freq=sample_rate, new_freq=16000)
waveform = resampler(waveform)

# Squeeze the tensor to remove the channel dimension
waveform = waveform.squeeze()

# Convert tensor to numpy array
waveform_numpy = waveform.numpy()

Make the Prediction

In this project, we will utilize the Hugging Face pipeline to make predictions using a pre-trained model. The pipeline feature in Hugging Face provides a user-friendly interface to work with pre-trained models. It simplifies the process, especially for tasks that involve complex steps. Detailed information about pipelines is available here: https://huggingface.co/docs/transformers/main/en/quicktour#pipeline.

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,
    device=device,
)

prediction = pipe(waveform_numpy, batch_size=8)["text"]
print(prediction)

The code above provides an end-to-end implementation example.
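
If you also want timestamps alongside the text, the Hugging Face ASR pipeline can return them. A minimal sketch, reusing the pipe and waveform_numpy objects from above, might look like this:

# Ask the pipeline for segment-level timestamps in addition to the text
result = pipe(waveform_numpy, batch_size=8, return_timestamps=True)
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])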

Text To Speech

Let us now walk through a text-to-speech project and learn how to convert our text into speech.

Project — Implementing Text-to-Speech

In this project, we introduce a personal touch to a text-to-speech system using speaker embeddings. These embeddings act like voice fingerprints that capture the unique aspects of our voice. Instead of utilizing a pre-existing one, we record our own voice to create a custom voice fingerprint. This personalized voice print is subsequently incorporated into our speech generation system, influencing the manner in which it converts written words into spoken ones.

In the end, we subject our system to a test. We provide it with a piece of text and allow it to perform its conversion magic, transforming that text into speech.

# importing necessary packages
import os
import torch
from speechbrain.pretrained import EncoderClassifier
import torchaudio
import torchaudio.transforms as T
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech
from transformers import SpeechT5HifiGan

# select a GPU if one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

Declare Function For Creating Speaker Embedding

The microsoft/speecht5_tts model requires both text input and a speaker embedding. The speaker embedding captures unique characteristics of individual speakers, allowing downstream applications to recognize and differentiate between speakers in different audio contexts. If you prefer to use pre-built speaker embeddings based on various characteristics, you can obtain them from the Matthijs/cmu-arctic-xvectors dataset.
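
For reference, a minimal sketch of loading one of those pre-built x-vectors from the Hugging Face Hub might look like the following; the row index is arbitrary, and any entry in the dataset provides a 512-dimensional embedding:

from datasets import load_dataset
import torch

# Load the CMU ARCTIC x-vectors and pick one speaker embedding (index is arbitrary)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
print(speaker_embedding.shape)  # expected: torch.Size([1, 512])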

However, in this example, we will record our own audio and create our own speaker embedding using the speechbrain/spkrec-xvect-voxceleb model. The subsequent section presents a function to extract the speaker embedding from the raw audio waveform.

model_name = "speechbrain/spkrec-xvect-voxceleb"

speaker_classifier = EncoderClassifier.from_hparams(source=model_name, run_opts={"device": device}, savedir=os.path.join("/tmp", model_name))

def compute_speaker_embedding(audio_data):
with torch.no_grad():
embeddings = speaker_classifier.encode_batch(torch.tensor(audio_data))
embeddings = torch.nn.functional.normalize(embeddings, dim=2)
embeddings = embeddings.squeeze().cpu().numpy()
return embeddings

Perform Speaker Embedding

The file audio_sample2.wav contains a recording of my voice. You have the option to record your own voice, which can be a few seconds long. The subsequent code will pre-process the raw audio data and extract the speaker embedding based on the provided audio data.

waveform, sample_rate = torchaudio.load('provide your audio file path here')

# If audio is stereo, convert to mono
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0)

# Resample the waveform to 16kHz
resampler = T.Resample(orig_freq=sample_rate, new_freq=16000)
waveform = resampler(waveform)

speaker_emb = compute_speaker_embedding(waveform)
speaker_emb = torch.tensor(speaker_emb).reshape(-1, 512)

print(speaker_emb.shape)

Declare Model

Let us describe what these models do for us.

  1. Processor — The SpeechT5Processor is responsible for processing the
    input text for the TTS system. It handles tasks such as tokenization,
    encoding and preparing the input data for the TTS model.
  2. Vocoder — Vocoder models are utilized to convert the synthesized
    speech into the final waveform or audio signal. The SpeechT5HifiGan
    model specifically employs the HiFi-GAN architecture, which is a
    high-fidelity generative adversarial network. This model enhances the
    quality of the generated speech waveform, ensuring that the output is
    clear, natural and pleasant to listen to.
  3. Model — The SpeechT5ForTextToSpeech model is the core component of
    the TTS system. It takes the processed input from the processor along
    with the speaker embedding and the vocoder, and performs the
    text-to-speech conversion.

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

Perform TTS

Lastly, we will carry out the TTS task and listen to the audio generated
based on the provided text and speaker embedding. As you listen, you will
observe that the speaker style closely resembles the characteristics of the
raw audio you previously recorded to create the speaker embedding. This
demonstrates the ability of the TTS system to replicate the desired speaker’s
voice and produce synthesized speech that aligns with the provided input.

from IPython.display import Audio

inputs = processor(text="This is Harry. I live in New York City", return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"], speaker_emb, vocoder=vocoder)

# Listen to the generated speech (SpeechT5 outputs audio at 16kHz)
Audio(speech.cpu().numpy(), rate=16000)

The code above provides an end-to-end implementation example.
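
Beyond playing the audio inline, you may also want to keep the result. A minimal sketch of writing the generated waveform to disk, assuming the speech tensor from the code above and that the soundfile package is installed, is:

import soundfile as sf

# SpeechT5 generates audio at 16kHz; write the waveform to a WAV file
sf.write("tts_output.wav", speech.cpu().numpy(), samplerate=16000)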

Audio To Audio

Audio-to-audio processing with transformers is an innovative approach to handling various audio tasks like speech enhancement, source separation, music translation and even voice transformation. Audio-to-audio processing can be thought of as a transformation function where the input and output are both audio signals, but with different characteristics. For example, a noisy audio signal can be the input and the output would be a denoised version of the same audio. Some applications of audio-to-audio transformers are:

  1. Speech enhancement — In this application, the transformer model
    learns to filter out the noise and enhance the speech quality.
  2. Source separation — Transformers can be used to separate different
    audio sources in a mixed signal (see the sketch after this list).
  3. Music translation — Transformers can convert music from one style to
    another, essentially learning the characteristics of different music
    styles and applying them to input audio.
  4. Voice transformation — In voice transformation, the transformer
    model learns the unique features of a source and target voice. It then
    takes an audio input in the source voice and transforms it to sound
    like the target voice.
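
As promised above, here is a brief sketch of source separation using SpeechBrain's pre-trained SepFormer model. The mixture file name is a placeholder, and this particular checkpoint expects 8kHz audio:

import torchaudio
from speechbrain.pretrained import SepformerSeparation

# Load a pre-trained SepFormer model trained on the WSJ0-2mix dataset
separator = SepformerSeparation.from_hparams(
    source="speechbrain/sepformer-wsj02mix",
    savedir="pretrained_models/sepformer-wsj02mix")

# Separate a two-speaker mixture into its individual sources
# ("mixture.wav" is a placeholder path for your own 8kHz mixture)
est_sources = separator.separate_file(path="mixture.wav")
torchaudio.save("source1.wav", est_sources[:, :, 0].detach().cpu(), 8000)
torchaudio.save("source2.wav", est_sources[:, :, 1].detach().cpu(), 8000)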

Project — Audio Quality Improvement Through Noise Reduction

In this code snippet, we are leveraging the power of the SpeechBrain library to enhance audio quality. SpeechBrain, a versatile Python library built on PyTorch, provides an array of pre-trained models catering to a multitude of audio-related tasks. These tasks encompass speech recognition, speaker diarization and speech enhancement, among others. Specifically, for our audio enhancement objective, we will employ the pre-trained speechbrain/metricgan-plus-voicebank model.

Download the Noisy Audio

The audio clip has background noise that sounds like a busy city. In this bit of code, we will download the audio from GitHub and save it onto our computer.

import urllib.request

# URL of the audio file
url = "provide your audio file path here"

filename = "audio_noisy.wav"

# Download the file from `url` and save it locally under `filename`
urllib.request.urlretrieve(url, filename)

Load the Model and Pre-process the Audio Signals

In this code, we load the audio from a file, pre-process it so that it is single-channel and sampled at 16kHz and finally normalize the audio.

import torch
import torchaudio
from speechbrain.pretrained import SpectralMaskEnhancement
from IPython.display import Audio, display

# load the model
enhance_model = SpectralMaskEnhancement.from_hparams(
    source="speechbrain/metricgan-plus-voicebank",
    savedir="pretrained_models/metricgan-plus-voicebank",
)

# load the audio
waveform, sample_rate = torchaudio.load(filename)

# If your waveform is stereo (2 channels) you can convert it to mono (1 channel) like this:
waveform = torch.mean(waveform, dim=0, keepdim=True)

# Usually, SpeechBrain's pre-trained models expect audio at 16kHz,
# so you might need to resample your audio if it's not at 16kHz:
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Now your waveform tensor is ready to be used with the enhancement model.
# But remember to normalize the audio data before using it:
noisy = waveform / torch.max(torch.abs(waveform))

# Listen to the noisy audio
print("Noisy audio:")
display(Audio(noisy.squeeze().detach().numpy(), rate=16000))

Perform Voice Enhancement by Removing the Noise

In the next bit of code, we are going to use our pre-trained model to improve the sound of the audio. After that, we will save the cleaned-up audio and listen to it to see how well our model did.

# Add relative length tensor
enhanced = enhance_model.enhance_batch(noisy, lengths=torch.tensor([1.]))

# Saving enhanced signal on disk
torchaudio.save('enhanced.wav', enhanced.cpu(), 16000)

# Load and listen to the enhanced audio
print("Enhanced audio:")
enhanced_audio = torchaudio.load('enhanced.wav')[0]
display(Audio(enhanced_audio.squeeze().detach().numpy(), rate=16000))
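
If you have several noisy recordings to clean, the same model can be reused in a loop. A minimal sketch, where the file names are hypothetical, might look like this:

# Enhance a list of noisy recordings with the already-loaded model
noisy_files = ["clip1.wav", "clip2.wav"]  # hypothetical file names
for path in noisy_files:
    wav, sr = torchaudio.load(path)
    # convert to mono, resample to 16kHz and normalize, as before
    wav = torch.mean(wav, dim=0, keepdim=True)
    if sr != 16000:
        wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(wav)
    wav = wav / torch.max(torch.abs(wav))
    cleaned = enhance_model.enhance_batch(wav, lengths=torch.tensor([1.]))
    torchaudio.save(path.replace(".wav", "_enhanced.wav"), cleaned.cpu(), 16000)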

Conclusion

This section examined key speech processing tasks and how transformer models handle them. We dissected ASR, TTS and audio-to-audio conversion, focusing particularly on speech enhancement. These topics were clarified through detailed examples.

We started by exploring the basics of speech processing and transformers. Next, we delved into ASR, underlining its role in converting speech into written text. This was followed by a deep dive into TTS, detailing its functioning and applications. Finally, we ventured into the realm of audio-to-audio conversion and speech enhancement, revealing how they improve audio signal quality.

This section demonstrated the versatility of transformers in managing complex speech tasks, providing valuable knowledge for implementing these models effectively. It showcased the promising future of transformers in the sphere of audio and speech technology.
