Discover an elegant speech-to-text solution in Python
Imagine you maintain a podcast platform and want to transcribe the audio into text. Would your reflex be to use an online service like AWS Transcribe? Or maybe you manage videos like Substack does and want to add subtitles to help disabled users follow them. Would you reach for an online service like AmberScript? What if I told you it is easier than ever to build your own solution with few resources…
In this article, I will present Whisper, a fantastic tool for transcribing audio files into text. It is an open-source project maintained by OpenAI, the company best known for ChatGPT. I will also show how to add subtitles to videos, so fasten your seatbelts!
Installation
The first prerequisite is to have the ffmpeg binary. It is the standard tool for working with audio/video files. The installation depends on your platform.
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on Arch Linux
sudo pacman -S ffmpeg
# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
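Once installed, you can check that the binary is reachable from your shell:
$ ffmpeg -version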
You will also need tools to build binary dependencies. On Linux/Unix systems, having gcc installed will probably do the job. On Windows, you will probably need to install Visual Studio with at least C/C++ support. I wrote an article on how to set up Python if you want more information.
Now we can install Whisper. It works with Python 3.9 or higher.
$ pip install openai-whisper
# or with poetry
$ poetry add git+https://github.com/openai/whisper.git
If you don’t know poetry, I have a tutorial here.
For those who know poetry, you may wonder why I installed Whisper from the Git repository and not with something like poetry add openai-whisper. This is because poetry tries to install the triton dependency on Windows (yes, I am a Windows user) even though it is marked as necessary only for Unix/Linux systems. I don’t know why it can’t read the markers in the setup file.
Usage
Command Line Interface
The Whisper project comes with a command-line tool of the same name. We will use it to get text from audio files. But first, we need to find audio files to play with 😆
My technique is to use the well-known YouTube platform. I find a video I want to use, for example, this song by one of my favorite soul singers, and download it using the you-get utility.
$ you-get 'https://somelink.com'
But we want an audio file as input, not a video file… This is where the ffmpeg utility comes in handy. It allows us to extract the audio from a video. Here is the command I use for this purpose (thanks ChatGPT for the help!):
$ ffmpeg -i input_video.mp4 -vn -ab 192k -ar 48000 -y output_audio.mp3
input_video.mp4 is the input file from which you want to extract the audio, and output_audio.mp3 is the resulting audio file that will be created. The -vn flag discards the video stream, -ab 192k sets the audio bitrate, -ar 48000 sets the sample rate, and -y overwrites the output file if it already exists.
The Whisper command is pretty straightforward to use. You pass the audio file as the only input argument and you are done.
# replace audio.mp3 with an audio file of your choice
$ whisper audio.mp3
# result
100%|███████████████████████████████████████| 461M/461M [00:19<00:00, 24.9MiB/s]
C:\Users\rolla\AppData\Local\pypoetry\Cache\virtualenvs\whisper-GlA7XE2g-py3.11\Lib\site-packages\whisper\transcribe.py:115: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:29.000] I let you settle down that you found a girl and you're married now
[00:29.000 --> 00:43.000] That's the old dreams came true, guess she gave you things I didn't give to you
[00:43.000 --> 00:53.000] Oh friend, why are you so shy? Ain't like you to hold back
[00:53.000 --> 01:00.000] Oh hide from the light, I hate to turn up out of the blue uninvited
[01:00.000 --> 01:06.000] But I couldn't stay away, I couldn't fight it, I had hoped you'd see my face
...
If it is the first time you use Whisper, you will see that it downloads something. It is a trained model used to analyze the audio file. By default, the base model is used if none is specified. There are different models available with different levels of accuracy; you can see the full list here. To switch to a more robust model like medium, you can type:
$ whisper audio.mp3 --model medium
Note: Running on GPU machines will make the processing even faster.
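For example, assuming you have a CUDA-capable GPU and a PyTorch build with CUDA support, you can select it explicitly with the --device option:
$ whisper audio.mp3 --model medium --device cuda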
A warning appears in my output about a Torch float type: this is because the Whisper command has an fp16 option set to True by default, but FP16 does not work on CPU architectures. If you encounter this warning, you can get rid of it by passing the option --fp16 False.
The next line is more interesting:
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Language detection adds processing time (Whisper analyzes up to the first 30 seconds of audio), so it is a good idea to always set the language if you know it in advance 😉
$ whisper audio.mp3 --language en
# or
$ whisper audio.mp3 --language english
As you see, we can set the language using either its ISO 639-1 code or its full name. The complete list can be found in this file.
You will also notice different files in your current directory: audio.txt, audio.tsv, audio.json, audio.vtt, and audio.srt. These are the different output formats supported by Whisper. The JSON and TSV formats are the simplest to manipulate since almost all programming languages have tools to handle them. For those who are wondering, TSV is just a variant of CSV that uses tabs instead of commas as separators. The TSV content is simple to understand: each row has two time indications in milliseconds and the message detected within this time frame. If you don’t want all these files, you can specify which format you want like the following:
# here we only want to transcribe text in json format
$ whisper audio.mp3 -f json
Unfortunately, we cannot pass multiple output formats in the command line 🥲(at the moment of writing).
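Since the JSON file is just a dump of the transcription result, it is easy to post-process. Here is a minimal sketch that reads the audio.json file produced above and prints each segment with its timestamps (unlike the TSV, the start and end values here are expressed in seconds):
import json
# load the transcription produced by: whisper audio.mp3 -f json
with open('audio.json', encoding='utf-8') as f:
    transcription = json.load(f)
# each segment carries start/end timestamps and the detected text
for segment in transcription['segments']:
    print(f"[{segment['start']:.1f}s --> {segment['end']:.1f}s] {segment['text']}")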
Note that you can process multiple files in a single command; they will be transcribed one after another.
# You can use multiple audio formats as long as they are supported by ffmpeg
$ whisper audio1.mp3 audio2.wav
You can also translate speech from another language to English (and only English at the moment of writing). For example, I used this random speech from a Spanish politician for testing.
$ whisper audio.mp3 --language es --task translate
The precision it achieves is really impressive, in my humble opinion.
Python code
If you don’t want to use the command-line interface and want more flexibility, you can also use the Python API. In the following snippet, the content you would find in the JSON file is printed.
from pprint import pprint
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
pprint(result, indent=4)
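The result is a plain dictionary: the full transcription lives under the text key, the detected language under language, and the timestamped chunks under segments. For example:
print(result['language'])  # detected language code, e.g. 'en'
print(result['text'])  # the whole transcription as one string
for segment in result['segments']:
    # start and end are expressed in seconds
    print(segment['start'], segment['end'], segment['text'])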
If you want to write files, there are Whisper utilities you can use for that:
from pathlib import Path
import whisper
from whisper.utils import WriteTXT, WriteVTT, WriteSRT, WriteJSON, WriteTSV
model = whisper.load_model('small')
audio_file = 'audio.mp3'
result = model.transcribe(audio_file)
# if you want to print details like in the terminal write this instead
# result = model.transcribe(audio_file, verbose=True)
current_dir = Path(__file__).parent.absolute()
print('write txt file')
WriteTXT(str(current_dir))(result, audio_file)
print('write vtt file')
WriteVTT(str(current_dir))(result, audio_file)
print('write srt file')
WriteSRT(str(current_dir))(result, audio_file)
print('write json file')
WriteJSON(str(current_dir))(result, audio_file)
print('write tsv file')
WriteTSV(str(current_dir))(result, audio_file)
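Depending on the Whisper version you have installed, you may also find a get_writer factory in whisper.utils that returns the right writer from a format name, which saves you from importing each class individually. A small sketch:
from whisper.utils import get_writer
# the format can be 'txt', 'vtt', 'srt', 'tsv' or 'json'
writer = get_writer('srt', str(current_dir))
writer(result, audio_file)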
Get text from a video and create a new one with subtitles
This is the most interesting part of this article. 😁
Before showing you the code, let’s break down the different actions we need to take:
Create a temporary directory to hold files during the video processing.
Extract the audio from a video (we already know how to do that) and save it in the temporary directory.
Get the transcribed text from the audio as we just saw above, and save only the SRT file (a simpler version of the VTT format), as it is the one of interest in our case.
Use ffmpeg to create a new video file with subtitles, using the original video file and the SRT file we just created.
Delete the temporary directory with the useless files.
First, we will create a helper function to run the ffmpeg command.
import subprocess
def run_command(command_list: list[str]) -> str | None:
    try:
        result = subprocess.run(command_list, check=True, text=True, capture_output=True)
        return result.stdout
    except subprocess.CalledProcessError as e:
        # ffmpeg writes its diagnostics to stderr, so show that on failure
        print(f"The command '{command_list}' failed with error:\n{e.stderr}")
The subprocess.run API is simple to follow. We pass the command as a list of arguments and check for errors. If everything is OK, we return the standard output.
Now, we can create a helper function to extract an audio file from a video file.
def extract_audio_from_video(video_path: str, temp_dir: str) -> Path:
directory = Path(temp_dir)
audio_file = directory / 'audio.wav'
command = ['ffmpeg', '-i', video_path, '-vn', '-ab', '192k', '-ar', '48000', '-y', str(audio_file)]
print(run_command(command))
return audio_file
Now, let’s tackle the function to get the transcribed text from an audio file.
def get_text_from_audio(
audio_file: Path,
temp_dir: str,
model_name: Literal['tiny', 'base', 'small', 'medium', 'large'] = 'small',
verbose: bool | None = None
) -> Path:
model = load_model(model_name)
result = model.transcribe(str(audio_file), verbose=verbose)
    # write the SRT file, closing the handle properly
    text_file = Path(temp_dir) / 'text.srt'
    with text_file.open('w', encoding='utf-8') as srt_handle:
        WriteSRT(temp_dir).write_result(result, srt_handle)
return text_file
Now, the function to create a video with subtitles.
def create_video_with_subtitles(
video_path: str, srt_file: Path, output_dir: str | None = None, output_file: str = 'output.mp4'
) -> None:
current_dir = Path(__file__).parent
output_dir = current_dir if output_dir is None else Path(output_dir)
output_file = output_dir / Path(output_file).name
copy_srt_file = Path(shutil.copy(srt_file, current_dir))
command = ['ffmpeg', '-i', video_path, '-vf', f'subtitles={copy_srt_file.name}', '-y', str(output_file)]
print(run_command(command))
copy_srt_file.unlink()
OK, you may have noticed that I copied the SRT file from the temporary directory into the current directory, used it in the command, and deleted it afterward. For a reason unknown to me, ffmpeg strips backslashes from the path passed to the subtitles filter and I end up with an error. I don’t know if this happens only on Windows or not 💁🏾♂️. If you have the answer, tell me in the comments. This is why I use this hack.
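A possible alternative to this copy hack (a sketch I have not verified on every platform): run ffmpeg from inside the temporary directory via the cwd parameter of subprocess.run, so the subtitles filter only ever sees a bare file name with no backslashes to escape. The function name create_video_with_subtitles_alt is purely illustrative, and the video and output paths must be absolute since ffmpeg no longer runs from the current directory.
def create_video_with_subtitles_alt(video_path: str, srt_file: Path, output_file: str = 'output.mp4') -> None:
    # resolve to absolute paths because ffmpeg will run from the temporary directory
    video = Path(video_path).resolve()
    output = Path(output_file).resolve()
    command = ['ffmpeg', '-i', str(video), '-vf', f'subtitles={srt_file.name}', '-y', str(output)]
    subprocess.run(command, check=True, text=True, capture_output=True, cwd=srt_file.parent)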
Let’s create a main function to wrap all the logic.
def main(video_path: str) -> None:
with tempfile.TemporaryDirectory() as temp_dir:
audio_file = extract_audio_from_video(video_path, temp_dir)
text_file = get_text_from_audio(audio_file, temp_dir)
create_video_with_subtitles(video_path, text_file)
And to run it, we can use this setup.
if __name__ == '__main__':
import sys
main(sys.argv[1])
The final file looks like this.
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Literal
from whisper import load_model
from whisper.utils import WriteSRT
def run_command(command_list: list[str]) -> str | None:
    try:
        result = subprocess.run(command_list, check=True, text=True, capture_output=True)
        return result.stdout
    except subprocess.CalledProcessError as e:
        # ffmpeg writes its diagnostics to stderr, so show that on failure
        print(f"The command '{command_list}' failed with error:\n{e.stderr}")
def extract_audio_from_video(video_path: str, temp_dir: str) -> Path:
directory = Path(temp_dir)
audio_file = directory / 'audio.wav'
command = ['ffmpeg', '-i', video_path, '-vn', '-ab', '192k', '-ar', '48000', '-y', str(audio_file)]
print(run_command(command))
return audio_file
def get_text_from_audio(
audio_file: Path,
temp_dir: str,
model_name: Literal['tiny', 'base', 'small', 'medium', 'large'] = 'small',
verbose: bool | None = None
) -> Path:
model = load_model(model_name)
result = model.transcribe(str(audio_file), verbose=verbose)
    # write the SRT file, closing the handle properly
    text_file = Path(temp_dir) / 'text.srt'
    with text_file.open('w', encoding='utf-8') as srt_handle:
        WriteSRT(temp_dir).write_result(result, srt_handle)
return text_file
def create_video_with_subtitles(
video_path: str, srt_file: Path, output_dir: str | None = None, output_file: str = 'output.mp4'
) -> None:
current_dir = Path(__file__).parent
output_dir = current_dir if output_dir is None else Path(output_dir)
output_file = output_dir / Path(output_file).name
copy_srt_file = Path(shutil.copy(srt_file, current_dir))
command = ['ffmpeg', '-i', video_path, '-vf', f'subtitles={copy_srt_file.name}', '-y', str(output_file)]
print(run_command(command))
copy_srt_file.unlink()
def main(video_path: str) -> None:
with tempfile.TemporaryDirectory() as temp_dir:
audio_file = extract_audio_from_video(video_path, temp_dir)
text_file = get_text_from_audio(audio_file, temp_dir)
create_video_with_subtitles(video_path, text_file)
if __name__ == '__main__':
import sys
main(sys.argv[1])
To test it, you can run:
# replace input_video.mp4 with a valid path to a video file
$ python subtitle.py input_video.mp4
You will see an output file called output.mp4 in your current directory. If you play it, the subtitles should be present in the video.
The script doesn’t offer the possibility to give a custom name to your output file, but you have the basis to implement it. I created a project with a command-line interface that you can use if you want. 😁
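As a starting point, here is a minimal sketch of how a custom output name could be exposed with argparse; the --output flag and its wiring are an illustration, not part of the script above, so main would need to forward the value to create_video_with_subtitles:
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Add subtitles to a video using Whisper')
    parser.add_argument('video_path', help='path to the input video file')
    parser.add_argument('-o', '--output', default='output.mp4', help='name of the subtitled video file')
    args = parser.parse_args()
    main(args.video_path)  # main would need an extra parameter to forward args.output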
Also, I want to thank the author of the subtitle project, who gave me the inspiration for this article.
This is all for this article; I hope you enjoyed reading it. Take care of yourself and see you soon. 😁