Learn how to add subtitles to any video using the Picovoice Leopard speech-to-text Python SDK.
First, extract the audio from your video content. You can do this with a tool such as
FFmpeg. The Leopard ASR engine supports almost any audio format.
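As a rough illustration of the extraction step, the sketch below calls FFmpeg from Python. It assumes the `ffmpeg` binary is on your PATH; the helper names `build_ffmpeg_command` and `extract_audio` and the file names in the usage note are placeholders for this example, not part of any SDK.

```python
import subprocess
from typing import List


def build_ffmpeg_command(video_path: str, audio_path: str) -> List[str]:
    """Build an FFmpeg command that keeps only the audio track."""
    # -y overwrites the output file if it exists; -vn drops the video stream.
    return ["ffmpeg", "-y", "-i", video_path, "-vn", audio_path]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run FFmpeg; raises CalledProcessError if FFmpeg exits with an error."""
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)
```

For example, `extract_audio("video.mp4", "audio.wav")` would write the audio track to `audio.wav`, with the output format inferred by FFmpeg from the file extension.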
Install Speech Recognition SDK
Install Leopard STT Python SDK:
Log in to (or sign up for) Picovoice Console. It is free. Copy your
AccessKey and initialize Leopard:
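A minimal sketch of the initialization, assuming the SDK has been installed with `pip install pvleopard`. The environment variable name `PICOVOICE_ACCESS_KEY` and the helper names are choices made for this example only; you can equally pass the AccessKey string directly to `pvleopard.create`.

```python
import os
from typing import Optional

# Hypothetical environment variable name, chosen for this sketch only.
ACCESS_KEY_ENV = "PICOVOICE_ACCESS_KEY"


def resolve_access_key(explicit_key: Optional[str] = None) -> str:
    """Return the explicit key if given, else read it from the environment."""
    key = explicit_key or os.environ.get(ACCESS_KEY_ENV)
    if not key:
        raise ValueError(f"Pass an AccessKey or set {ACCESS_KEY_ENV}")
    return key


def create_leopard(access_key: Optional[str] = None):
    """Create a Leopard engine instance."""
    # Imported lazily so this module loads even where the SDK is not installed.
    import pvleopard

    return pvleopard.create(access_key=resolve_access_key(access_key))
```

Keeping the AccessKey out of the source file makes it harder to commit it to a repository by accident.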
Convert Voice to Text
Process the audio file:
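The call can be sketched as follows, assuming a recent version of the SDK in which `process_file` returns a pair of transcript and word metadata. The wrapper name `transcribe` is hypothetical, and `leopard` is the engine instance created in the previous step.

```python
from typing import Any, Sequence, Tuple


def transcribe(leopard: Any, audio_path: str) -> Tuple[str, Sequence[Any]]:
    """Transcribe an audio file and return (transcript, words)."""
    # process_file accepts a path to the audio file extracted earlier.
    transcript, words = leopard.process_file(audio_path)
    return transcript, words
```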
Leopard returns the transcription as a
str. It also returns a sequence of word-level metadata including timestamps and confidence. For example:
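To show the shape of this metadata without running the engine, the sketch below uses a stand-in `Word` type that mirrors the fields Leopard reports (`word`, `start_sec`, `end_sec`, `confidence`). The stand-in type, the `format_words` helper, and any sample values you pass in are illustrative assumptions, not real engine output.

```python
from typing import NamedTuple, Sequence


class Word(NamedTuple):
    """Stand-in for Leopard's word metadata, for illustration only."""
    word: str
    start_sec: float
    end_sec: float
    confidence: float


def format_words(words: Sequence[Word]) -> str:
    """Render one line per word: text, start, end, confidence."""
    return "\n".join(
        f"{w.word:<10} {w.start_sec:>7.2f} {w.end_sec:>7.2f} {w.confidence:>5.2f}"
        for w in words
    )
```

With the real SDK, the second value returned by the engine can be passed to `format_words` directly, since it only reads these four attributes.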
Convert to SRT Format
SRT (SubRip subtitle) is a file format for storing subtitles. The transcript is organized in
sections and each section is accompanied by a start and end timecode. Below is a snippet of a given
We need to break the transcription into sections. We use two criteria:
- If the silence between two words exceeds a certain duration, we treat it as an endpoint (i.e. a breakpoint). The speaker has most likely finished, and they or someone else will start a new sentence later.
- If the section already contains a certain number of words, we end it. Otherwise, the screen becomes too crowded.
The to_srt function below implements these two rules:
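The following is one possible sketch of such a function, not necessarily the original implementation. The `Word` type is a stand-in that mirrors Leopard's word metadata, and the defaults (a 1-second silence and 16 words per section) are assumed values that you may want to tune.

```python
from typing import NamedTuple, Optional, Sequence


class Word(NamedTuple):
    """Stand-in for Leopard's word metadata, for illustration only."""
    word: str
    start_sec: float
    end_sec: float
    confidence: float


def second_to_timecode(seconds: float) -> str:
    """Convert seconds to the SRT timecode format HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(
        words: Sequence[Word],
        endpoint_sec: float = 1.0,
        length_limit: Optional[int] = 16) -> str:
    """Group words into SRT sections by silence gap and word count."""
    if not words:
        return ""

    # Collect (first, last) word indices for each section.
    sections = []
    start = 0
    for k in range(1, len(words)):
        silence = words[k].start_sec - words[k - 1].end_sec
        too_long = length_limit is not None and (k - start) >= length_limit
        if silence >= endpoint_sec or too_long:
            sections.append((start, k - 1))
            start = k
    sections.append((start, len(words) - 1))

    # Each SRT section: index, timecode line, text, blank line.
    lines = []
    for index, (first, last) in enumerate(sections, start=1):
        lines.append(str(index))
        lines.append(
            f"{second_to_timecode(words[first].start_sec)} --> "
            f"{second_to_timecode(words[last].end_sec)}")
        lines.append(" ".join(w.word for w in words[first:last + 1]))
        lines.append("")
    return "\n".join(lines)
```

Section numbering starts at 1 here, which is the usual SRT convention. Passing `length_limit=None` disables the word-count rule, so sections break only on silence.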
Once we have the content of an SRT file, we can simply save it:
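Saving is a plain file write. A minimal sketch follows; the helper name `save_srt` and the UTF-8 encoding are choices made for this example.

```python
from pathlib import Path


def save_srt(srt_content: str, path: str) -> None:
    """Write SRT content to disk as UTF-8 text."""
    Path(path).write_text(srt_content, encoding="utf-8")
```

Most video players pick up subtitles automatically when the `.srt` file shares the video's base name and sits in the same folder.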
Let's create subtitles for a YouTube video. First, install PyTube:
Grab the URL for a video you wish to generate captions for (e.g.
https://www.youtube.com/watch?v=L1CK9bE3H_s) and download it using PyTube:
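A sketch of the download step, assuming PyTube has been installed with `pip install pytube`. Since only the audio is needed for transcription, this example selects an audio-only stream. The helper name `download_audio` is hypothetical, and the actual file name and container format are chosen by PyTube based on the stream.

```python
VIDEO_URL = "https://www.youtube.com/watch?v=L1CK9bE3H_s"


def download_audio(url: str, output_dir: str = ".") -> str:
    """Download the first audio-only stream and return the saved file path."""
    # Imported lazily so this module loads even where PyTube is not installed.
    from pytube import YouTube

    stream = YouTube(url).streams.filter(only_audio=True).first()
    if stream is None:
        raise RuntimeError("No audio-only stream found for this video")
    return stream.download(output_path=output_dir)
```

The returned path can then be passed to Leopard for transcription.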
Then follow the steps in the previous sections to generate subtitles for it.
No speech recognition technology is 100% accurate. If you want to ensure zero errors, a manual check is required at the end, but this can take time. Leopard outputs a confidence value for each word (i.e. a number between 0 and 1). An interesting avenue to explore is highlighting only low-confidence words for manual review to save time.
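One way to explore this idea is sketched below. The `Word` type is a stand-in mirroring Leopard's word metadata, and the 0.5 threshold and bracket notation are arbitrary choices for this example that you would tune for your own content.

```python
from typing import NamedTuple, Sequence


class Word(NamedTuple):
    """Stand-in for Leopard's word metadata, for illustration only."""
    word: str
    start_sec: float
    end_sec: float
    confidence: float


def mark_for_review(words: Sequence[Word], threshold: float = 0.5) -> str:
    """Return the transcript with low-confidence words wrapped in brackets."""
    return " ".join(
        f"[{w.word}]" if w.confidence < threshold else w.word
        for w in words
    )
```

A reviewer can then scan only the bracketed words and use their timestamps to jump to the matching position in the video.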
Custom Vocabulary and Boosting Keywords
Leopard allows adding custom vocabulary and also boosting keywords via Picovoice Console. We can use this capability to increase the accuracy of Leopard for certain jargon (e.g. when transcribing a React tutorial).
The code is available on GitHub.