Learn how to add subtitles to any video using the Picovoice Leopard speech-to-text Python SDK.
First, extract the audio from your video content. You can accomplish this using a tool such as
FFmpeg. The Leopard ASR engine supports almost any audio format.
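As a rough sketch of the extraction step, the snippet below shells out to FFmpeg from Python. It assumes the `ffmpeg` binary is on your PATH; the helper names `build_ffmpeg_command` and `extract_audio` are made up for this illustration, and the flags shown are only one reasonable choice.

```python
import subprocess


def build_ffmpeg_command(video_path: str, audio_path: str) -> list:
    """Build an FFmpeg command that writes only the audio track to audio_path."""
    return [
        "ffmpeg",
        "-y",             # overwrite the output file if it already exists
        "-i", video_path,  # input video
        "-vn",            # drop the video stream
        audio_path,       # FFmpeg infers the audio format from the extension
    ]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run FFmpeg; raises subprocess.CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)
```

Calling `extract_audio("video.mp4", "audio.wav")` would then produce the audio file that Leopard processes in the following steps.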
Install Speech Recognition SDK
Install the Leopard STT Python SDK, which is published on PyPI as pvleopard.
Log in to (or sign up for) Picovoice Console. It is free. Grab your
AccessKey and use it to initialize Leopard.
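A minimal initialization sketch follows. It assumes the SDK was installed with `pip install pvleopard`. Reading the key from an environment variable named `PICOVOICE_ACCESS_KEY` is a convention invented for this example, not something the SDK requires, and the helper names are hypothetical.

```python
import os


def resolve_access_key(access_key=None):
    """Return the explicit key, else the PICOVOICE_ACCESS_KEY variable (a hypothetical name)."""
    key = access_key or os.environ.get("PICOVOICE_ACCESS_KEY")
    if not key:
        raise ValueError("No AccessKey provided")
    return key


def create_leopard(access_key=None, model_path=None):
    """Create a Leopard instance; model_path is optional (e.g. a model exported from Picovoice Console)."""
    # Imported lazily so this module still loads before pvleopard is installed.
    import pvleopard

    kwargs = {"access_key": resolve_access_key(access_key)}
    if model_path is not None:
        kwargs["model_path"] = model_path
    return pvleopard.create(**kwargs)
```

Keeping the AccessKey out of the source file makes it harder to leak the key by accident when sharing the script.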
Convert Voice to Text
Process the audio file. Leopard returns the transcription as a
str. It also returns a sequence of word-level metadata, including timestamps and confidence.
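The sketch below shows one way to call the engine and inspect what comes back. It assumes `leopard` is an initialized Leopard instance and that each metadata item exposes `word`, `start_sec`, `end_sec`, and `confidence`. The `Word` tuple here is only a stand-in for experimenting without the SDK, and `describe` is a hypothetical helper.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Stand-in with the same field names as Leopard's word metadata."""
    word: str
    start_sec: float
    end_sec: float
    confidence: float


def transcribe(leopard, audio_path: str):
    """Return (transcript, words) for the given audio file."""
    transcript, words = leopard.process_file(audio_path)
    return transcript, words


def describe(words) -> str:
    """Render one line per word with its timestamps and confidence."""
    return "\n".join(
        "%-12s %6.2f -> %6.2f  confidence=%.2f"
        % (w.word, w.start_sec, w.end_sec, w.confidence)
        for w in words
    )
```

Printing `describe(words)` gives a quick view of where each word falls on the timeline, which is the information the SRT conversion relies on.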
Convert to SRT Format
SRT (SubRip subtitle) is a file format for storing subtitles. The transcript is organized into
numbered sections, and each section is accompanied by a start and end timecode.
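To make the layout concrete, here is a small sketch that formats a single section. The helper names `to_timecode` and `format_section` and the sample text are invented for illustration. SRT timecodes use the form `HH:MM:SS,mmm`.

```python
def to_timecode(seconds: float) -> str:
    """Convert seconds to an SRT timecode (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, secs, ms)


def format_section(index: int, start_sec: float, end_sec: float, text: str) -> str:
    """Format one SRT section: index, timecode range, then the subtitle text."""
    return "%d\n%s --> %s\n%s\n" % (
        index, to_timecode(start_sec), to_timecode(end_sec), text
    )


print(format_section(1, 0.0, 2.5, "Hello and welcome."))
# 1
# 00:00:00,000 --> 00:00:02,500
# Hello and welcome.
```

Sections are separated from one another by a blank line.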
We need to break the transcription into sections. We use two criteria:
- If there is a certain duration of silence between two words, we consider it an endpoint (i.e. a breakpoint). The speaker is most likely done talking, and they or someone else will start a new sentence later.
- If the section already contains a certain number of words, we end it. Otherwise, the screen becomes too crowded.
The to_srt function implements both criteria.
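One possible shape for such a function is sketched below. The default thresholds (`endpoint_sec=1.0`, `length_limit=16`) are assumptions chosen for illustration, and the `Word` tuple is a stand-in with the same field names as Leopard's word metadata.

```python
from typing import NamedTuple, Optional, Sequence


class Word(NamedTuple):
    """Stand-in with the same field names as Leopard's word metadata."""
    word: str
    start_sec: float
    end_sec: float
    confidence: float


def to_timecode(seconds: float) -> str:
    """Convert seconds to an SRT timecode (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, secs, ms)


def to_srt(words: Sequence, endpoint_sec: float = 1.0,
           length_limit: Optional[int] = 16) -> str:
    """Split words into SRT sections on long silences or when a section is full."""
    blocks = []
    start = 0  # index of the first word in the current section
    for k in range(1, len(words) + 1):
        is_last = k == len(words)
        # Criterion 1: enough silence between word k-1 and word k.
        long_pause = (not is_last
                      and words[k].start_sec - words[k - 1].end_sec >= endpoint_sec)
        # Criterion 2: the current section already holds length_limit words.
        full = length_limit is not None and (k - start) >= length_limit
        if is_last or long_pause or full:
            section = words[start:k]
            blocks.append("%d\n%s --> %s\n%s\n" % (
                len(blocks) + 1,
                to_timecode(section[0].start_sec),
                to_timecode(section[-1].end_sec),
                " ".join(w.word for w in section),
            ))
            start = k
    return "\n".join(blocks)
```

Passing `length_limit=None` disables the word-count criterion, so sections break only on silence.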
Once we have the content of an SRT file, we can simply save it to disk.
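Saving is a plain text write. The sketch below uses a hypothetical helper, `save_srt`, and assumes UTF-8 encoding, which most video players accept for subtitles.

```python
from pathlib import Path


def save_srt(srt_content: str, path: str) -> Path:
    """Write the SRT text to path and return the resulting Path."""
    target = Path(path)
    target.write_text(srt_content, encoding="utf-8")
    return target
```

Many video players pick up subtitles automatically when the `.srt` file sits next to the video and shares its base name.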
Let's create subtitles for a YouTube video. First, install PyTube (published on PyPI as pytube).
Grab the URL for a video you wish to generate captions for (e.g.
https://www.youtube.com/watch?v=L1CK9bE3H_s) and download it using PyTube.
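A hedged sketch of the download step follows. It assumes PyTube was installed with `pip install pytube` and fetches only the audio track, since that is all Leopard needs. The function name `download_audio`, the default filename, and the `youtube_cls` parameter (included so the function can be exercised without network access) are assumptions made for this example.

```python
def download_audio(url: str, output_dir: str = ".", filename: str = "audio.mp4",
                   youtube_cls=None) -> str:
    """Download the audio-only stream of a YouTube video and return the file path."""
    if youtube_cls is None:
        # Imported lazily so this module still loads before pytube is installed.
        from pytube import YouTube as youtube_cls

    stream = youtube_cls(url).streams.get_audio_only()
    if stream is None:
        raise RuntimeError("No audio-only stream found for %s" % url)
    # skip_existing avoids downloading again if the file is already present.
    return stream.download(output_path=output_dir, filename=filename,
                           skip_existing=True)
```

The returned path can be handed straight to Leopard for processing.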
Then follow the steps in the previous sections to generate subtitles for it.
No speech recognition technology is 100% accurate. If you want to ensure zero errors, a manual check is required at the end, but this can take time. Leopard outputs word-level confidence metrics (i.e. a number between 0 and 1). An interesting avenue to explore is highlighting only low-confidence words for manual review to save time.
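As a starting point for that idea, the sketch below marks words whose confidence falls under a threshold. The 0.5 cutoff and the `[word?]` marker are arbitrary choices for illustration, and `Word` is again a stand-in with the same field names as Leopard's word metadata.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Stand-in with the same field names as Leopard's word metadata."""
    word: str
    start_sec: float
    end_sec: float
    confidence: float


def flag_low_confidence(words, threshold: float = 0.5) -> list:
    """Return only the words whose confidence is below the threshold."""
    return [w for w in words if w.confidence < threshold]


def annotate(words, threshold: float = 0.5) -> str:
    """Join words into text, wrapping low-confidence ones as [word?]."""
    return " ".join(
        "[%s?]" % w.word if w.confidence < threshold else w.word
        for w in words
    )
```

A reviewer can then jump directly to the flagged words, using their `start_sec` values to seek within the video.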
Custom Vocabulary and Boosting Keywords
Leopard allows adding custom vocabulary and boosting keywords via Picovoice Console. We can use this capability to increase Leopard's accuracy on domain-specific jargon (e.g. when transcribing a React tutorial).
The code is available on GitHub.