Learn how to add subtitles to any video using the Picovoice Leopard speech-to-text Python SDK.

Setup

Extract Audio

First, extract the audio from your video content using a tool such as FFmpeg. The Leopard ASR engine supports most common audio formats, including FLAC, MP3, MP4, m4a, Ogg, WAV, and WebM.
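The extraction step can be scripted from Python. The following is a minimal sketch, assuming the ffmpeg executable is installed and on the PATH; the helper names build_ffmpeg_command and extract_audio are illustrative and not part of any SDK.

```python
import subprocess
from typing import List


def build_ffmpeg_command(video_path: str, audio_path: str) -> List[str]:
    """Build an FFmpeg command that copies only the audio track to audio_path."""
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it already exists
        "-i", video_path,  # input video
        "-vn",             # drop the video stream, keep audio only
        audio_path,        # output; FFmpeg picks the codec from the extension
    ]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run FFmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)
```

Calling extract_audio("talk.mp4", "talk.flac") would write the audio track next to the video, provided FFmpeg can read the input.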

Install Speech Recognition SDK

Install the Leopard STT Python SDK, which is published on PyPI as pvleopard.

Log in to (or sign up for) Picovoice Console. It is free. Grab your AccessKey and use it to initialize Leopard.
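Initialization could look like the sketch below. It assumes the pvleopard package is installed and that the AccessKey is kept out of source code in an environment variable; the variable name PICOVOICE_ACCESS_KEY and the helper names are illustrative choices, not something the SDK requires.

```python
import os
from typing import Optional

# Hypothetical environment variable name chosen for this example.
ACCESS_KEY_ENV_VAR = "PICOVOICE_ACCESS_KEY"


def resolve_access_key(access_key: Optional[str] = None) -> str:
    """Return the explicit key if given, otherwise read it from the environment."""
    key = access_key or os.environ.get(ACCESS_KEY_ENV_VAR)
    if not key:
        raise ValueError(
            f"No AccessKey provided; pass one or set {ACCESS_KEY_ENV_VAR}."
        )
    return key


def create_leopard(access_key: Optional[str] = None):
    """Create a Leopard engine instance."""
    # Imported lazily so this module can be loaded without the SDK installed.
    import pvleopard

    return pvleopard.create(access_key=resolve_access_key(access_key))
```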

Implementation

Convert Voice to Text

Pass the path of the extracted audio file to Leopard for processing.
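A minimal sketch of this step, assuming an already initialized Leopard engine whose process_file method returns the transcript together with the word metadata. The wrapper name transcribe is illustrative.

```python
from typing import Any, List, Tuple


def transcribe(leopard: Any, audio_path: str) -> Tuple[str, List[Any]]:
    """Transcribe one audio file and return (transcript, words)."""
    transcript, words = leopard.process_file(audio_path)
    # Copy into a list so callers can index and slice the metadata freely.
    return transcript, list(words)
```

Once all files are processed, the engine's native resources can be released by calling its delete method.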

Leopard returns the transcription as a str, along with a sequence of word-level metadata that includes timestamps and a confidence value for each word.
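To get a feel for the metadata, each word can be printed with its timing and confidence. The sketch below uses a stand-in Word tuple with the fields word, start_sec, end_sec, and confidence, which is assumed to mirror what Leopard returns; the sample values are made up for illustration.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Stand-in for one entry of Leopard's word metadata (assumed field names)."""
    word: str
    start_sec: float
    end_sec: float
    confidence: float


def describe_word(w: Word) -> str:
    """Format one word with its time span and confidence."""
    return f"{w.word} [{w.start_sec:.2f}s - {w.end_sec:.2f}s] confidence={w.confidence:.2f}"


# Made-up sample, not real engine output.
for sample in [Word("hello", 0.5, 0.9, 0.97), Word("world", 1.0, 1.4, 0.88)]:
    print(describe_word(sample))
```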

Convert to SRT Format

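SRT (SubRip subtitle) is a file format for storing subtitles. The transcript is organized in sections, and each section is accompanied by a start and end timecode. In an .srt file, each section consists of a sequence number, a timecode line of the form HH:MM:SS,mmm --> HH:MM:SS,mmm, and the subtitle text, followed by a blank line.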
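The layout of a single section can be sketched in a few lines of Python. The helper names seconds_to_timecode and format_section are illustrative, and the sample text is made up.

```python
def seconds_to_timecode(seconds: float) -> str:
    """Convert seconds to the SRT timecode form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def format_section(index: int, start_sec: float, end_sec: float, text: str) -> str:
    """Render one SRT section: sequence number, timecode line, then the text."""
    timecodes = f"{seconds_to_timecode(start_sec)} --> {seconds_to_timecode(end_sec)}"
    return f"{index}\n{timecodes}\n{text}\n"


print(format_section(1, 0.0, 3.5, "Hello and welcome."))
```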

We need to break the transcription into sections. We use two criteria:

  • If there is a certain duration of silence between two words, we consider it an endpoint (i.e. a breakpoint). The speaker is most likely done talking, and they or someone else will start a new sentence later.
  • If the section already contains a certain number of words, we end it. Otherwise, the screen becomes too crowded.

The to_srt function implements both criteria.
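One way to write such a function is sketched here. It is not necessarily the original implementation: the parameter names endpoint_sec and length_limit and their defaults are assumptions, and Word is a stand-in for Leopard's word metadata with assumed field names.

```python
from typing import List, NamedTuple, Optional, Sequence


class Word(NamedTuple):
    """Stand-in for one entry of Leopard's word metadata (assumed field names)."""
    word: str
    start_sec: float
    end_sec: float
    confidence: float


def _timecode(seconds: float) -> str:
    """Convert seconds to the SRT timecode form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(
    words: Sequence[Word],
    endpoint_sec: float = 1.0,
    length_limit: Optional[int] = 16,
) -> str:
    """Split words into sections and render them as SRT text."""
    if not words:
        return ""

    sections: List[Sequence[Word]] = []
    start = 0
    for k in range(1, len(words)):
        # Criterion 1: a long enough silence before word k is an endpoint.
        silence = words[k].start_sec - words[k - 1].end_sec
        # Criterion 2: the current section already holds enough words.
        too_long = length_limit is not None and (k - start) >= length_limit
        if silence >= endpoint_sec or too_long:
            sections.append(words[start:k])
            start = k
    sections.append(words[start:])

    lines: List[str] = []
    for index, section in enumerate(sections, start=1):
        lines.append(str(index))
        lines.append(f"{_timecode(section[0].start_sec)} --> {_timecode(section[-1].end_sec)}")
        lines.append(" ".join(w.word for w in section))
        lines.append("")  # blank line terminates the section
    return "\n".join(lines)
```

Both thresholds are worth tuning per video: a shorter endpoint_sec breaks sections more eagerly in fast dialogue, while a smaller length_limit keeps each caption short enough for small screens.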

Once we have the content of the SRT file, we can simply save it to disk.
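Saving needs nothing beyond the standard library. In this sketch the helper name save_srt is illustrative; the file is written as UTF-8 so that non-ASCII characters in the transcript are preserved.

```python
from pathlib import Path


def save_srt(content: str, path: str) -> Path:
    """Write SRT text to path as UTF-8 and return the resulting Path."""
    target = Path(path)
    target.write_text(content, encoding="utf-8")
    return target
```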

YouTube Demo

Let's create subtitles for a YouTube video. First, install PyTube, which is published on PyPI as pytube.

Grab the URL of a video you wish to generate captions for (e.g. https://www.youtube.com/watch?v=L1CK9bE3H_s) and download it using PyTube.
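The download might be sketched as follows, assuming the pytube package is installed and can still reach YouTube. Only the audio stream is fetched because that is all the transcription needs. The helper name download_audio and its parameters are illustrative.

```python
def download_audio(url: str, output_dir: str, filename: str) -> str:
    """Download an audio-only stream of a YouTube video and return its path."""
    # Imported lazily so this module can be loaded without PyTube installed.
    from pytube import YouTube

    stream = YouTube(url).streams.filter(only_audio=True).first()
    if stream is None:
        raise RuntimeError(f"No audio-only stream found for {url}")
    return stream.download(output_path=output_dir, filename=filename)
```

The returned path can then be handed to the speech-to-text step as the audio file.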

Then follow the steps in the Implementation section to generate subtitles for it.

Explore

Transcription Confidence

No speech recognition technology is 100% accurate. If you need to ensure zero errors, a manual check is required at the end, which takes time. Leopard outputs a word-level confidence score (a number between 0 and 1). An interesting avenue to explore is highlighting only low-confidence words for manual review to save time.
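As a starting point for that idea, the sketch below wraps low-confidence words in brackets so a reviewer can spot them quickly. The threshold of 0.6 and the bracket markup are arbitrary choices for illustration, and Word is a stand-in for Leopard's word metadata with assumed field names.

```python
from typing import NamedTuple, Sequence


class Word(NamedTuple):
    """Stand-in for one entry of Leopard's word metadata (assumed field names)."""
    word: str
    start_sec: float
    end_sec: float
    confidence: float


def mark_for_review(words: Sequence[Word], threshold: float = 0.6) -> str:
    """Join words into text, wrapping those below the threshold as [word?]."""
    return " ".join(
        f"[{w.word}?]" if w.confidence < threshold else w.word for w in words
    )
```

The start_sec of each flagged word can also be used to jump straight to the matching position in the video during review.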

Custom Vocabulary and Boosting Keywords

Leopard allows adding custom vocabulary and boosting keywords via Picovoice Console. We can use this capability to improve Leopard's accuracy on domain-specific jargon (e.g. when transcribing a React tutorial).

The code is available on GitHub.