Speech-to-Text (STT), also known as Automatic Speech Recognition (ASR), transcribes audio and video files, which enables applications such as subtitling. This tutorial shows how to add subtitles to any video using the Picovoice Leopard Speech-to-Text Python SDK.

Setup

Extract Audio

First, extract the audio from your video content. You can do this with a tool such as FFmpeg. The Leopard ASR engine supports most common audio formats, including FLAC, MP3, MP4, m4a, Ogg, WAV, and WebM.
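One way to script this step is to call FFmpeg from Python. The sketch below assumes the ffmpeg binary is installed and on your PATH. The helper names, the file names, and the choice of mono 16 kHz output are illustrative assumptions, not requirements stated by Leopard.

```python
import subprocess
from typing import List


def build_ffmpeg_command(video_path: str, audio_path: str) -> List[str]:
    """Build an FFmpeg command that drops the video stream and writes mono 16 kHz audio."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", video_path,
        "-vn",           # ignore the video stream
        "-ac", "1",      # downmix to a single channel (our choice)
        "-ar", "16000",  # resample to 16 kHz (our choice)
        audio_path,
    ]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run FFmpeg; raises subprocess.CalledProcessError if FFmpeg exits with an error."""
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)
```

Calling extract_audio("talk.mp4", "talk.wav") would then produce a WAV file next to the video, assuming talk.mp4 exists.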

Install Speech Recognition SDK

Install the Leopard STT Python SDK from PyPI by running pip3 install pvleopard.

Log in to (or sign up for) Picovoice Console, which is free. Copy your AccessKey and initialize Leopard:
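A minimal sketch of the initialization follows. It uses pvleopard.create with the access_key argument. Reading the key from an environment variable named PICOVOICE_ACCESS_KEY, and both helper names, are assumptions made for this example so the key does not end up hard-coded in source.

```python
import os


def load_access_key(env_var: str = "PICOVOICE_ACCESS_KEY") -> str:
    """Read the AccessKey from an environment variable (the variable name is our own choice)."""
    access_key = os.environ.get(env_var, "").strip()
    if not access_key:
        raise RuntimeError("Set %s to the AccessKey from Picovoice Console" % env_var)
    return access_key


def create_leopard():
    """Create a Leopard instance; requires the pvleopard package to be installed."""
    import pvleopard  # imported lazily so this module loads even without the SDK

    return pvleopard.create(access_key=load_access_key())
```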

Implementation

Convert Voice to Text

Process the audio file:
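Assuming leopard is the instance created during setup, transcription is a single call to process_file, which returns the transcript together with the word metadata. The wrapper name transcribe and the up-front file check are additions for this sketch.

```python
import os
from typing import Any, List, Tuple


def transcribe(leopard: Any, audio_path: str) -> Tuple[str, List[Any]]:
    """Transcribe one audio file and return (transcript, words)."""
    if not os.path.isfile(audio_path):
        # Fail early with a clear message instead of passing a bad path to the engine.
        raise FileNotFoundError(audio_path)
    transcript, words = leopard.process_file(audio_path)
    return transcript, list(words)
```

When you no longer need the engine, calling leopard.delete() releases its resources.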

Leopard returns the transcription as a str. It also returns a sequence of word-level metadata, including timestamps and a confidence score for each word. For example:
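To give a feel for that metadata, the sketch below prints one line per word. The Word class is a stand-in that mirrors the fields this tutorial relies on (word, start_sec, end_sec, confidence), and the sample values are made up for illustration; they are not real Leopard output.

```python
from typing import NamedTuple, Sequence


class Word(NamedTuple):
    """Stand-in mirroring the word metadata fields used in this tutorial."""
    word: str
    start_sec: float
    end_sec: float
    confidence: float


def describe(words: Sequence[Word]) -> str:
    """Render one line per word: text, time span in seconds, and confidence."""
    return "\n".join(
        "%-8s %6.2f -> %6.2f  confidence=%.2f" % (w.word, w.start_sec, w.end_sec, w.confidence)
        for w in words
    )


# Made-up values for illustration only.
sample = [Word("hello", 0.10, 0.42, 0.97), Word("world", 0.48, 0.90, 0.81)]
print(describe(sample))
```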

Convert to SRT Format

SRT (SubRip subtitle) is a file format for storing subtitles. The transcript is organized into numbered sections, and each section carries a start and an end timecode. Below is a snippet of an .srt file:
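As a rough illustration of that layout, the sketch below renders a single section: a counter starting at 1, a line of the form HH:MM:SS,mmm --> HH:MM:SS,mmm, the text, and a trailing blank line. The helper names and the sample values are made up for this example.

```python
def second_to_timecode(seconds: float) -> str:
    """Convert seconds to an SRT timecode, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))  # work in whole milliseconds
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, secs, ms)


def format_section(index: int, start_sec: float, end_sec: float, text: str) -> str:
    """Render one SRT section; sections are separated by a blank line."""
    return "%d\n%s --> %s\n%s\n" % (
        index,
        second_to_timecode(start_sec),
        second_to_timecode(end_sec),
        text,
    )


print(format_section(1, 0.0, 1.25, "hello world"))
```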

We need to break the transcription into sections. We use two criteria:

  • If the silence between two words exceeds a certain duration, we treat it as an endpoint (i.e. a breakpoint). The speaker has most likely finished, and they or someone else will start a new sentence later.
  • If the section already holds a certain number of words, we close it. Otherwise, the screen gets too crowded.

The to_srt function below implements both criteria:
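A sketch of such a function, under stated assumptions: words expose start_sec, end_sec, and word attributes, as Leopard's word metadata does; the Word class here is only a stand-in so the example runs without the SDK. The default values for endpoint_sec and length_limit are illustrative and worth tuning for your content.

```python
from typing import List, NamedTuple, Optional, Sequence


class Word(NamedTuple):
    """Stand-in with the fields to_srt reads from each word."""
    word: str
    start_sec: float
    end_sec: float


def _timecode(seconds: float) -> str:
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return "%02d:%02d:%02d,%03d" % (h, m, s, ms)


def to_srt(
        words: Sequence[Word],
        endpoint_sec: float = 1.0,
        length_limit: Optional[int] = 16) -> str:
    """Group words into SRT sections, splitting on long silences or on section length."""
    lines: List[str] = []
    section = 1  # SRT section numbers start at 1
    start = 0    # index of the first word in the current section

    def flush(end: int) -> None:
        # Emit words[start..end] (inclusive) as one numbered section.
        lines.append(str(section))
        lines.append("%s --> %s" % (_timecode(words[start].start_sec), _timecode(words[end].end_sec)))
        lines.append(" ".join(w.word for w in words[start:end + 1]))
        lines.append("")

    for k in range(1, len(words)):
        silence = words[k].start_sec - words[k - 1].end_sec
        too_long = length_limit is not None and (k - start) >= length_limit
        if silence >= endpoint_sec or too_long:
            flush(k - 1)
            start = k
            section += 1
    if words:
        flush(len(words) - 1)
    return "\n".join(lines)
```

Passing length_limit=None disables the word-count rule, so sections break on silence only.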

Once we have the content of the SRT file, we can save it:
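Saving is an ordinary text-file write. The function name save_srt and the choice of UTF-8 encoding are assumptions for this sketch; UTF-8 keeps non-ASCII characters in the transcript intact.

```python
def save_srt(srt_content: str, subtitle_path: str) -> None:
    """Write the SRT text to disk as UTF-8."""
    with open(subtitle_path, "w", encoding="utf-8") as f:
        f.write(srt_content)
```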

YouTube Demo

Let's create subtitles for a YouTube video. First, install PyTube by running pip3 install pytube.

Grab the URL for a video you wish to generate captions for (e.g. https://www.youtube.com/watch?v=L1CK9bE3H_s) and download it using PyTube:
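A possible download helper is sketched below. It picks the first audio-only stream that PyTube reports and saves it locally. The function name, the default output directory, and the default file name are assumptions for this example.

```python
def download_audio(video_url: str, output_dir: str = ".", filename: str = "audio.mp4") -> str:
    """Download one audio-only stream of a YouTube video and return the saved file's path."""
    from pytube import YouTube  # lazy import: requires the pytube package

    stream = YouTube(video_url).streams.filter(only_audio=True).first()
    if stream is None:
        raise RuntimeError("No audio-only stream found for %s" % video_url)
    return stream.download(output_path=output_dir, filename=filename)
```

The returned path can be passed straight to Leopard, since the audio no longer needs to be separated from the video.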

Then follow the steps in the Implementation section to generate subtitles for it.

Explore

Transcription Confidence

No speech recognition technology is 100% accurate. If you need to ensure zero errors, a manual check is required at the end, which can take time. Leopard outputs a word-level confidence score (a number between 0 and 1). An interesting avenue to explore is highlighting only low-confidence words for manual review to save time.
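One way to explore this idea is to list only the words whose confidence falls below a threshold, along with where they occur. The Word class is again a stand-in for Leopard's word metadata, and the 0.5 threshold and the sample values are arbitrary choices for illustration.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Word(NamedTuple):
    """Stand-in with the fields read from each transcribed word."""
    word: str
    start_sec: float
    end_sec: float
    confidence: float


def words_to_review(words: Sequence[Word], threshold: float = 0.5) -> List[Tuple[int, Word]]:
    """Return (position, word) pairs whose confidence is below the threshold."""
    return [(i, w) for i, w in enumerate(words) if w.confidence < threshold]


# Made-up values for illustration only.
sample = [
    Word("open", 0.0, 0.3, 0.95),
    Word("reacts", 0.4, 0.9, 0.32),
    Word("docs", 1.0, 1.4, 0.88),
]
for position, w in words_to_review(sample):
    print("word %d at %.1fs: %r (confidence %.2f)" % (position, w.start_sec, w.word, w.confidence))
```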

Custom Vocabulary and Boosting Keywords

Leopard allows adding custom vocabulary and also boosting keywords via Picovoice Console. We can use this capability to increase the accuracy of Leopard for certain jargon (e.g. when transcribing a React tutorial).

If you’re ready to start transcribing your videos, find this demo on GitHub or visit Leopard Speech-to-Text Python SDK Quick Start and start building for free!