Speech-to-Text (STT), also known as Automatic Speech Recognition (ASR), transcribes audio and video files, enabling applications such as automatic subtitling. This tutorial shows how to add subtitles to any video using the Picovoice Leopard Speech-to-Text Python SDK.
Setup
Extract Audio
First, extract the audio from your video content. You can accomplish this using a tool such as FFmpeg. The Leopard ASR engine supports almost any audio format, including FLAC, MP3, MP4, m4a, Ogg, WAV, and WebM.
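As a rough sketch of this step, the snippet below drives FFmpeg from Python. It assumes the ffmpeg executable is on your PATH; the helper names build_ffmpeg_command and extract_audio are made up for this illustration and are not part of any SDK.

```python
import subprocess


def build_ffmpeg_command(video_path, audio_path):
    """Build an FFmpeg command that writes only the audio track to audio_path."""
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it already exists
        "-i", video_path,  # input video
        "-vn",             # drop the video stream
        audio_path,        # output format is inferred from the file extension
    ]


def extract_audio(video_path, audio_path):
    """Run FFmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)
```

For example, extract_audio("talk.mp4", "talk.flac") would write a FLAC file next to the video, provided FFmpeg is installed.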
Install Speech Recognition SDK
Install the Leopard STT Python SDK from PyPI (pip3 install pvleopard).
Log in to (or sign up for) Picovoice Console. It is free. Grab your AccessKey and initialize Leopard:
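A minimal initialization sketch might look like the following. It uses pvleopard.create from the SDK; reading the key from an environment variable named PICOVOICE_ACCESS_KEY is an assumption made here to keep the key out of source code, and the wrapper create_leopard is a hypothetical helper.

```python
import os


def create_leopard(access_key=None):
    """Create a Leopard engine from an explicit key or an environment variable."""
    # PICOVOICE_ACCESS_KEY is a name chosen for this sketch, not an SDK convention.
    key = access_key or os.environ.get("PICOVOICE_ACCESS_KEY")
    if not key:
        raise ValueError("no AccessKey given; copy yours from Picovoice Console")

    import pvleopard  # imported lazily so the error above does not need the SDK

    return pvleopard.create(access_key=key)
```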
Implementation
Convert Voice to Text
Process the audio file:
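The call can be sketched as below. It assumes a recent SDK version in which process_file returns both the transcript and the word metadata; transcribe is a hypothetical wrapper and the file name in the usage comment is a placeholder.

```python
def transcribe(leopard, audio_path):
    """Return (transcript, words) for an audio file using a Leopard engine."""
    transcript, words = leopard.process_file(audio_path)
    return transcript, list(words)


# Typical usage, assuming `leopard` was created with pvleopard.create(...):
#   transcript, words = transcribe(leopard, "talk.flac")
#   leopard.delete()  # release native resources when finished
```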
Leopard returns the transcription as a str. It also returns a sequence of word-level metadata including timestamps and confidence. For example:
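The sketch below shows one way to inspect that metadata. It assumes each word exposes word, start_sec, end_sec, and confidence attributes; the Word namedtuple is only a stand-in for the SDK's objects, and the sample values are invented for illustration, not real Leopard output.

```python
from collections import namedtuple

# Stand-in for the word objects Leopard returns (field names are assumed).
Word = namedtuple("Word", ["word", "start_sec", "end_sec", "confidence"])


def describe_words(words):
    """Format each word as one line: text, time span, and confidence."""
    return [
        "%-12s %6.2fs -> %6.2fs  confidence=%.2f"
        % (w.word, w.start_sec, w.end_sec, w.confidence)
        for w in words
    ]


# Invented sample data, used only to exercise the formatter.
sample = [
    Word("hello", 0.10, 0.42, 0.97),
    Word("world", 0.48, 0.90, 0.88),
]
for line in describe_words(sample):
    print(line)
```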
Convert to SRT Format
SRT (SubRip subtitle) is a file format for storing subtitles. The transcript is organized in sections, and each section is accompanied by a start and end timecode. Below is a snippet of a given .srt file:
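To make the layout concrete: each section is a running index, a line of the form HH:MM:SS,mmm --> HH:MM:SS,mmm, the subtitle text, and a blank line. The sketch below builds one such section; second_to_timecode and srt_section are helper names chosen for this illustration, and the subtitle text is made up.

```python
def second_to_timecode(seconds):
    """Convert seconds to the SRT timecode format HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, total_ms = divmod(total_ms, 3_600_000)
    minutes, total_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(total_ms, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, secs, millis)


def srt_section(index, start_sec, end_sec, text):
    """Build one SRT section: index, timecode range, text, trailing newline."""
    return "%d\n%s --> %s\n%s\n" % (
        index,
        second_to_timecode(start_sec),
        second_to_timecode(end_sec),
        text,
    )


print(srt_section(1, 0.0, 2.5, "Hello and welcome."))
```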
We need to break the transcription into sections. We use two criteria:
- If the silence between two consecutive words exceeds a certain duration, we treat it as an endpoint (i.e. a breakpoint). The speaker is most likely done talking, and they or someone else will start a new sentence later.
- If the section already contains a certain number of words, we close it. Otherwise, the screen becomes too crowded.
The to_srt function below implements these two criteria:
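A minimal sketch of such a function follows. It assumes each word has word, start_sec, and end_sec attributes; the defaults of 1 second of silence and 16 words per section are illustrative values to tune, and _timecode is a local helper introduced here.

```python
def _timecode(seconds):
    """Convert seconds to the SRT timecode format HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, total_ms = divmod(total_ms, 3_600_000)
    minutes, total_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(total_ms, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, secs, millis)


def to_srt(words, endpoint_sec=1.0, length_limit=16):
    """Group words into SRT sections and return the file content as a str."""
    if not words:
        return ""

    # Split the word list wherever one of the two criteria is met.
    sections = []
    start = 0
    for k in range(1, len(words)):
        silence = words[k].start_sec - words[k - 1].end_sec
        if silence >= endpoint_sec or (k - start) >= length_limit:
            sections.append(words[start:k])
            start = k
    sections.append(words[start:])

    # Render each section as index, timecode range, text, blank line.
    lines = []
    for index, section in enumerate(sections, start=1):
        lines.append(str(index))
        lines.append(
            "%s --> %s"
            % (_timecode(section[0].start_sec), _timecode(section[-1].end_sec))
        )
        lines.append(" ".join(w.word for w in section))
        lines.append("")
    return "\n".join(lines)
```

Lowering endpoint_sec produces more, shorter sections; lowering length_limit keeps each subtitle compact on screen.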
Once we get the content of an SRT file we can simply save it:
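Saving could be sketched as follows, assuming the SRT content is already available as a str. The helper name save_srt and the UTF-8 encoding are choices made for this example.

```python
from pathlib import Path


def save_srt(srt_text, srt_path):
    """Write SRT content to disk as UTF-8 and return the path written."""
    path = Path(srt_path)
    path.write_text(srt_text, encoding="utf-8")
    return path
```

Giving the .srt file the same base name as the video (for example talk.mp4 and talk.srt) lets many video players pick it up automatically.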
YouTube Demo
Let's create subtitles for a YouTube video. First, install PyTube (for example with pip3 install pytube).
Grab the URL for a video you wish to generate captions for (e.g. https://www.youtube.com/watch?v=L1CK9bE3H_s) and download it using PyTube:
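A download sketch using PyTube's YouTube class is shown below. It picks the first audio-only MP4 stream; that choice, the helper name download_audio, and the default file name are assumptions for this example. PyTube's behavior can vary with its version and with changes on YouTube's side.

```python
def download_audio(url, output_dir=".", filename="audio.mp4"):
    """Download the first audio-only MP4 stream of a video; return its path."""
    from pytube import YouTube  # imported lazily so this module loads without PyTube

    stream = (
        YouTube(url)
        .streams.filter(only_audio=True, file_extension="mp4")
        .first()
    )
    if stream is None:
        raise RuntimeError("no audio-only stream found for %s" % url)
    return stream.download(output_path=output_dir, filename=filename)
```

Because the result is already an audio file in a supported container, it can be passed to Leopard without a separate FFmpeg step.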
Go to the previous section and generate subtitles for it.
Explore
Transcription Confidence
No speech recognition technology is 100% accurate. If you want to ensure zero errors, a manual check is required at the end, but this can take time. Leopard outputs word-level confidence metrics (i.e. a number between 0 and 1). An interesting avenue to explore is highlighting only low-confidence words for manual review to save time.
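One possible way to do this is sketched below. It assumes each word has word and confidence attributes; the 0.5 threshold and the [[ ]] markers are arbitrary choices for this illustration and should be tuned on your own content.

```python
def flag_low_confidence(words, threshold=0.5):
    """Return the transcript with low-confidence words wrapped in [[ ]]."""
    return " ".join(
        "[[%s]]" % w.word if w.confidence < threshold else w.word
        for w in words
    )
```

A reviewer can then search the text for [[ and check only those words against the audio.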
Custom Vocabulary and Boosting Keywords
Leopard allows adding custom vocabulary and also boosting keywords via Picovoice Console. We can use this capability to increase the accuracy of Leopard for certain jargon (e.g. when transcribing a React tutorial).
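As a sketch, a custom model trained and downloaded from Picovoice Console can be loaded through the model_path argument of pvleopard.create. The wrapper create_custom_leopard and the up-front file check are assumptions added for this example.

```python
import os


def create_custom_leopard(access_key, model_path):
    """Create a Leopard engine that uses a custom model file from Picovoice Console."""
    if not os.path.isfile(model_path):
        raise FileNotFoundError("model file not found: %s" % model_path)

    import pvleopard  # imported lazily so the check above does not need the SDK

    return pvleopard.create(access_key=access_key, model_path=model_path)
```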
If you’re ready to start transcribing your videos, find this demo on GitHub or visit Leopard Speech-to-Text Python SDK Quick Start and start building for free!