Create Subtitles for Videos with Python

Speech-to-Text (STT), also known as Automatic Speech Recognition (ASR), transcribes audio and video files into text, which enables applications such as captioning. This tutorial shows how to add subtitles to any video using the Picovoice Leopard Speech-to-Text Python SDK.

Setup

Extract Audio

First, extract the audio from your video content. You can accomplish this using a tool such as FFmpeg. The Leopard ASR engine supports almost any audio format, including FLAC, MP3, MP4, m4a, Ogg, WAV, and WebM.
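
As a rough sketch of this step, the snippet below calls FFmpeg from Python through subprocess. It assumes the ffmpeg executable is available on your PATH. The helper names build_ffmpeg_command and extract_audio, and the choice of 16 kHz mono output, are illustrative assumptions and not requirements stated by Leopard.

```python
import subprocess
from typing import List


def build_ffmpeg_command(video_path: str, audio_path: str) -> List[str]:
    """Build an FFmpeg command that writes only the audio track of a video."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", video_path,
        "-vn",           # drop the video stream
        "-ac", "1",      # downmix to mono (an assumption, not a Leopard requirement)
        "-ar", "16000",  # resample to 16 kHz (also an assumption)
        audio_path,
    ]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run FFmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)
```

Calling extract_audio("talk.mp4", "talk.wav") would then write the audio track next to the video. The output format follows from the file extension you choose.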

Install Speech Recognition SDK

Install Leopard STT Python SDK:

pip install "pvleopard>=1.1"

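Import the package and create an instance of the engine using your AccessKey from Picovoice Console:

import pvleopard

# access_key is your Picovoice AccessKey string
leopard = pvleopard.create(access_key=access_key)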

Implementation

Convert Voice to Text

Process the audio file:

transcript, words = leopard.process_file(audio_path)

Leopard returns the transcription as a str. It also returns a sequence of word-level metadata, including timestamps and confidence. For example:

[
    {
        "word": "it's",
        "start_sec": 8.58,
        "end_sec": 8.70,
        "confidence": 0.78
    },
    {
        "word": "important",
        "start_sec": 8.77,
        "end_sec": 9.12,
        "confidence": 0.99
    },
    {
        "word": "for",
        "start_sec": 9.15,
        "end_sec": 9.22,
        "confidence": 0.96
    },
    ...
]

Convert to SRT Format

SRT (SubRip subtitle) is a plain-text file format for storing subtitles. The transcript is organized into numbered sections, and each section carries a start and end timecode. Below is a snippet of an .srt file:

0
00:00:08,576 --> 00:00:11,711
it's important for you to know how to mix your own colors to make your color

1
00:00:11,840 --> 00:00:16,351
palettes appear cohesive it is also cheaper than buying a tube of paint for every color

2
00:00:17,568 --> 00:00:21,600
here's how to paint a color wheel from scratch all you need to start are the

...
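
To make the layout concrete, here is a minimal sketch that reads SRT text back into tuples. It can serve as a sanity check on generated files. The function names timecode_to_ms and parse_srt are hypothetical, and the sketch assumes "\n" line endings with exactly one blank line between sections.

```python
import re
from typing import List, Tuple

# HH:MM:SS,mmm as used in the snippet above.
_TIMECODE = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def timecode_to_ms(timecode: str) -> int:
    """Convert an SRT timecode to integer milliseconds."""
    match = _TIMECODE.fullmatch(timecode.strip())
    if match is None:
        raise ValueError("not an SRT timecode: %r" % timecode)
    hour, minute, second, millisecond = (int(g) for g in match.groups())
    return ((hour * 60 + minute) * 60 + second) * 1000 + millisecond


def parse_srt(content: str) -> List[Tuple[int, int, int, str]]:
    """Return (index, start_ms, end_ms, text) for every section."""
    sections = []
    for block in content.strip().split("\n\n"):
        lines = block.splitlines()
        start, end = lines[1].split(" --> ")
        sections.append(
            (int(lines[0]), timecode_to_ms(start), timecode_to_ms(end), "\n".join(lines[2:])))
    return sections


sample = (
    "0\n"
    "00:00:08,576 --> 00:00:11,711\n"
    "it's important for you to know how to mix your own colors to make your color\n"
)
print(parse_srt(sample))
```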

We need to break the transcription into sections. We use two criteria:

If the silence between two words exceeds a certain duration, we treat it as an endpoint (i.e. a breakpoint). The speaker is most likely done talking, and they or someone else will start a new sentence later.
If the section already contains a certain number of words, we end it. Otherwise, the subtitle makes the screen too crowded.

The to_srt function below implements both criteria:

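from typing import Optional, Sequence

def second_to_timecode(x: float) -> str:
    # Convert to whole milliseconds first so float rounding cannot drop a millisecond.
    total_ms = int(round(x * 1000))
    hour, total_ms = divmod(total_ms, 3600 * 1000)
    minute, total_ms = divmod(total_ms, 60 * 1000)
    second, millisecond = divmod(total_ms, 1000)

    # SRT timecodes use the form HH:MM:SS,mmm.
    return '%.2d:%.2d:%.2d,%.3d' % (hour, minute, second, millisecond)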

def to_srt(
        words: Sequence[pvleopard.Leopard.Word],
        endpoint_sec: float = 1.,
        length_limit: Optional[int] = 16) -> str:
    def _helper(end: int) -> None:
        lines.append("%d" % section)
        lines.append(
            "%s --> %s" %
            (
                second_to_timecode(words[start].start_sec),
                second_to_timecode(words[end].end_sec)
            )
        )
        lines.append(' '.join(x.word for x in words[start:(end + 1)]))
        lines.append('')

    # Nothing to write if no words were recognized.
    if len(words) == 0:
        return ''

    lines = list()
    section = 0
    start = 0
    for k in range(1, len(words)):
        if ((words[k].start_sec - words[k - 1].end_sec) >= endpoint_sec) or \
                (length_limit is not None and (k - start) >= length_limit):
            _helper(k - 1)
            start = k
            section += 1
    _helper(len(words) - 1)

    return '\n'.join(lines)
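
To see how the two criteria interact, the sketch below applies the same splitting rule to a handful of made-up words. Word here is a stand-in namedtuple so the example runs without the SDK, split_sections is a hypothetical helper, and the timings are invented for illustration.

```python
from collections import namedtuple
from typing import List, Optional, Sequence

# Stand-in for pvleopard.Leopard.Word so the sketch runs without the SDK.
Word = namedtuple("Word", ["word", "start_sec", "end_sec", "confidence"])


def split_sections(
        words: Sequence[Word],
        endpoint_sec: float = 1.,
        length_limit: Optional[int] = 16) -> List[List[str]]:
    """Group words using the silence and length criteria described above."""
    sections = []
    start = 0
    for k in range(1, len(words)):
        long_silence = (words[k].start_sec - words[k - 1].end_sec) >= endpoint_sec
        too_long = length_limit is not None and (k - start) >= length_limit
        if long_silence or too_long:
            sections.append([w.word for w in words[start:k]])
            start = k
    if words:
        sections.append([w.word for w in words[start:]])
    return sections


words = [
    Word("it's", 8.58, 8.70, 0.78),
    Word("important", 8.77, 9.12, 0.99),
    Word("here's", 11.00, 11.30, 0.95),  # 1.88 s of silence before this word
    Word("how", 11.35, 11.50, 0.97),
]
print(split_sections(words))                  # split only at the long silence
print(split_sections(words, length_limit=1))  # one word per section
```

Raising endpoint_sec merges sections across longer pauses, while lowering length_limit shortens each subtitle.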

Once we get the content of an SRT file we can simply save it:

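# Write as UTF-8 so the result does not depend on the platform's default encoding.
with open(subtitle_path, 'w', encoding='utf-8') as f:
    f.write(to_srt(words))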

YouTube Demo

Let's create subtitles for a YouTube video. First, install PyTube:

pip install pytube

Grab the URL for a video you wish to generate captions for (e.g. https://www.youtube.com/watch?v=L1CK9bE3H_s) and download it using PyTube:

import os

from pytube import YouTube

youtube = YouTube(youtube_url)
audio_stream = youtube \
    .streams \
    .filter(only_audio=True, audio_codec='opus') \
    .order_by('bitrate') \
    .last()
audio_stream.download(
    output_path=os.path.dirname(audio_path),
    filename=os.path.basename(audio_path))

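With the audio saved at audio_path, follow the steps in the Implementation section: transcribe it with leopard.process_file and convert the returned words to SRT with to_srt.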

Explore

Transcription Confidence

No speech recognition technology is 100% accurate. If you need error-free subtitles, a manual check at the end is required, which takes time. Leopard outputs a word-level confidence score (i.e. a number between 0 and 1). One avenue to explore is flagging only low-confidence words for manual review to save time.
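
One possible way to approach this is sketched below. It wraps every word whose confidence falls below a threshold in brackets so a reviewer can find it quickly. Word is a stand-in namedtuple so the example runs without the SDK, and the helper names and the 0.8 threshold are assumptions to tune for your content.

```python
from collections import namedtuple
from typing import List, Sequence, Tuple

# Stand-in for pvleopard.Leopard.Word so the sketch runs without the SDK.
Word = namedtuple("Word", ["word", "start_sec", "end_sec", "confidence"])


def low_confidence_words(
        words: Sequence[Word],
        threshold: float = 0.8) -> List[Tuple[int, Word]]:
    """Return (position, word) pairs whose confidence is below the threshold."""
    return [(i, w) for i, w in enumerate(words) if w.confidence < threshold]


def mark_for_review(words: Sequence[Word], threshold: float = 0.8) -> str:
    """Join the words, wrapping low-confidence ones in square brackets."""
    return " ".join(
        "[%s]" % w.word if w.confidence < threshold else w.word for w in words)


words = [
    Word("it's", 8.58, 8.70, 0.78),
    Word("important", 8.77, 9.12, 0.99),
    Word("for", 9.15, 9.22, 0.96),
]
print(mark_for_review(words))  # [it's] important for
```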

Custom Vocabulary and Boosting Keywords

Leopard allows adding custom vocabulary and also boosting keywords via Picovoice Console. We can use this capability to increase the accuracy of Leopard for certain jargon (e.g. when transcribing a React tutorial).

If you’re ready to start transcribing your videos, find this demo on GitHub or visit Leopard Speech-to-Text Python SDK Quick Start and start building for free!
