
Transcribing audio involves more than just converting speech into text. It requires structure and context. Speaker labels distinguish who’s talking, timestamps enable easy navigation, punctuation ensures readability, and confidence scores validate accuracy. Without these elements, transcripts become unstructured blocks of text that are hard to interpret or analyze.

Picovoice delivers advanced on-device automatic speech recognition (ASR) that’s metadata-rich, low-latency, and fully private. Powered by Leopard Speech-to-Text and Cheetah Streaming Speech-to-Text, Picovoice enables developers to generate accurate, structured transcriptions with speaker labels, timestamps, word confidence, capitalization, and punctuation without sending audio to the cloud.

Speaker Labels: "Who Spoke When"

Speech-to-text deals with "what is said." It converts speech into text without distinguishing speakers. Speaker labels identify "who spoke when" in a conversation through a process called speaker diarization. Instead of providing a single block of text, speaker labels break down multi-speaker audio into clearly attributed segments with tags like Speaker 1, Speaker 2, and so on. They turn raw transcripts into structured dialogue that is easy to follow and analyze.

For example, a raw speech-to-text transcript without speaker labels would look like:

"Thanks for joining today. Let's get started with the agenda."

With speaker labels:

Speaker 1: "Thanks for joining today."
Speaker 2: "Let's get started with the agenda."

Leopard Speech-to-Text integrates an optimized version of Falcon Speaker Diarization, so transcription and speaker labels come from a single engine.

These speaker labels are especially valuable in real-world scenarios such as meeting notes, media transcription workflows, and internal documentation. They make transcripts easier to search, analyze, and understand.
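
As a minimal sketch of how this looks in code, the snippet below groups Leopard's word-level output into speaker-attributed lines. It assumes the Python SDK's enable_diarization option and the speaker_tag field on returned words; treat the exact names as SDK-version dependent and check the docs for your release.

import pvleopard

# Create a Leopard instance with speaker diarization enabled
# (assumes the `enable_diarization` option in the Python SDK).
leopard = pvleopard.create(
    access_key="${ACCESS_KEY}",
    enable_diarization=True)

transcript, words = leopard.process_file("meeting.wav")

# Group consecutive words by their speaker tag into labeled segments.
segments = []
for word in words:
    if segments and segments[-1][0] == word.speaker_tag:
        segments[-1][1].append(word.word)
    else:
        segments.append((word.speaker_tag, [word.word]))

for speaker_tag, tokens in segments:
    print(f"Speaker {speaker_tag}: {' '.join(tokens)}")

leopard.delete()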

Timestamps: Sync Transcripts with Time

Timestamps in speech-to-text mark the exact start and end of each spoken phrase, creating time-aligned transcriptions that bring structure and context to raw audio. The output may look like:

"[00:07:20,191 - 00:07:20,447] don't you?"

They make transcripts context-aware by enabling navigation to specific moments, syncing speech with visuals, or automatically generating captions. Time-aligned text simplifies editing, reviewing, and integrating transcriptions with other media.

Beyond navigation, timestamps enhance accessibility and content organization. They help creators align transcripts with videos or podcasts for subtitles, help educators tag lectures for topic search, and allow journalists to produce searchable interview archives. Together, these capabilities turn raw recordings into structured, reusable content.
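
A minimal sketch of time-aligned output in Python: the helper below formats Leopard's per-word offsets (assumed to be start_sec/end_sec floats on each word, per the Python SDK) into the SRT-style "HH:MM:SS,mmm" form shown above.

import pvleopard

def to_srt_time(seconds):
    # Format seconds as the SRT-style "HH:MM:SS,mmm" used in captions.
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

leopard = pvleopard.create(access_key="${ACCESS_KEY}")
transcript, words = leopard.process_file("lecture.wav")

# Print each word with its start and end offsets.
for word in words:
    print(f"[{to_srt_time(word.start_sec)} - {to_srt_time(word.end_sec)}] {word.word}")

leopard.delete()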

Word Confidence: Confidence at the Word Level

Word confidence, also known as Word Confidence Estimation (WCE), indicates how certain the speech recognition model is about each transcribed word. ASR engines use prediction models that return outputs with probability scores between 0.0 (lowest confidence) and 1.0 (highest confidence). These per-word confidence scores are distinct from overall transcription accuracy, which is measured by Word Error Rate (WER).

Language learning apps such as Duolingo are a natural use case for WCE. When a user pronounces "bad", speech-to-text may return "bad", "dad", and "bed" with different probabilities, and the app can score the attempt and give feedback accordingly. Open-domain voice assistants such as Siri and Alexa also benefit from WCE: instead of acting on a low-confidence phrase directly, an assistant can ask a clarifying question or offer alternatives.
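
The sketch below shows the pattern in Python: flag words whose score falls below a review threshold. It assumes a confidence field on each returned word, a float in [0.0, 1.0] as described above; the 0.6 cutoff is an arbitrary illustration to tune per application.

import pvleopard

leopard = pvleopard.create(access_key="${ACCESS_KEY}")
transcript, words = leopard.process_file("pronunciation_test.wav")

# Flag words the model is unsure about (assumed `confidence` field).
# The threshold is illustrative, not a recommended value.
REVIEW_THRESHOLD = 0.6
for word in words:
    if word.confidence < REVIEW_THRESHOLD:
        print(f"'{word.word}' recognized with low confidence "
              f"({word.confidence:.2f}); prompt the user or mark for review")

leopard.delete()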

Capitalization and Punctuation: Make Transcripts Readable

Capitalization, also known as truecasing in Natural Language Processing (NLP), refers to restoring proper casing in speech-to-text transcriptions. The most common cases are sentence case (capitalizing the first word of a sentence) and proper-noun capitalization. Truecasing improves not only rEaDaBILiTY for humans but also the quality of input for downstream NLP tasks that would otherwise treat raw lowercase output as too noisy. Along with capitalization, punctuation also contributes to the readability of machine-generated transcripts.
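
Casing and punctuation can be requested at initialization. This minimal sketch uses the Python SDK's enable_automatic_punctuation option (the same flag visible in the C snippet further down); treat the parameter name as SDK-version dependent.

import pvleopard

# Request automatic punctuation (and, with it, casing) at creation time.
leopard = pvleopard.create(
    access_key="${ACCESS_KEY}",
    enable_automatic_punctuation=True)

transcript, _ = leopard.process_file("memo.wav")
print(transcript)  # e.g. "Thanks for joining today. Let's get started."

leopard.delete()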

Start Building Offline ASR Pipelines with Speech-to-Text Features

Start building with Picovoice's on-device ASR engines: Leopard Speech-to-Text for audio recordings, or Cheetah Streaming Speech-to-Text for real-time pipelines, using your favorite SDK.

Python:

o = pvleopard.create(access_key)

transcript, words = o.process_file(path)

Node.js:

const o = new Leopard(accessKey)

const { transcript, words } = o.processFile(path)

Android (Java):

Leopard o = new Leopard.Builder()
    .setAccessKey(accessKey)
    .setModelPath(modelPath)
    .build(appContext);

LeopardTranscript r = o.processFile(path);

iOS (Swift):

let o = Leopard(
    accessKey: accessKey,
    modelPath: modelPath)

let r = o.processFile(path)

Java:

Leopard o = new Leopard.Builder()
    .setAccessKey(accessKey)
    .build();

LeopardTranscript r = o.processFile(path);

.NET (C#):

Leopard o = Leopard.Create(accessKey);

LeopardTranscript result = o.ProcessFile(path);

React:

const {
  result,
  isLoaded,
  error,
  init,
  processFile,
  startRecording,
  stopRecording,
  isRecording,
  recordingElapsedSec,
  release,
} = useLeopard();

await init(accessKey, model);

await processFile(audioFile);

useEffect(() => {
  if (result !== null) {
    // Handle transcript
  }
}, [result]);

Flutter (Dart):

Leopard o = await Leopard.create(accessKey, modelPath);

LeopardTranscript result = await o.processFile(path);

React Native (JavaScript):

const o = await Leopard.create(accessKey, modelPath)

const { transcript, words } = await o.processFile(path)

C:

pv_leopard_t *leopard = NULL;
pv_leopard_init(
    access_key,
    model_path,
    enable_automatic_punctuation,
    &leopard);

char *transcript = NULL;
int32_t num_words = 0;
pv_word_t *words = NULL;
pv_leopard_process_file(
    leopard,
    path,
    &transcript,
    &num_words,
    &words);

Web (JavaScript):

const leopard = await LeopardWorker.fromPublicDirectory(
  accessKey,
  modelPath
);

const { transcript, words } = await leopard.process(pcm);
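
The snippets above cover Leopard's file-based API. For real-time pipelines, Cheetah processes audio frame by frame; below is a minimal sketch in Python, assuming the pvcheetah package and using a hypothetical next_audio_frame() stand-in for your audio capture code.

import pvcheetah

cheetah = pvcheetah.create(access_key="${ACCESS_KEY}")

try:
    while True:
        # `next_audio_frame()` is a placeholder for your capture code;
        # it should return `cheetah.frame_length` 16-bit PCM samples.
        frame = next_audio_frame()
        partial_transcript, is_endpoint = cheetah.process(frame)
        print(partial_transcript, end="", flush=True)
        if is_endpoint:
            # Finalize the current utterance at a detected endpoint.
            print(cheetah.flush())
finally:
    cheetah.delete()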

Frequently Asked Questions

What languages are supported by Leopard and Cheetah Speech-to-Text?
Leopard Speech-to-Text supports eight languages (English, French, German, Italian, Japanese, Korean, Portuguese, and Spanish) with benchmarked accuracy comparable to leading cloud APIs. Cheetah Streaming Speech-to-Text currently supports six languages (English, French, German, Italian, Portuguese, and Spanish) for real-time transcription.
Can I use custom vocabulary in speech-to-text transcriptions?
Yes. Both Leopard and Cheetah support custom vocabulary: developers can add domain-specific terms or proper nouns on the Picovoice Console to improve transcription accuracy. You can also add "boost words" (words with raised recognition priority) to improve correctness on specialized terms or industry jargon, keeping recognition accurate for industry-specific language across meetings, documentation, and analytics pipelines.
How scalable is offline speech-to-text for large deployments?
Picovoice models are lightweight (under 40 MB) and optimized for parallel processing across servers, desktops, mobile devices, and embedded hardware. This makes offline transcription suitable for enterprise-scale workloads, from distributed IoT networks to centralized analytics systems.