Transcribing audio involves more than just converting speech into text. It requires structure and context. Speaker labels distinguish who’s talking, timestamps enable easy navigation, punctuation ensures readability, and confidence scores validate accuracy. Without these elements, transcripts become unstructured blocks of text that are hard to interpret or analyze.
Picovoice delivers advanced on-device automatic speech recognition (ASR) that’s metadata-rich, low-latency, and fully private. Powered by Leopard Speech-to-Text and Cheetah Streaming Speech-to-Text, Picovoice enables developers to generate accurate, structured transcriptions with speaker labels, timestamps, word confidence, capitalization, and punctuation, all without sending audio to the cloud.
Speaker Labels: "Who Spoke When"
Speech-to-text deals with "what is said." It converts speech into text without distinguishing speakers. Speaker labels identify "who spoke when" in a conversation through a process called speaker diarization. Instead of providing a single block of text, speaker labels break down multi-speaker audio into clearly attributed segments with tags like Speaker 1, Speaker 2, and so on. They turn raw transcripts into structured dialogue that is easy to follow and analyze.
For example, a raw speech-to-text transcript without speaker labels would look like:
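Hey everyone, thanks for joining. Let's start with the quarterly numbers. Sure, revenue is up eight percent this quarter. Great, any blockers? None on my end.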
With speaker labels:
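Speaker 1: Hey everyone, thanks for joining. Let's start with the quarterly numbers.
Speaker 2: Sure, revenue is up eight percent this quarter.
Speaker 1: Great, any blockers?
Speaker 2: None on my end.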
Leopard Speech-to-Text comes with an optimized Falcon Speaker Diarization engine built in!
These speaker labels are especially valuable in real-world scenarios such as meeting notes, media transcription workflows, and internal documentation. They make transcripts easier to search, analyze, and understand.
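As a concrete sketch, here is how diarized output can be assembled with the Python SDK, assuming the enable_diarization option and a speaker_tag field on each returned word; the access key and file name are placeholders:

    import pvleopard

    # enable_diarization activates the integrated Falcon Speaker Diarization
    # and tags each returned word with a speaker (assumed Python SDK option).
    leopard = pvleopard.create(
        access_key="${ACCESS_KEY}",
        enable_diarization=True)

    transcript, words = leopard.process_file("meeting.wav")

    # Group consecutive words with the same speaker tag into attributed segments.
    segments = []
    for word in words:
        if segments and segments[-1][0] == word.speaker_tag:
            segments[-1][1].append(word.word)
        else:
            segments.append((word.speaker_tag, [word.word]))

    for speaker, tokens in segments:
        print(f"Speaker {speaker}: {' '.join(tokens)}")

    leopard.delete()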
Timestamps: Sync Transcripts with Time
Timestamps in speech-to-text mark the exact start and end of each spoken phrase, creating time-aligned transcriptions that bring structure and context to raw audio. The output may look like:
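[00:00.40 - 00:03.10] Hey everyone, thanks for joining.
[00:03.45 - 00:06.20] Let's start with the quarterly numbers.
[00:06.55 - 00:09.30] Sure, revenue is up eight percent this quarter.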
They make transcripts context-aware by enabling navigation to specific moments, syncing speech with visuals, or automatically generating captions. Time-aligned text simplifies editing, reviewing, and integrating transcriptions with other media.
Beyond navigation, timestamps enhance accessibility and content organization. They help creators align transcripts with videos or podcasts for subtitles, help educators tag lectures for topic search, and allow journalists to produce searchable interview archives. Together, these capabilities turn raw recordings into structured, reusable content.
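Word-level timestamps make caption generation almost mechanical. The sketch below uses the Python SDK's per-word start_sec and end_sec fields to emit SRT cues; the fixed eight-word cue size and file name are arbitrary choices for illustration, and a real pipeline would segment on pauses and sentence boundaries instead:

    import pvleopard

    def to_srt_time(sec):
        # Format seconds as an SRT timestamp: HH:MM:SS,mmm
        ms = int(round(sec * 1000))
        h, rem = divmod(ms, 3600000)
        m, rem = divmod(rem, 60000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    leopard = pvleopard.create(access_key="${ACCESS_KEY}")
    transcript, words = leopard.process_file("podcast.wav")

    # Emit one SRT cue per fixed-size group of words.
    WORDS_PER_CUE = 8
    for i in range(0, len(words), WORDS_PER_CUE):
        cue = words[i:i + WORDS_PER_CUE]
        print(i // WORDS_PER_CUE + 1)
        print(f"{to_srt_time(cue[0].start_sec)} --> {to_srt_time(cue[-1].end_sec)}")
        print(" ".join(w.word for w in cue))
        print()

    leopard.delete()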
Word Confidence: Confidence at the Word Level
Word confidence, also known as Word Confidence Estimation (WCE), indicates how certain the speech recognition model is about each transcribed word. ASR engines use prediction models that return outputs with probability scores between 0.0 (lowest confidence) and 1.0 (highest confidence). These per-word confidence scores are distinct from overall transcription accuracy, which is measured by Word Error Rate (WER).
Language-learning apps such as Duolingo are a natural use case for WCE. When a user pronounces "bad", speech-to-text may return "bad", "dad", and "bed" with different probabilities, and the app can use those probabilities to score the attempt and give feedback. Open-domain voice assistants such as Siri and Alexa also benefit from WCE: instead of acting on a low-confidence phrase directly, they can prompt the user with a clarifying question or a list of alternatives.
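A minimal sketch with the Python SDK shows how per-word confidence can drive this kind of behavior; the 0.75 threshold and file name are arbitrary assumptions for illustration:

    import pvleopard

    leopard = pvleopard.create(access_key="${ACCESS_KEY}")
    transcript, words = leopard.process_file("lesson.wav")

    # Surface words the model is unsure about so the app can ask the user
    # to repeat or confirm instead of silently accepting the guess.
    THRESHOLD = 0.75  # arbitrary cut-off chosen for this sketch
    for w in words:
        flag = "  <-- ask user to confirm" if w.confidence < THRESHOLD else ""
        print(f"{w.word:<16} {w.confidence:.2f}{flag}")

    leopard.delete()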
Capitalization and Punctuation: Make Transcripts Readable
Capitalization, also known as truecasing in Natural Language Processing (NLP), refers to restoring proper casing in speech-to-text transcriptions. The most common forms are sentence case (capitalizing the first word of each sentence) and proper-noun capitalization. Truecasing improves not only readability for humans but also the quality of input for NLP tasks that would otherwise treat raw transcripts as too noisy. Along with capitalization, punctuation also contributes to the readability of machine-generated transcripts.
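In the Python SDK, punctuation and casing are controlled by the enable_automatic_punctuation flag (the same option appears in the C snippet below); the access key, audio path, and example outputs here are illustrative:

    import pvleopard

    plain = pvleopard.create(access_key="${ACCESS_KEY}")
    punctuated = pvleopard.create(
        access_key="${ACCESS_KEY}",
        enable_automatic_punctuation=True)

    # Transcribe the same file with and without automatic punctuation.
    transcript_plain, _ = plain.process_file("interview.wav")
    transcript_punctuated, _ = punctuated.process_file("interview.wav")

    print(transcript_plain)       # e.g. "hi anna how are you doing"
    print(transcript_punctuated)  # e.g. "Hi Anna, how are you doing?"

    plain.delete()
    punctuated.delete()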
Start Building Offline ASR Pipelines with Speech-to-Text Features
Start building with Picovoice's on-device ASR engines, Leopard Speech-to-Text for audio recordings or Cheetah Streaming Speech-to-Text for real-time pipelines, using your favorite SDK.
Python:

    o = pvleopard.create(access_key)
    transcript, words = o.process_file(path)

Node.js:

    const o = new Leopard(accessKey)
    const { transcript, words } = o.processFile(path)

Android (Java):

    Leopard o = new Leopard.Builder()
        .setAccessKey(accessKey)
        .setModelPath(modelPath)
        .build(appContext);
    LeopardTranscript r = o.processFile(path);

iOS (Swift):

    let o = Leopard(
        accessKey: accessKey,
        modelPath: modelPath)
    let r = o.processFile(path)

Java:

    Leopard o = new Leopard.Builder()
        .setAccessKey(accessKey)
        .build();
    LeopardTranscript r = o.processFile(path);

.NET (C#):

    Leopard o = Leopard.Create(accessKey);
    LeopardTranscript result = o.ProcessFile(path);

React:

    const {
      result,
      isLoaded,
      error,
      init,
      processFile,
      startRecording,
      stopRecording,
      isRecording,
      recordingElapsedSec,
      release,
    } = useLeopard();

    await init(accessKey, model);

    await processFile(audioFile);

    useEffect(() => {
      if (result !== null) {
        // Handle transcript
      }
    }, [result])

Flutter (Dart):

    Leopard o = await Leopard.create(
        accessKey,
        modelPath);
    LeopardTranscript result = await o.processFile(path);

React Native:

    const o = await Leopard.create(
        accessKey,
        modelPath)
    const { transcript, words } = await o.processFile(path)

C:

    pv_leopard_t *leopard = NULL;
    pv_leopard_init(
        access_key,
        model_path,
        enable_automatic_punctuation,
        &leopard);

    char *transcript = NULL;
    int32_t num_words = 0;
    pv_word_t *words = NULL;
    pv_leopard_process_file(
        leopard,
        path,
        &transcript,
        &num_words,
        &words);

Web:

    const leopard = await LeopardWorker.fromPublicDirectory(
        accessKey,
        modelPath);
    const { transcript, words } = await leopard.process(pcm);