The launch of Leopard Speech-to-Text for recordings and Cheetah Speech-to-Text for streaming brought cloud-level automatic speech recognition (ASR) to local devices. Offline voice recognition has a unique advantage over cloud APIs: it eliminates the need for cloud processing, which improves privacy, removes network latency and makes it 10x more affordable. Picovoice’s cats, Leopard and Cheetah, now offer more.

Today, Picovoice announces the public availability of new speech-to-text features: timestamps, word confidence, capitalization and punctuation. They are available to all users, including those on the Free Plan.


Timestamps

Automatic Speech Recognition with timestamps adds a start and end time to each recognized word or phrase. The output may look like:

“[00:07:20,191 - 00:07:20,447] don't you?”

Timestamps make it possible to jump from the transcript to the corresponding part of the original audio recording, or to add subtitles.
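As a rough illustration, the sketch below turns per-word start and end times into a line in the format shown above. The `Word` class, its field names and the sample timings are assumptions made for this example, not the engine's actual output type.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical stand-in for one recognized word and its timing, in seconds.
    word: str
    start_sec: float
    end_sec: float


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def format_phrase(words: list[Word]) -> str:
    """Label a phrase with the start of its first word and the end of its last."""
    text = " ".join(w.word for w in words)
    start = format_timestamp(words[0].start_sec)
    end = format_timestamp(words[-1].end_sec)
    return f"[{start} - {end}] {text}"


# Made-up timings chosen to reproduce the example line above.
phrase = [Word("don't", 440.191, 440.320), Word("you?", 440.320, 440.447)]
print(format_phrase(phrase))
```

Grouping words into phrases is left to the caller here. A subtitle tool would typically split on pauses or on a maximum line length.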

Timestamps are widely used in speech analytics, in media and entertainment applications, and in transcribing conversations such as interviews, panel discussions and legal depositions. In speech analytics, someone analyzing a long audio file may want to jump to the beginning of a particular section. Timestamped transcriptions are also helpful when reviewing court transcripts or interviews. In media and entertainment, auto-generated subtitles with timestamps are used for news, podcasts, movies and even YouTube videos. They improve accessibility and make the content discoverable, which in turn helps boost its search engine presence.

Word Confidence

Word confidence is also known as Word Confidence Estimation, or WCE for short. At its core, voice recognition technology uses prediction models and returns each output with a certain probability. Reporting that probability, i.e. the confidence level, for each recognized word is called “word confidence.” The confidence level is a value between 0.0 and 1.0, with 0.0 the lowest and 1.0 the highest. Word confidence is not a measure of accuracy. (Word Error Rate, WER, is the most commonly used metric for speech-to-text accuracy.)

An app such as Duolingo could serve as an example use case for WCE. When a user pronounces “bad”, speech-to-text may return “bad”, “dad” and “bed” with different probabilities. Based on those probabilities, the app can provide a score and feedback to the user. Open-domain voice assistants such as Siri and Alexa also benefit from WCE. Instead of responding directly, a voice assistant can be designed to prompt users with a question or alternatives when phrases are recognized with low confidence.
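The voice assistant behaviour described above can be sketched as a simple threshold check. The `Word` class, the 0.5 threshold and the response strings are illustrative assumptions. A real application would tune the threshold for its own use case.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical stand-in for one recognized word and its confidence (0.0 to 1.0).
    word: str
    confidence: float


def low_confidence_words(words, threshold=0.5):
    """Return the words whose confidence falls below the threshold."""
    return [w.word for w in words if w.confidence < threshold]


def respond(words, threshold=0.5):
    """Ask for confirmation when any word is uncertain, otherwise accept the phrase."""
    unsure = low_confidence_words(words, threshold)
    if unsure:
        return "Did you mean: " + ", ".join(unsure) + "?"
    return "OK: " + " ".join(w.word for w in words)


# Made-up recognition results for demonstration.
print(respond([Word("play", 0.95), Word("bad", 0.42)]))
print(respond([Word("play", 0.95), Word("music", 0.91)]))
```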

Capitalization and Punctuation

Capitalization, also known as truecasing in Natural Language Processing (NLP), deals with capitalizing each word appropriately. Sentence-case capitalization (capitalizing the first word of a sentence) and proper-name capitalization are the most common uses of capitalization. AI-powered capitalization is an important feature for speech-to-text software as it makes the text output more readable. Truecasing improves not only the text rEaDaBILiTY for humans but also the quality of input for certain NLP cases which are otherwise considered too noisy. Along with capitalization, punctuation also contributes to the improved readability of machine-generated transcripts.
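To make the two common cases concrete, here is a deliberately naive, rule-based sketch of sentence-case and proper-name capitalization. It is a toy for illustration only and is not how Leopard or Cheetah restore case. The function names and the proper-noun list are assumptions, and the input must already contain punctuation.

```python
import re


def sentence_case(text: str) -> str:
    """Capitalize the first letter of the text and of each sentence after . ! or ?"""
    return re.sub(
        r"(^\s*|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


def apply_proper_nouns(text: str, proper_nouns) -> str:
    """Replace any word found in proper_nouns with its canonical casing."""
    lookup = {name.lower(): name for name in proper_nouns}
    return re.sub(
        r"[A-Za-z']+",
        lambda m: lookup.get(m.group(0).lower(), m.group(0)),
        text,
    )


def truecase(text: str, proper_nouns=()) -> str:
    """Apply sentence-case capitalization, then proper-name capitalization."""
    return apply_proper_nouns(sentence_case(text), proper_nouns)


print(truecase("picovoice builds leopard. it runs offline.", ["Picovoice", "Leopard"]))
```

A fixed word list cannot tell "Leopard" the product from "leopard" the animal, which is one reason truecasing is usually handled by a trained model.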

What's next?

Start building with Leopard Speech-to-Text for recordings or Cheetah Speech-to-Text for real-time with the Free Plan and your favourite SDK.

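import pvleopard
# access_key is your Picovoice AccessKey; audio_path is a placeholder name for the path to a recording
leopard = pvleopard.create(access_key)
transcript, words = leopard.process_file(audio_path)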

Need More?

Picovoice invests heavily in R&D to make state-of-the-art voice AI accessible and available to any developer. More features will become publicly available. Free Plan users can request new features in the Leopard and Cheetah GitHub repos and join the conversation. Enterprise Plan users get custom or early access to additional features as part of their commercial engagement. If your use case requires more, contact Picovoice Sales to tell us about your commercial project.