Convert audio and video files to text on-device with speaker diarization, word-level confidence, and timestamps. 12× more efficient than Whisper at comparable accuracy, and more accurate still with custom vocabulary for names, jargon, and product terms.
Leopard Speech-to-Text converts audio and video recordings into accurate transcripts with embedded speaker diarization, custom vocabulary, word-level timestamps, word-level confidence scores, and automatic punctuation — without sending a single byte of audio to any server. It is the only production-ready speech-to-text SDK that runs genuinely on-device across every major platform a product ships on.
Cloud STT APIs deliver strong accuracy but require audio to leave the device on every inference. On-premise containerised alternatives run in Docker on multi-core CPU or GPU servers, which are not feasible for laptop, web, mobile, or embedded deployment. Leopard is built end-to-end by Picovoice with a proprietary training framework and inference engine. No Whisper, PyTorch, ONNX, or TensorFlow dependencies. Leopard runs on standard CPU hardware at compute costs that make embedded and mobile deployment economically viable, while delivering the enterprise feature set of a cloud transcription API, including speaker diarization, custom vocabulary, timestamps, confidence, and punctuation, in a single on-device SDK.
Leopard Speech-to-Text is well-suited to transcribing recordings of board meetings, interviews, internal strategy discussions, patient consultations, and legal depositions, in which conversations include confidential and personally identifiable information as well as industry- or company-specific jargon. The result is accurate, structured, speaker-labelled transcripts with domain-specific vocabulary, produced on the device where the recording is stored.
Leopard Speech-to-Text takes an audio file and returns a transcript with word-level timestamps, confidence scores, automatic punctuation and truecasing, and optional speaker labels in one SDK call. No cloud authentication, no container to provision. Use Leopard Speech-to-Text with its native SDKs for Android, C, .NET, Flutter, iOS, Java, Node.js, Python, React, React Native, and Web.
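In Python, that one SDK call looks roughly like the sketch below. It assumes the pvleopard package and a valid Picovoice AccessKey supplied by the caller; the import is kept inside the function so the sketch loads even before the package is installed:

```python
def transcribe(audio_path: str, access_key: str):
    """Transcribe a recorded file with Leopard and return the transcript
    plus per-word metadata (start/end seconds, confidence, speaker tag)."""
    # Imported lazily so this sketch can be read and loaded without
    # `pip install pvleopard`.
    import pvleopard

    leopard = pvleopard.create(
        access_key=access_key,
        enable_automatic_punctuation=True,  # punctuation and truecasing
        enable_diarization=True,            # per-word speaker tags
    )
    try:
        # Returns the full transcript string and a list of word metadata.
        transcript, words = leopard.process_file(audio_path)
        return transcript, words
    finally:
        leopard.delete()  # release native resources
```

The same create-process-delete shape applies across the other SDKs, with platform-native naming.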
Leopard Speech-to-Text is benchmarked against Amazon, Azure, Google, IBM Watson, and every Whisper size across six languages on Word Error Rate (WER) and core-hour compute cost. Leopard outperforms Whisper Base and Whisper Tiny in every non-English language it supports (French, German, Spanish, Italian, and Portuguese), despite being 12× more efficient than Whisper Base and 6× more efficient than Whisper Tiny.
Leopard Speech-to-Text delivers the feature set of an enterprise cloud transcription API (speaker diarization, custom vocabulary, word-level timestamps and confidence, automatic punctuation) on-device across every platform. It supports eight languages, benchmarked and tuned for production, and closes the accuracy gap to cloud APIs on non-English languages to under 3 points in most cases.
Set enable_diarization to true and Leopard returns a speaker-labelled transcript in the same output: start time, end time, speaker tag, and text per segment. For meetings, interviews, depositions, and contact centre recordings, one SDK call produces a transcript ready for analytics, CRM ingestion, or LLM summarization. On-device speech-to-text with cloud-level features. No cloud. No compromises.
Speech-to-text, also known as automatic speech recognition (ASR) and open-domain large vocabulary speech recognition (LVSR), refers to the technology that converts spoken audio into written text. It is the core building block of transcription, voice assistants, meeting notes, contact centre analytics, and any application that needs to understand or record what was said.
Leopard Speech-to-Text is Picovoice's on-device batch transcription engine: it processes completed audio and video recordings and returns accurate transcripts with embedded speaker diarization, custom vocabulary, word-level timestamps, confidence scores, and automatic punctuation.
Cloud, on-premise, and on-device speech-to-text are often used interchangeably, but they mean different things. Cloud speech-to-text sends audio over the internet to a vendor's servers for processing. On-premise speech-to-text runs on servers inside the customer's infrastructure, usually via Docker containers. On-device speech-to-text runs on the device where the voice originates (a phone, a laptop, a Raspberry Pi, or a web browser) without a dedicated server. On-device speech-to-text can run on-prem and in the cloud, but not the other way around.
Leopard Speech-to-Text is genuinely on-device and is the only commercial SDK that runs on any platform, including cloud, on-prem, mobile, embedded, and web.
On-device speech-to-text eliminates four problems with cloud APIs: privacy exposure (audio leaves the device on every inference), cost at scale, network dependency (cloud APIs cannot operate offline, in air-gapped environments, or in poor-connectivity deployments), and round-trip latency (network plus cloud processing adds 200–500ms before any transcription returns). For regulated industries, embedded products, mobile applications, and any workload where audio cannot leave the device, on-device is the only viable architecture.
OpenAI Whisper is the most widely used open-source speech-to-text model. Leopard achieves 9.7% WER on English compared to Whisper Small at 7.0% and Whisper Medium at 6.1%, but requires dramatically less compute: Leopard's CPU requirement is 0.026 core-hour versus 0.99 for Whisper Small (a 38× gap) and 1.52 for Whisper Medium (a 58× gap). Against Whisper Base, Leopard achieves comparable English accuracy (9.7% WER vs. 9.5%) using 12× less compute (0.026 core-hour vs. 0.32). Whisper Tiny, despite requiring 6× more resources than Leopard, makes 31% more errors (12.7% WER vs. Leopard's 9.7%).
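For reference, Word Error Rate is the word-level edit distance (substitutions + insertions + deletions) between a reference transcript and the engine's output, divided by the number of reference words. A minimal illustration of the metric, not the benchmark implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a three-word reference yields a WER of 1/3, i.e. 33.3%.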
On non-English languages, Leopard outperforms Whisper Base across all five benchmarked languages. Unlike Whisper, Leopard is built end-to-end on a proprietary inference engine with no PyTorch, ONNX, or TensorFlow runtime dependency, which is what enables deployment on mobile and embedded hardware where Whisper Small and larger cannot realistically run.
These base accuracy results can be improved by customizing Leopard for the target domain. Adding new vocabulary or boosting keywords with Leopard is as simple as typing the word, without requiring any coding experience, let alone machine learning expertise. Furthermore, Leopard Speech-to-Text comes with native SDKs for cross-platform deployment and enterprise support.
Amazon Transcribe achieves 4.3% WER on English — better than Leopard's 9.7% — but requires audio to be sent to AWS servers, with per-minute billing, regional availability constraints, and no offline operation.
For low-volume applications, PoCs, and hobby projects where compliance is not a constraint, Amazon Transcribe is a better choice given its accuracy.
For regulated industries, air-gapped environments, mobile applications, high-volume applications, and embedded deployments, Leopard is the enterprise choice over Amazon Transcribe.
Azure Speech-to-Text achieves 5.5% WER on English and is available either as a cloud API or as a Docker container for on-premise deployment. Both options require server-grade infrastructure and are cloud-dependent for licensing, even in the container case.
For on-prem deployments, low-volume applications, PoCs, and hobby projects where compliance is not a constraint, Azure Speech-to-Text is a better choice given its accuracy.
For regulated industries, air-gapped environments, mobile applications, high-volume applications, and embedded deployments, Leopard is the enterprise choice over Azure Speech-to-Text.
Google Speech-to-Text achieves 8.9% WER on English — within 1 point of Leopard's 9.7%. On German, Leopard is within 1.1 points of Google (14.5% vs 13.4%). On French and Spanish, the gap is under 3.1 points. Google's API is cloud-only and requires audio transmission to Google's servers with per-minute billing.
For low-volume applications, PoCs, and hobby projects where compliance is not a constraint, Google Speech-to-Text is a better choice given its accuracy and flexibility.
For regulated industries, air-gapped environments, mobile applications, high-volume applications, and embedded deployments, Leopard is the enterprise choice over Google Speech-to-Text.
The best speech-to-text engine depends on three architectural questions: whether audio may leave the device, what accuracy the target domain requires, and what inference costs at deployment scale.
Learn more about selecting the best speech-to-text.
Yes, Leopard Speech-to-Text processes audio entirely on-device and never transmits it to any server. Leopard Speech-to-Text is compliant with GDPR, HIPAA, CCPA, CJIS, and SOC 2 by architecture, not policy. Cloud speech-to-text providers achieve regulatory compliance through contractual controls and Business Associate Agreements; Leopard achieves it architecturally. There is no audio to regulate in transit or at rest outside the customer's infrastructure, because no audio ever leaves it. Picovoice cannot access end-user audio.
Leopard Speech-to-Text doesn't, but Cheetah Streaming Speech-to-Text does. Cheetah is Picovoice's on-device streaming speech-to-text engine that provides text output in real time.
Yes. Leopard Speech-to-Text embeds an optimized version of Falcon Speaker Diarization. Once enabled with a single configuration flag, Leopard returns a speaker-labelled transcript in the same output: start time, end time, speaker tag, and transcribed text per segment. No separate engine to integrate, no pipeline to architect. For real-time streaming speaker diarization, Bluebird Streaming Speaker Diarization pairs with Cheetah Streaming Speech-to-Text.
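Leopard's diarized output is per-word; turning it into the speaker-labelled segments described above is a simple fold over consecutive speaker tags. A sketch on illustrative data, with a small Word type standing in for the SDK's word metadata:

```python
from dataclasses import dataclass

@dataclass
class Word:
    """Stands in for the SDK's per-word metadata (illustrative, not the SDK type)."""
    word: str
    start_sec: float
    end_sec: float
    speaker_tag: int

def to_segments(words):
    """Merge consecutive same-speaker words into
    (start, end, speaker, text) segments."""
    segments = []
    for w in words:
        if segments and segments[-1]["speaker"] == w.speaker_tag:
            # Same speaker keeps talking: extend the current segment.
            segments[-1]["end"] = w.end_sec
            segments[-1]["text"] += " " + w.word
        else:
            segments.append({"start": w.start_sec, "end": w.end_sec,
                             "speaker": w.speaker_tag, "text": w.word})
    return segments

# Illustrative two-speaker exchange.
words = [
    Word("hello", 0.1, 0.4, 1),
    Word("there", 0.5, 0.8, 1),
    Word("hi", 1.2, 1.4, 2),
]
segments = to_segments(words)
```

Each resulting segment carries exactly the fields named above, ready for analytics or summarization pipelines.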
Check Leopard Speech-to-Text documentation for more information.
Yes. Leopard supports domain-specific custom vocabulary through the Picovoice Console, allowing developers to add product names, medical terminology, legal phrases, technical jargon, or proper nouns that general-purpose models routinely misrecognise. Adding a term is as simple as typing it; no labelled audio data is required.
For selected enterprise customers, two further options are available: the Leopard Speech-to-Text API, which lets both developers and end users add custom vocabulary and boost keywords from any device via cloud API without visiting the Picovoice Console, and custom models built via professional services.
Yes. Along with the transcript, Leopard Speech-to-Text returns metadata for each transcribed word, including its start and end timestamps, a confidence score, and, when diarization is enabled, a speaker tag.
Word-level timestamps enable media synchronisation for captioning, searchable audio archives, highlight generation, and precision editing. Confidence scores support human-in-the-loop review workflows — flagging low-confidence words for verification without manually re-auditing entire transcripts. Please visit the Leopard Speech-to-Text SDK documentation, such as the Leopard Speech-to-Text Python SDK, to learn more.
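A human-in-the-loop pass over that metadata can be sketched as a filter on per-word confidence; the words, scores, and threshold below are illustrative, not SDK output:

```python
def flag_for_review(words, threshold=0.6):
    """Return (index, word, confidence) for words whose confidence falls
    below the threshold, so reviewers can jump straight to uncertain spans."""
    return [(i, w, c) for i, (w, c) in enumerate(words) if c < threshold]

# Illustrative (word, confidence) pairs; "Xylotran" is a made-up product
# name of the kind custom vocabulary exists to fix.
words = [("quarterly", 0.98), ("revenue", 0.95),
         ("Xylotran", 0.41), ("grew", 0.97)]
flagged = flag_for_review(words)
```

Only the low-confidence word is surfaced, so a reviewer verifies one token instead of re-auditing the whole transcript.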
Yes. Leopard applies automatic punctuation and truecasing to transcripts by default. Commas, periods, question marks, and proper capitalisation are inserted based on the audio's prosody and sentence structure — no post-processing required. Transcripts are immediately readable, immediately ingestible by downstream NLP and LLM pipelines, and immediately suitable for human-facing displays. The feature can be disabled in configuration if your workflow requires raw output.
Yes. Leopard Speech-to-Text processes all audio on-device with no network connection required. It operates in air-gapped environments, remote field deployments, aircraft, vessels, classified networks, rural clinics, and any infrastructure where cloud APIs cannot reach or where data handling requirements prohibit audio transmission. The transcription quality is identical whether the device has an internet connection or not.
No. Leopard runs on standard CPU hardware: laptops, desktops, mobile devices, servers, and embedded platforms, including Raspberry Pi 3/4/5. No GPU, no dedicated AI accelerator, and no special runtime required. This is one of Leopard's core architectural differentiators against alternatives like Whisper Small and Medium, which require a GPU for production throughput, and against Whisper derivatives, which lean on accelerated hardware such as the Apple Neural Engine.
Yes. While Leopard is designed for on-device deployment, it can also run in private, public, or hybrid cloud environments. The deployment decision is yours — audio stays on whatever infrastructure you choose to run Leopard on, not on Picovoice's servers. Tutorials are available for serverless speech-to-text with AWS Lambda and transcription microservice with gRPC.
Leopard supports seven audio file formats: 3gp (AMR), FLAC, MP3, MP4/m4a (AAC), Ogg, WAV, and WebM.
Please visit the Leopard Speech-to-Text documentation, such as the Leopard Speech-to-Text Python API, to learn more.
Leopard supports eight production-grade languages: English, French, German, Italian, Japanese, Korean, Portuguese, and Spanish. The open-source speech-to-text benchmark publishes accuracy figures for six of them (English, French, German, Italian, Portuguese, Spanish), and Leopard outperforms Whisper Base across every non-English language it supports.
Contact sales to discuss your commercial requirements. Picovoice regularly trains new languages for enterprise customers with sufficient deployment scale.
Contact sales to tell us about your commercial endeavor and ask about speech-to-text language support.
Picovoice docs, blog, Medium posts, and GitHub are great resources to learn about voice AI, Picovoice technology, and how to start building transcription products. Enterprise customers get dedicated support specific to their applications from Picovoice Product & Engineering teams. Reach out to your Picovoice contact or contact sales to discuss support options.