Leopard Speech-to-Text

Offline speech-to-text for confidential meetings, interviews, and analysis

Convert audio and video files to text on-device with speaker diarization, word-level confidence, and timestamps. About 12× more efficient than Whisper Base at comparable English accuracy, with higher accuracy available through custom vocabulary for names, jargon, and product terms.

Core-Hour Ratio: 0.026 (vs. 0.32 Whisper Base and 0.16 Whisper Tiny)
English WER: 9.7% (vs. 9.5% Whisper Base and 12.7% Whisper Tiny)
German WER: 14.5% (vs. 23.6% Whisper Base and 13.4% Google STT)

What is Leopard Speech-to-Text?

Only on-device speech-to-text available across every platform

Leopard Speech-to-Text converts audio and video recordings into accurate transcripts with embedded speaker diarization, custom vocabulary, word-level timestamps, word-level confidence scores, and automatic punctuation — without sending a single byte of audio to any server. It is the only production-ready speech-to-text SDK that runs genuinely on-device across every major platform a product ships on.

Cloud STT APIs deliver strong accuracy but require audio to leave the device on every inference. On-premise containerised alternatives run in Docker on multi-core CPU or GPU servers, which are not feasible for laptop, web, mobile, or embedded deployment. Leopard is built end-to-end by Picovoice with a proprietary training framework and inference engine. No Whisper, PyTorch, ONNX, or TensorFlow dependencies. Leopard runs on standard CPU hardware at compute costs that make embedded and mobile deployment economically viable, while delivering the enterprise feature set of a cloud transcription API, including speaker diarization, custom vocabulary, timestamps, confidence, and punctuation, in a single on-device SDK.

Leopard Speech-to-Text is well-suited to transcribing recordings of board meetings, interviews, internal strategy discussions, patient consultations, and legal depositions, where conversations include confidential and personally identifiable information as well as industry- or company-specific jargon. It produces accurate, structured, speaker-labelled transcripts with domain-specific vocabulary on the device where the recording is stored.

Developer Experience

Add on-device speech-to-text in minutes

Leopard Speech-to-Text takes an audio file and returns a transcript with word-level timestamps, confidence scores, automatic punctuation and truecasing, and optional speaker labels in one SDK call. No cloud authentication, no container to provision. Use Leopard Speech-to-Text with its native SDKs for Android, C, .NET, Flutter, iOS, Java, NodeJS, Python, React, React Native, and Web.
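
As a rough illustration of the single call described above, here is a minimal Python sketch. It assumes the `pvleopard` package is installed and that you hold a Picovoice AccessKey. The `create`, `process_file`, and `delete` calls reflect the Python SDK as we understand it, so check the current API reference before relying on them. `transcribe` and `summarize` are hypothetical helper names, and the local `Word` tuple is only a stand-in that mirrors the per-word fields described on this page so the summary helper can be tried without the SDK.

```python
from typing import NamedTuple, Sequence, Tuple


class Word(NamedTuple):
    """Stand-in for per-word metadata (field names are an assumption)."""
    word: str
    start_sec: float
    end_sec: float
    confidence: float
    speaker_tag: int


def transcribe(audio_path: str, access_key: str) -> Tuple[str, Sequence]:
    """Transcribe one file on-device. Requires `pip install pvleopard`."""
    import pvleopard  # imported lazily so the rest of the sketch runs without the SDK

    leopard = pvleopard.create(
        access_key=access_key,
        enable_automatic_punctuation=True,
        enable_diarization=True,
    )
    try:
        # Returns the transcript string and a sequence of per-word metadata.
        return leopard.process_file(audio_path)
    finally:
        leopard.delete()  # release native resources


def summarize(transcript: str, words: Sequence[Word]) -> str:
    """One-line summary: word count, time span covered, and mean confidence."""
    if not words:
        return "0 words"
    duration = words[-1].end_sec - words[0].start_sec
    mean_conf = sum(w.confidence for w in words) / len(words)
    return f"{len(words)} words, {duration:.2f}s, mean confidence {mean_conf:.2f}"
```

Typical usage would be `transcript, words = transcribe("meeting.wav", access_key)` followed by `print(summarize(transcript, words))`, where the file name is a placeholder.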

Open-Source Speech-to-Text Benchmark

Comparable or better accuracy than Whisper Base, 12x more efficient

Leopard Speech-to-Text is benchmarked against Amazon, Azure, Google, IBM Watson, and every Whisper size across six languages on Word Error Rate (WER) and Core-Hour Ratio. Leopard outperforms Whisper Base and Whisper Tiny in every benchmarked non-English language (French, German, Spanish, Italian, and Portuguese) while being 12x more efficient than Whisper Base and 6x more efficient than Whisper Tiny.

English WER (lower is better)
Amazon Transcribe: 4.3%
Azure STT: 5.5%
Whisper Large: 5.7%

Spanish WER (lower is better)
Amazon Transcribe: 5.3%
Whisper Large: 5.5%
Whisper Medium: 6.9%

German WER (lower is better)
Whisper Large: 7.4%
Amazon: 7.6%
Azure: 8.5%

Core-Hour Ratio (lower is better)
Picovoice Leopard: 0.026
Whisper Tiny: 0.16
Whisper Base: 0.32

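
For readers unfamiliar with the metric, Word Error Rate is the word-level edit distance between a reference transcript and the engine's output, divided by the number of reference words. The sketch below is a generic implementation for illustration only. The function name is hypothetical, and the lowercase-and-split normalisation is an assumption that may differ from the one used in the published benchmark.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Two-row dynamic programme for word-level Levenshtein distance.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

With one substitution and one deletion against a six-word reference, the function returns 2/6, or about 33% WER.
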
Ready to integrate? Check our docs to start building or talk to the sales team about enterprise deployment.
Capabilities

Why enterprises choose Leopard Speech-to-Text

Leopard Speech-to-Text delivers the feature set of an enterprise cloud transcription API (speaker diarization, custom vocabulary, word-level timestamps and confidence, automatic punctuation) on-device across every platform. It supports eight languages tuned for production and narrows the accuracy gap to cloud APIs in non-English languages to under 3 points for most of them.

01 Custom vocabulary
Leopard Speech-to-Text offers custom vocabulary through the self-service developer console. Product teams can add domain-specific terms, product names, medical terminology, legal phrases, or proper nouns, and deploy in minutes. No labelled audio data or machine learning expertise is required.

02 Keyword boosting
Beyond adding vocabulary, product teams can increase the probability weight of specific keywords that appear frequently in conversations. Sales call analytics platforms hear "renewal", "proposal", and "churn" frequently and don't want to miss them. Boosting these keywords reduces misrecognition for exactly the terms that matter most.

03 Programmatic custom vocabulary API
The Leopard Speech-to-Text Custom Model API lets both developers and end users add custom vocabulary and boost keywords from any device via the cloud API. This allows product teams to build a custom vocabulary UI inside their own product, route it to the Picovoice backend via API, and deploy model updates instantaneously without a Console visit.

04 Dedicated Model Training
Picovoice researchers optimize transcription models further for any domain, hardware, acoustic environment, speaker accent, or noise condition through Non-Recurring Engineering (NRE) engagements.

05 Efficient: 12× less compute than Whisper Base
Leopard Speech-to-Text processes 1 hour of audio using 0.026 core-hours: 6× less than Whisper Tiny (0.16), 12× less than Whisper Base (0.32), 38× less than Whisper Small (0.99), and 58× less than Whisper Medium (1.52). Built end-to-end on a proprietary training framework and inference engine with no open-source dependency, Leopard Speech-to-Text makes real on-device deployment possible without sacrificing accuracy. Leopard runs on the CPU already in a phone, a Raspberry Pi, or a laptop, without offloading compute to a cloud or a dedicated accelerator.

06 CPU-only
Leopard Speech-to-Text runs on CPU alone, with no GPU, neural accelerator, or special runtime required. Whisper Small and larger Whisper versions require GPU acceleration for production throughput, and other alternatives lean on hardware accelerators (NPUs), GPUs, or containers for scale. Leopard runs on standard CPU hardware: laptops, Android phones, iOS devices, Raspberry Pi, and server-class Intel, AMD, or ARM. For embedded products, battery-powered wearables, and cost-sensitive deployments, this is the difference between a viable product and one that requires custom silicon.

07 Embedded Speaker Diarization
Leopard Speech-to-Text includes embedded Falcon Speaker Diarization, which can be enabled with a simple configuration: enable_diarization = True. Leopard returns a speaker-labelled transcript in the same output: start time, end time, speaker tag, and text per segment. For meetings, interviews, depositions, and contact centre recordings, one SDK call produces a transcript ready for analytics, CRM ingestion, or LLM summarization.

08 Word-Level Timestamps
Along with the transcript, Leopard Speech-to-Text returns the start and end time of each transcribed word, not just of each segment. This enables captioning and subtitles, searchable audio archives, highlight generation, and precision editing. One transcript serves search, editing, and playback without post-processing.

09 Word-Level Confidence Scores
Every transcribed word ships with a confidence score between 0 and 1, showing Leopard Speech-to-Text's confidence in the accuracy of that word. This allows product teams to send low-confidence words to human review, highlight them in the UI, or route them to secondary validation, without re-auditing entire transcripts manually. It turns a raw transcript into a triaged document for contact centre QA, legal transcription, and medical documentation.

10 Automatic Punctuation and Truecasing
Leopard Speech-to-Text automatically applies commas, periods, question marks, and proper capitalisation based on prosody and sentence structure, without requiring post-processing or a separate punctuation model. Leopard returns transcripts that are immediately readable, ingestible by downstream NLP and LLM pipelines, and suitable for human-facing displays.

11 Non-English Accuracy
Leopard is within 1.1 points of Google STT in German and within 3 points in French and Spanish. Against every Whisper size that realistically runs on-device, Leopard leads across all five benchmarked non-English languages. For multilingual products that cannot send audio to the cloud (EU healthcare, legal across Europe and Latin America, government), Leopard delivers cloud-adjacent accuracy on the device with minimal compute requirements.

12 Cross-Platform
Leopard Speech-to-Text runs on every platform your product ships on, including Android, Chrome, Edge, Firefox, iOS, Linux, macOS, Raspberry Pi, Safari, and Windows, across AMD, Intel, NVIDIA, and Qualcomm hardware.

13 Compliance by architecture
Cloud STT APIs achieve HIPAA and GDPR compliance through Business Associate Agreements and contractual controls. Leopard Speech-to-Text is compliant by architecture: audio never leaves the device, so there is no data to regulate in transit or at rest outside the customer's infrastructure. For healthcare practices transcribing patient consultations, legal teams processing deposition recordings, defence applications handling classified audio, and investment firms recording confidential conversations, on-device processing is the only correct architecture that mitigates the risks.

14 Offline Processing
Leopard Speech-to-Text processes data offline without sending any audio to remote servers. It can operate in air-gapped environments, remote field deployments, aircraft, vessels, classified networks, rural clinics, and any infrastructure where cloud APIs cannot reach or where data handling requirements prohibit audio transmission to third-party servers. The transcription quality is identical whether the device has an internet connection or not.

15 Enterprise Ready
Leopard Speech-to-Text is production-grade and enterprise-ready. Picovoice offers flexible licensing, dedicated engineering support, NDA-protected custom model training, and SLA-backed response times for teams shipping at scale.
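
To make the captioning use case for word-level timestamps concrete, the sketch below groups timestamped words into SubRip (SRT) cues. It is an illustration under stated assumptions rather than part of the SDK: the `Word` tuple is a local stand-in with assumed field names, and `words_to_srt` with its fixed words-per-cue rule is a hypothetical helper.

```python
from typing import List, NamedTuple, Sequence


class Word(NamedTuple):
    """Stand-in for a transcribed word with timestamps in seconds."""
    word: str
    start_sec: float
    end_sec: float


def _srt_time(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    ms = int(round(seconds * 1000))
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: Sequence[Word], max_words: int = 7) -> str:
    """Build SRT text with at most `max_words` words per cue."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    cues: List[str] = []
    for index, start in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[start:start + max_words]
        text = " ".join(w.word for w in chunk)
        span = f"{_srt_time(chunk[0].start_sec)} --> {_srt_time(chunk[-1].end_sec)}"
        cues.append(f"{index}\n{span}\n{text}\n")
    return "\n".join(cues)
```

A production captioner would more likely break cues on punctuation or pauses. The fixed word count simply keeps the example short.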

Ship it.
On device.

On-device speech-to-text with cloud-level features. No cloud. No compromises.

FAQ

Common questions about speech-to-text

What is speech-to-text?

Speech-to-text, also known as automatic speech recognition (ASR) and open-domain large vocabulary speech recognition (LVSR), refers to the technology that converts spoken audio into written text. It is the core building block of transcription, voice assistants, meeting notes, contact centre analytics, and any application that needs to understand or record what was said.

Leopard Speech-to-Text is Picovoice's on-device batch transcription engine: it processes completed audio and video recordings and returns accurate transcripts with embedded speaker diarization, custom vocabulary, word-level timestamps, confidence scores, and automatic punctuation.

What is the difference between on-device, on-premise, and cloud speech-to-text?

The three terms are often used interchangeably, but they mean different things. Cloud speech-to-text sends audio over the internet to a vendor's servers for processing. On-premise speech-to-text runs on servers inside the customer's infrastructure, usually via Docker containers. On-device speech-to-text runs on the device where the voice originates, a phone, a laptop, a Raspberry Pi, or a web browser, without a dedicated server. On-device speech-to-text can run on-prem and in the cloud, but not the other way around.

Leopard Speech-to-Text is genuinely on-device and is the only commercial SDK that runs on any platform, including cloud, on-prem, mobile, embedded, and web.

What are the benefits of on-device speech-to-text over cloud APIs?

On-device speech-to-text eliminates four problems with cloud APIs: privacy exposure (audio leaves the device on every inference), cost at scale, network dependency (cloud APIs cannot operate offline, in air-gapped environments, or in poor-connectivity deployments), and round-trip latency (network plus cloud processing adds 200–500ms before any transcription returns). For regulated industries, embedded products, mobile applications, and any workload where audio cannot leave the device, on-device is the only viable architecture.

How does Leopard Speech-to-Text compare to Whisper?

OpenAI Whisper is the most widely used open-source speech-to-text model. On English, Leopard achieves 9.7% WER compared with 7.0% for Whisper Small and 6.1% for Whisper Medium, but requires dramatically less compute: 0.026 core-hours per hour of audio versus 0.99 for Whisper Small (a 38× gap) and 1.52 for Whisper Medium (a 58× gap). Against Whisper Base, Leopard achieves comparable English accuracy (9.7% WER vs. 9.5%) using 12x fewer resources (0.026 core-hours vs. 0.32). Whisper Tiny requires 6x more resources than Leopard yet makes 31% more errors (12.7% WER vs. 9.7%).
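
The multiples quoted above follow directly from the published figures. The short sketch below reproduces the arithmetic; the dictionary and function names are illustrative, and the numbers are the ones stated on this page.

```python
# Core-hours needed to process one hour of audio, as quoted on this page.
CORE_HOURS = {
    "Leopard": 0.026,
    "Whisper Tiny": 0.16,
    "Whisper Base": 0.32,
    "Whisper Small": 0.99,
    "Whisper Medium": 1.52,
}

# English word error rates (percent), as quoted on this page.
ENGLISH_WER = {"Leopard": 9.7, "Whisper Tiny": 12.7, "Whisper Base": 9.5}


def compute_multiple(engine: str, baseline: str = "Leopard") -> float:
    """How many times more compute `engine` needs than `baseline`."""
    return CORE_HOURS[engine] / CORE_HOURS[baseline]


def relative_error_increase(engine: str, baseline: str = "Leopard") -> float:
    """Percentage by which `engine` makes more errors than `baseline`."""
    return (ENGLISH_WER[engine] - ENGLISH_WER[baseline]) / ENGLISH_WER[baseline] * 100


if __name__ == "__main__":
    for name in ("Whisper Tiny", "Whisper Base", "Whisper Small", "Whisper Medium"):
        print(f"{name}: {compute_multiple(name):.1f}x the compute of Leopard")
    print(f"Whisper Tiny: {relative_error_increase('Whisper Tiny'):.0f}% more errors")
```

Rounded to whole numbers, the ratios come out at 6, 12, 38, and 58, and the Whisper Tiny error increase at 31%.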

On non-English languages, Leopard outperforms Whisper Base across all five benchmarked languages. Unlike Whisper, Leopard is built end-to-end on a proprietary inference engine with no PyTorch, ONNX, or TensorFlow runtime dependency, which is what enables deployment on mobile and embedded hardware where Whisper Small and larger cannot realistically run.

These base accuracy results can be improved by customizing Leopard for the target domain. Adding new vocabulary or boosting keywords with Leopard is as simple as typing the word, without requiring any coding experience, let alone machine learning expertise. Furthermore, Leopard Speech-to-Text comes with native SDKs for cross-platform deployment and enterprise support.

How does Leopard Speech-to-Text compare to Amazon Transcribe?

Amazon Transcribe achieves 4.3% WER on English — better than Leopard's 9.7% — but requires audio to be sent to AWS servers, with per-minute billing, regional availability constraints, and no offline operation.

For low-volume applications, PoCs, hobby projects, and when compliance is not a constraint, Amazon Transcribe is a better choice given its accuracy.

For regulated industries, air-gapped environments, mobile applications, high-volume applications, and embedded deployments, Leopard becomes the enterprise's choice over Amazon Transcribe.

How does Leopard Speech-to-Text compare to Azure Speech-to-Text?

Azure Speech-to-Text achieves 5.5% WER on English and is available either as a cloud API or as a Docker container for on-premise deployment. Both options require server-grade infrastructure and are cloud-dependent for licensing, even in the container case.

For on-prem deployments, low-volume applications, PoCs, hobby projects, and when compliance is not a constraint, Azure Speech-to-Text is a better choice given its accuracy.

For regulated industries, air-gapped environments, mobile applications, high-volume applications, and embedded deployments, Leopard becomes the enterprise's choice over Azure Speech-to-Text.

How does Leopard Speech-to-Text compare to Google Speech-to-Text?

Google Speech-to-Text achieves 8.9% WER on English — within 1 point of Leopard's 9.7%. On German, Leopard is within 1.1 points of Google (14.5% vs 13.4%). On French and Spanish, the gap is under 3.1 points. Google's API is cloud-only and requires audio transmission to Google's servers with per-minute billing.

For low-volume applications, PoCs, hobby projects, and when compliance is not a constraint, Google Speech-to-Text is a better choice given its accuracy and flexibility.

For regulated industries, air-gapped environments, mobile applications, high-volume applications, and embedded deployments, Leopard becomes the enterprise's choice over Google Speech-to-Text.

How do I choose the best speech-to-text for my project?

The best or right speech-to-text engine depends on three architectural questions.

  1. Where does the audio need to be processed?
    1. If it cannot leave the device (regulated industries, embedded products, air-gapped environments), the answer is on-device.
    2. If it can run on customer-owned servers but not in a vendor cloud, the answer is on-prem, using either a cloud vendor's containers or on-device STT.
    3. If audio can flow to the vendor cloud freely, cloud APIs lead in raw accuracy.
  2. What platforms does the product ship on? Cloud APIs work on any platform with internet. On-prem requires a server. On-device requires an SDK that supports your target platforms — Leopard is the only option that covers desktop, mobile, embedded, and web with a single SDK.
  3. What's the total cost of ownership over the product's lifetime? Cloud APIs charge per minute; on-device SDKs can be free to use but costly to maintain, as in the case of Whisper, or offer flexibility at scale, as in the case of Leopard. At scale, the math shifts decisively toward on-device.
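
The cost question in point 3 can be framed as a simple break-even calculation. The sketch below is only a way to structure that comparison. Every price in it is a hypothetical input you would replace with your own quotes. The only figure taken from this page is the 0.026 core-hours per hour of audio, and the function names are illustrative.

```python
from typing import Optional


def cloud_cost(audio_hours: float, price_per_minute: float) -> float:
    """Per-minute cloud billing for a given volume of audio."""
    return audio_hours * 60.0 * price_per_minute


def on_device_cost(
    audio_hours: float,
    licence_fee: float,
    price_per_core_hour: float,
    core_hours_per_audio_hour: float = 0.026,  # figure quoted on this page
) -> float:
    """Fixed licence fee plus the compute used to transcribe the audio."""
    return licence_fee + audio_hours * core_hours_per_audio_hour * price_per_core_hour


def break_even_audio_hours(
    price_per_minute: float,
    licence_fee: float,
    price_per_core_hour: float,
    core_hours_per_audio_hour: float = 0.026,
) -> Optional[float]:
    """Audio volume at which both options cost the same, or None if never."""
    saving_per_hour = (
        60.0 * price_per_minute - core_hours_per_audio_hour * price_per_core_hour
    )
    if saving_per_hour <= 0:
        return None  # cloud is never more expensive under these inputs
    return licence_fee / saving_per_hour
```

Calling `break_even_audio_hours(price_per_minute=0.02, licence_fee=1000.0, price_per_core_hour=0.05)` with these made-up prices shows the shape of the trade-off: below the returned volume the cloud API is cheaper, above it the on-device option is.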

Learn more about selecting the best speech-to-text.

Is Leopard Speech-to-Text GDPR, HIPAA, or CJIS compliant?

Yes, Leopard Speech-to-Text processes audio entirely on-device and never transmits it to any server. Leopard Speech-to-Text is compliant with GDPR, HIPAA, CCPA, CJIS, and SOC 2 by architecture, not policy. Cloud speech-to-text providers achieve regulatory compliance through contractual controls and Business Associate Agreements; Leopard achieves it architecturally. There is no audio to regulate in transit or at rest outside the customer's infrastructure, because no audio ever leaves it. Picovoice cannot access end-user audio.

Does Leopard Speech-to-Text support real-time transcription?

Leopard Speech-to-Text doesn't, but Cheetah Streaming Speech-to-Text does. Cheetah is Picovoice's on-device streaming speech-to-text engine that provides text output in real time.

Does Leopard Speech-to-Text support speaker diarization?

Yes. Leopard Speech-to-Text includes an optimized, embedded version of Falcon Speaker Diarization. Once enabled with a single-line configuration, Leopard returns a speaker-labelled transcript in the same output: start time, end time, speaker tag, and transcribed text per segment. There is no separate engine to integrate and no pipeline to architect. For real-time streaming speaker diarization, Bluebird Streaming Speaker Diarization pairs with Cheetah Streaming Speech-to-Text.

Check Leopard Speech-to-Text documentation for more information.
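
As an illustration of what a speaker-labelled transcript looks like downstream, the sketch below merges consecutive words that share a speaker tag into segments with a start time, end time, speaker tag, and text. It assumes each word carries a speaker tag alongside its timestamps. The `Word` and `Segment` tuples and the `group_by_speaker` helper are hypothetical stand-ins, not SDK types.

```python
from typing import List, NamedTuple, Sequence


class Word(NamedTuple):
    """Stand-in for a transcribed word with a speaker tag (assumed fields)."""
    word: str
    start_sec: float
    end_sec: float
    speaker_tag: int


class Segment(NamedTuple):
    speaker_tag: int
    start_sec: float
    end_sec: float
    text: str


def group_by_speaker(words: Sequence[Word]) -> List[Segment]:
    """Merge runs of consecutive words from the same speaker into segments."""
    segments: List[Segment] = []
    for w in words:
        if segments and segments[-1].speaker_tag == w.speaker_tag:
            last = segments[-1]
            # Same speaker keeps talking: extend the current segment.
            segments[-1] = last._replace(end_sec=w.end_sec, text=f"{last.text} {w.word}")
        else:
            # Speaker change (or first word): start a new segment.
            segments.append(Segment(w.speaker_tag, w.start_sec, w.end_sec, w.word))
    return segments
```

Each resulting segment can then be rendered as a line such as `Speaker 1 [0.0s-0.9s]: Good morning.` for review or for ingestion into another system.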

Does Leopard Speech-to-Text support custom vocabulary?

Yes. Leopard supports domain-specific custom vocabulary through the Picovoice Console. Developers add product names, medical terminology, legal phrases, technical jargon, or proper nouns that general-purpose models routinely misrecognise simply by typing the words and phrases; no labelled audio data is needed.

For selected enterprise customers, two further options are available: the Leopard Speech-to-Text API, which lets both developers and end users add custom vocabulary and boost keywords from any device via a cloud API without visiting the Picovoice Console, and custom models delivered through professional services.

Does Leopard Speech-to-Text return word-level timestamps and confidence scores?

Yes. Along with the transcript, Leopard Speech-to-Text returns metadata for each transcribed word that includes:

  • Start Time: Indicates when the word started in the transcribed audio. Value is in seconds.
  • End Time: Indicates when the word ended in the transcribed audio. Value is in seconds.
  • Confidence: Leopard Speech-to-Text's confidence that the transcribed word is accurate. It is a number within [0, 1].

Word-level timestamps enable media synchronisation for captioning, searchable audio archives, highlight generation, and precision editing. Confidence scores support human-in-the-loop review workflows — flagging low-confidence words for verification without manually re-auditing entire transcripts. Please visit the Leopard Speech-to-Text SDK documentation, such as the Leopard Speech-to-Text Python SDK, to learn more.
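
The human-in-the-loop workflow described above can be sketched in a few lines. The example below marks words whose confidence falls under a threshold so a reviewer can jump straight to them. The `Word` tuple is a local stand-in, the helper names are hypothetical, and the 0.6 threshold is an arbitrary illustrative value that would need tuning on real transcripts.

```python
from typing import List, NamedTuple, Sequence


class Word(NamedTuple):
    """Stand-in for a transcribed word with a confidence in [0, 1]."""
    word: str
    confidence: float


def flag_for_review(words: Sequence[Word], threshold: float = 0.6) -> List[Word]:
    """Return the words whose confidence is below the review threshold."""
    return [w for w in words if w.confidence < threshold]


def render_with_flags(words: Sequence[Word], threshold: float = 0.6) -> str:
    """Render the transcript, wrapping low-confidence words in [[ ]]."""
    return " ".join(
        f"[[{w.word}]]" if w.confidence < threshold else w.word for w in words
    )
```

A review UI could use the same rule to highlight words instead of bracketing them.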

Does Leopard Speech-to-Text perform automatic punctuation and truecasing?

Yes. Leopard applies automatic punctuation and truecasing to transcripts by default. Commas, periods, question marks, and proper capitalisation are inserted based on the audio's prosody and sentence structure — no post-processing required. Transcripts are immediately readable, immediately ingestible by downstream NLP and LLM pipelines, and immediately suitable for human-facing displays. The feature can be disabled in configuration if your workflow requires raw output.

Does Leopard Speech-to-Text work offline and in air-gapped environments?

Yes. Leopard Speech-to-Text processes all audio on-device with no network connection required. It operates in air-gapped environments, remote field deployments, aircraft, vessels, classified networks, rural clinics, and any infrastructure where cloud APIs cannot reach or where data handling requirements prohibit audio transmission. The transcription quality is identical whether the device has an internet connection or not.

Does Leopard Speech-to-Text require a GPU?

No. Leopard runs on standard CPU hardware: laptops, desktops, mobile devices, servers, and embedded platforms, including Raspberry Pi 3/4/5. No GPU, dedicated AI accelerator, or special runtime is required. This is one of Leopard's core architectural differentiators against alternatives such as Whisper Small and Medium, which require a GPU for production throughput, and Whisper derivatives, which lean on accelerated hardware such as the Apple Neural Engine.

Can I use Leopard Speech-to-Text in the cloud?

Yes. While Leopard is designed for on-device deployment, it can also run in private, public, or hybrid cloud environments. The deployment decision is yours — audio stays on whatever infrastructure you choose to run Leopard on, not on Picovoice's servers. Tutorials are available for serverless speech-to-text with AWS Lambda and transcription microservice with gRPC.

What audio formats does Leopard Speech-to-Text support?

Leopard supports seven audio file formats: 3gp (AMR), FLAC, MP3, MP4/m4a (AAC), Ogg, WAV, and WebM.

Please visit the Leopard Speech-to-Text documentation, such as the Leopard Speech-to-Text Python API, to learn more.
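
When batch-processing a folder of recordings, it can help to screen files before handing them to the engine. The sketch below does this by file extension only. The mapping from the formats listed above to extensions is an assumption, an extension says nothing about the codec inside the container, and the helper names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Extensions assumed to correspond to the seven formats listed above.
SUPPORTED_EXTENSIONS = {".3gp", ".flac", ".mp3", ".mp4", ".m4a", ".ogg", ".wav", ".webm"}


def is_supported(path: str) -> bool:
    """True if the file extension matches one of the listed formats."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def partition_files(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split paths into (likely supported, needs conversion first)."""
    accepted: List[str] = []
    rejected: List[str] = []
    for p in paths:
        (accepted if is_supported(p) else rejected).append(p)
    return accepted, rejected
```

Files in the second list would need converting to one of the listed formats before transcription.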

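Which platforms does Leopard Speech-to-Text support?

Leopard Speech-to-Text runs on Android, iOS, Linux, macOS, Windows, Raspberry Pi, and the web (Chrome, Edge, Firefox, and Safari), with native SDKs for Android, C, .NET, Flutter, iOS, Java, NodeJS, Python, React, React Native, and Web.
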
Which languages does Leopard Speech-to-Text support?

Leopard supports eight production-grade languages: English, French, German, Italian, Japanese, Korean, Portuguese, and Spanish. The open-source speech-to-text benchmark publishes accuracy figures for six of them (English, French, German, Italian, Portuguese, Spanish), and Leopard outperforms Whisper Base in every benchmarked non-English language.

What if I need a language Leopard doesn't currently support?

Contact sales to discuss your commercial requirements. Picovoice regularly trains new languages for enterprise customers with sufficient deployment scale.

What should I do if I need support for other languages?

Contact sales to tell us about your commercial endeavor and ask for speech-to-text language support.

How do I get technical support for Leopard Speech-to-Text?

Picovoice docs, blog, Medium posts, and GitHub are great resources to learn about voice AI, Picovoice technology, and how to start building transcription products. Enterprise customers get dedicated support specific to their applications from Picovoice Product & Engineering teams. Reach out to your Picovoice contact or contact sales to discuss support options.

How can I get informed about updates and upgrades?

Version changes appear in the and LinkedIn. Subscribing to GitHub is the best way to get notified of patch releases. If you enjoy building with Leopard Speech-to-Text, show it by giving a GitHub star!