Convert audio and video files to text on-device with speaker diarization, word-level confidence, and timestamps. 12× more efficient than Whisper at comparable accuracy, and more accurate still with custom vocabulary for names, jargon, and product terms.
Leopard Speech-to-Text converts audio and video recordings into accurate transcripts with embedded speaker diarization, custom vocabulary, word-level timestamps, word-level confidence scores, and automatic punctuation — without sending a single byte of audio to any server. It is the only production-ready speech-to-text SDK that runs genuinely on-device across every major platform a product ships on.
Cloud STT APIs deliver strong accuracy but require audio to leave the device on every inference. On-premise containerised alternatives run in Docker on multi-core CPU or GPU servers, which are not feasible for laptop, web, mobile, or embedded deployment. Leopard is built end-to-end by Picovoice with a proprietary training framework and inference engine. No Whisper, PyTorch, ONNX, or TensorFlow dependencies. Leopard runs on standard CPU hardware at compute costs that make embedded and mobile deployment economically viable, while delivering the enterprise feature set of a cloud transcription API, including speaker diarization, custom vocabulary, timestamps, confidence, and punctuation, in a single on-device SDK.
Leopard Speech-to-Text is well-suited to transcribing recordings of board meetings, interviews, internal strategy discussions, patient consultations, and legal depositions, in which conversations include confidential and personally identifiable information as well as industry- or company-specific jargon. The result is accurate, structured, speaker-labelled transcripts with domain-specific vocabulary, produced on the device where the recording is stored.
Leopard Speech-to-Text takes an audio file and returns a transcript with word-level timestamps, confidence scores, automatic punctuation and truecasing, and optional speaker labels in one SDK call. No cloud authentication, no container to provision. Use Leopard Speech-to-Text with its native SDKs for Android, C, .NET, Flutter, iOS, Java, Node.js, Python, React, React Native, and Web.
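In Python, that one SDK call looks roughly like the sketch below. It assumes the pvleopard package and a valid Picovoice AccessKey supplied by the caller; the import is kept inside the function so the sketch loads even before the package is installed:

```python
def transcribe(audio_path: str, access_key: str):
    """Transcribe a recorded file with Leopard and return the transcript
    plus per-word metadata (start/end seconds, confidence, speaker tag)."""
    # Imported lazily so this sketch can be read and loaded without
    # `pip install pvleopard`.
    import pvleopard

    leopard = pvleopard.create(
        access_key=access_key,
        enable_automatic_punctuation=True,  # punctuation and truecasing
        enable_diarization=True,            # per-word speaker tags
    )
    try:
        # Returns the full transcript string and a list of word metadata.
        transcript, words = leopard.process_file(audio_path)
        return transcript, words
    finally:
        leopard.delete()  # release native resources
```

The same create-process-delete shape applies across the other SDKs, with platform-native naming.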
Leopard Speech-to-Text is benchmarked against Amazon, Azure, Google, IBM Watson, and every Whisper size across six languages on Word Error Rate (WER) and core-hour compute cost. Leopard outperforms Whisper Base and Whisper Tiny in every non-English language it supports (French, German, Spanish, Italian, and Portuguese), despite being 12× more efficient than Whisper Base and 6× more efficient than Whisper Tiny.
Leopard Speech-to-Text delivers the feature set of an enterprise cloud transcription API (speaker diarization, custom vocabulary, word-level timestamps and confidence, automatic punctuation) on-device across every platform. It supports eight languages, benchmarked and tuned for production, and closes the accuracy gap to cloud APIs on non-English languages to under 3 points in most cases.
Set enable_diarization to true and Leopard returns a speaker-labelled transcript in the same output: start time, end time, speaker tag, and text per segment. For meetings, interviews, depositions, and contact centre recordings, one SDK call produces a transcript ready for analytics, CRM ingestion, or LLM summarization. On-device speech-to-text with cloud-level features. No cloud. No compromises.
Speech-to-text, also known as automatic speech recognition (ASR) and open-domain large vocabulary speech recognition (LVSR), refers to the technology that converts spoken audio into written text. It is the core building block of transcription, voice assistants, meeting notes, contact centre analytics, and any application that needs to understand or record what was said.
Leopard Speech-to-Text is Picovoice's on-device batch transcription engine: it processes completed audio and video recordings and returns accurate transcripts with embedded speaker diarization, custom vocabulary, word-level timestamps, confidence scores, and automatic punctuation.
Cloud, on-premise, and on-device speech-to-text are often used interchangeably, but they mean different things. Cloud speech-to-text sends audio over the internet to a vendor's servers for processing. On-premise speech-to-text runs on servers inside the customer's infrastructure, usually via Docker containers. On-device speech-to-text runs on the device where the voice originates (a phone, a laptop, a Raspberry Pi, or a web browser) without a dedicated server. On-device speech-to-text can run on-prem and in the cloud, but not the other way around.
Leopard Speech-to-Text is genuinely on-device and is the only commercial SDK that runs on any platform, including cloud, on-prem, mobile, embedded, and web.
On-device speech-to-text eliminates four problems with cloud APIs: privacy exposure (audio leaves the device on every inference), cost at scale, network dependency (cloud APIs cannot operate offline, in air-gapped environments, or in poor-connectivity deployments), and round-trip latency (network plus cloud processing adds 200–500ms before any transcription returns). For regulated industries, embedded products, mobile applications, and any workload where audio cannot leave the device, on-device is the only viable architecture.
OpenAI Whisper is the most widely used open-source speech-to-text model. Leopard achieves 9.7% WER on English compared to Whisper Small at 7.0% and Whisper Medium at 6.1%, but requires dramatically less compute: Leopard's CPU requirement is 0.026 core-hour versus 0.99 for Whisper Small (a 38× gap) and 1.52 for Whisper Medium (a 58× gap). Against Whisper Base, Leopard achieves comparable English accuracy (9.7% WER vs. 9.5%) using 12× less compute (0.026 core-hour vs. 0.32). Whisper Tiny, despite requiring 6× more resources than Leopard, makes 31% more errors (12.7% WER vs. Leopard's 9.7%).
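For reference, Word Error Rate is the word-level edit distance (substitutions + insertions + deletions) between a reference transcript and the engine's output, divided by the number of reference words. A minimal illustration of the metric, not the benchmark implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a three-word reference yields a WER of 1/3, i.e. 33.3%.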
On non-English languages, Leopard outperforms Whisper Base across all five benchmarked languages. Unlike Whisper, Leopard is built end-to-end on a proprietary inference engine with no PyTorch, ONNX, or TensorFlow runtime dependency, which is what enables deployment on mobile and embedded hardware where Whisper Small and larger cannot realistically run.
These base accuracy results can be improved by customizing Leopard for the target domain. Adding new vocabulary or boosting keywords with Leopard is as simple as typing the word, without requiring any coding experience, let alone machine learning expertise. Furthermore, Leopard Speech-to-Text comes with native SDKs for cross-platform deployment and enterprise support.
Amazon Transcribe achieves 4.3% WER on English — better than Leopard's 9.7% — but requires audio to be sent to AWS servers, with per-minute billing, regional availability constraints, and no offline operation.
For low-volume applications, PoCs, and hobby projects where compliance is not a constraint, Amazon Transcribe is a better choice given its accuracy.
For regulated industries, air-gapped environments, mobile applications, high-volume applications, and embedded deployments, Leopard is the enterprise choice over Amazon Transcribe.
Azure Speech-to-Text achieves 5.5% WER on English and is available either as a cloud API or as a Docker container for on-premise deployment. Both options require server-grade infrastructure and are cloud-dependent for licensing, even in the container case.
For on-prem deployments, low-volume applications, PoCs, and hobby projects where compliance is not a constraint, Azure Speech-to-Text is a better choice given its accuracy.
For regulated industries, air-gapped environments, mobile applications, high-volume applications, and embedded deployments, Leopard is the enterprise choice over Azure Speech-to-Text.
Google Speech-to-Text achieves 8.9% WER on English — within 1 point of Leopard's 9.7%. On German, Leopard is within 1.1 points of Google (14.5% vs 13.4%). On French and Spanish, the gap is under 3.1 points. Google's API is cloud-only and requires audio transmission to Google's servers with per-minute billing.
For low-volume applications, PoCs, and hobby projects where compliance is not a constraint, Google Speech-to-Text is a better choice given its accuracy and flexibility.
For regulated industries, air-gapped environments, mobile applications, high-volume applications, and embedded deployments, Leopard is the enterprise choice over Google Speech-to-Text.
The best speech-to-text engine depends on three architectural questions: whether audio may leave the device, what accuracy the target domain requires, and what inference costs at deployment scale.
Learn more about selecting the best speech-to-text.
Yes, Leopard Speech-to-Text processes audio entirely on-device and never transmits it to any server. Leopard Speech-to-Text is compliant with GDPR, HIPAA, CCPA, CJIS, and SOC 2 by architecture, not policy. Cloud speech-to-text providers achieve regulatory compliance through contractual controls and Business Associate Agreements; Leopard achieves it architecturally. There is no audio to regulate in transit or at rest outside the customer's infrastructure, because no audio ever leaves it. Picovoice cannot access end-user audio.
Leopard Speech-to-Text doesn't, but Cheetah Streaming Speech-to-Text does. Cheetah is Picovoice's on-device streaming speech-to-text engine that provides text output in real time.
Yes. Leopard Speech-to-Text embeds an optimized version of Falcon Speaker Diarization. Once enabled with a single configuration flag, Leopard returns a speaker-labelled transcript in the same output: start time, end time, speaker tag, and transcribed text per segment. No separate engine to integrate, no pipeline to architect. For real-time streaming speaker diarization, Bluebird Streaming Speaker Diarization pairs with Cheetah Streaming Speech-to-Text.
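Leopard's diarized output is per-word; turning it into the speaker-labelled segments described above is a simple fold over consecutive speaker tags. A sketch on illustrative data, with a small Word type standing in for the SDK's word metadata:

```python
from dataclasses import dataclass

@dataclass
class Word:
    """Stands in for the SDK's per-word metadata (illustrative, not the SDK type)."""
    word: str
    start_sec: float
    end_sec: float
    speaker_tag: int

def to_segments(words):
    """Merge consecutive same-speaker words into
    (start, end, speaker, text) segments."""
    segments = []
    for w in words:
        if segments and segments[-1]["speaker"] == w.speaker_tag:
            # Same speaker keeps talking: extend the current segment.
            segments[-1]["end"] = w.end_sec
            segments[-1]["text"] += " " + w.word
        else:
            segments.append({"start": w.start_sec, "end": w.end_sec,
                             "speaker": w.speaker_tag, "text": w.word})
    return segments

# Illustrative two-speaker exchange.
words = [
    Word("hello", 0.1, 0.4, 1),
    Word("there", 0.5, 0.8, 1),
    Word("hi", 1.2, 1.4, 2),
]
segments = to_segments(words)
```

Each resulting segment carries exactly the fields named above, ready for analytics or summarization pipelines.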
Check Leopard Speech-to-Text documentation for more information.
Yes. Leopard supports domain-specific custom vocabulary through the Picovoice Console, allowing developers to add product names, medical terminology, legal phrases, technical jargon, or proper nouns that general-purpose models routinely misrecognise. Adding a term is as simple as typing it; no labelled audio data is required.
For selected enterprise customers, two further options are available: the Leopard Speech-to-Text API, which lets both developers and end users add custom vocabulary and boost keywords from any device via cloud API without visiting the Picovoice Console, and custom models built via professional services.
Yes. Along with the transcript, Leopard Speech-to-Text returns metadata for each transcribed word, including its start and end timestamps, a confidence score, and, when diarization is enabled, a speaker tag.
Word-level timestamps enable media synchronisation for captioning, searchable audio archives, highlight generation, and precision editing. Confidence scores support human-in-the-loop review workflows — flagging low-confidence words for verification without manually re-auditing entire transcripts. Please visit the Leopard Speech-to-Text SDK documentation, such as the Leopard Speech-to-Text Python SDK, to learn more.
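A human-in-the-loop pass over that metadata can be sketched as a filter on per-word confidence; the words, scores, and threshold below are illustrative, not SDK output:

```python
def flag_for_review(words, threshold=0.6):
    """Return (index, word, confidence) for words whose confidence falls
    below the threshold, so reviewers can jump straight to uncertain spans."""
    return [(i, w, c) for i, (w, c) in enumerate(words) if c < threshold]

# Illustrative (word, confidence) pairs; "Xylotran" is a made-up product
# name of the kind custom vocabulary exists to fix.
words = [("quarterly", 0.98), ("revenue", 0.95),
         ("Xylotran", 0.41), ("grew", 0.97)]
flagged = flag_for_review(words)
```

Only the low-confidence word is surfaced, so a reviewer verifies one token instead of re-auditing the whole transcript.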
Yes. Leopard applies automatic punctuation and truecasing to transcripts by default. Commas, periods, question marks, and proper capitalisation are inserted based on the audio's prosody and sentence structure — no post-processing required. Transcripts are immediately readable, immediately ingestible by downstream NLP and LLM pipelines, and immediately suitable for human-facing displays. The feature can be disabled in configuration if your workflow requires raw output.
Yes. Leopard Speech-to-Text processes all audio on-device with no network connection required. It operates in air-gapped environments, remote field deployments, aircraft, vessels, classified networks, rural clinics, and any infrastructure where cloud APIs cannot reach or where data handling requirements prohibit audio transmission. The transcription quality is identical whether the device has an internet connection or not.
No. Leopard runs on standard CPU hardware: laptops, desktops, mobile devices, servers, and embedded platforms, including Raspberry Pi 3/4/5. No GPU, no dedicated AI accelerator, and no special runtime required. This is one of Leopard's core architectural differentiators against alternatives like Whisper Small and Medium, which require a GPU for production throughput, and against Whisper derivatives, which lean on accelerated hardware such as the Apple Neural Engine.
Yes. While Leopard is designed for on-device deployment, it can also run in private, public, or hybrid cloud environments. The deployment decision is yours — audio stays on whatever infrastructure you choose to run Leopard on, not on Picovoice's servers. Tutorials are available for serverless speech-to-text with AWS Lambda and transcription microservice with gRPC.
Leopard supports seven audio file formats: 3gp (AMR), FLAC, MP3, MP4/m4a (AAC), Ogg, WAV, and WebM.
Please visit the Leopard Speech-to-Text documentation, such as the Leopard Speech-to-Text Python API, to learn more.
Leopard supports eight production-grade languages: English, French, German, Italian, Japanese, Korean, Portuguese, and Spanish. The open-source speech-to-text benchmark publishes accuracy figures for six of them (English, French, German, Italian, Portuguese, Spanish), and Leopard outperforms Whisper Base across every non-English language it supports.
Contact sales to discuss your commercial requirements. Picovoice regularly trains new languages for enterprise customers with sufficient deployment scale.
Contact sales to tell us about your commercial endeavor and ask about speech-to-text language support.
Picovoice docs, blog, Medium posts, and GitHub are great resources to learn about voice AI, Picovoice technology, and how to start building transcription products. Enterprise customers get dedicated support specific to their applications from Picovoice Product & Engineering teams. Reach out to your Picovoice contact or contact sales to discuss support options.