picoInference On-Device AI Inference Engine

Cross-platform on-device AI inference for voice, language, and vision models

Purpose-built on-device AI inference engine that efficiently runs voice, language, and vision models across embedded, mobile, web, and desktop applications, outperforming server-repurposed runtimes like PyTorch Mobile and ONNX Runtime.

What it is
picoInference is a purpose-built on-device AI inference engine that is optimized end-to-end for both the model and the platforms it runs on, delivering low-latency, cross-platform performance from MCUs to browsers.
What it replaces
picoInference replaces generic on-device runtimes like PyTorch Mobile, TensorFlow Lite, and ONNX Runtime, removing the conversion, tuning, and per-platform glue code that those frameworks require.
What you get
A purpose-built inference runtime bundled with each Picovoice SDK on GitHub, NuGet, npm, PyPI, Maven, and CocoaPods. No model conversion. No cloud. No maintenance. No ML experience needed.
WHAT IS THE PICOINFERENCE ON-DEVICE AI INFERENCE ENGINE?

On-device AI inference engine for every AI model class and platform

Generic on-device inference engines like TensorFlow Lite, PyTorch Mobile, ExecuTorch, and ONNX Runtime are designed to execute any model. That generality has a cost: every deployment requires conversion, calibration, and per-platform tuning to extract real performance, and even then, the result is rarely as fast as a runtime built specifically for the model in question.
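For contrast, here is a minimal sketch of the conversion step a generic runtime typically requires: exporting a PyTorch model to ONNX before ONNX Runtime can execute it. The toy model is illustrative; real deployments layer quantization, calibration, and per-platform tuning on top of this.

```python
# Export a trained PyTorch model to ONNX so a generic runtime can load it.
# The two-layer model below is a stand-in for a real network.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()

dummy_input = torch.randn(1, 64)  # tracing input; fixes tensor shapes
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["features"],
    output_names=["logits"],
    dynamic_axes={"features": {0: "batch"}},  # keep batch size flexible
)
# The .onnx artifact still needs per-platform work (delegates, execution
# providers, quantization) before it performs well on a given device.
```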

picoInference takes the opposite approach. Each Picovoice product ships with its own inference runtime, hand-optimized for that model architecture by the same team that designed the model. Cheetah Streaming Speech-to-Text has its own runtime, as do Cobra Voice Activity Detection and Orca Streaming Text-to-Speech. Every runtime targets server, desktop, mobile, embedded, and web environments natively. There is no conversion step; the runtime is distributed inside the SDK.
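With the runtime inside the SDK, usage reduces to install, create, and process. The sketch below streams audio through Cheetah; it assumes the pvcheetah Python package, an AccessKey from Picovoice Console, and a hypothetical get_next_audio_frame() helper that yields frames of 16-bit PCM samples.

```python
# Minimal streaming speech-to-text sketch with Cheetah. The picoInference
# runtime ships inside the pvcheetah package; there is no model to convert.
import pvcheetah

cheetah = pvcheetah.create(access_key="${ACCESS_KEY}")  # placeholder key

try:
    while True:
        # get_next_audio_frame() is a hypothetical audio source yielding
        # cheetah.frame_length samples at cheetah.sample_rate Hz.
        partial, is_endpoint = cheetah.process(get_next_audio_frame())
        print(partial, end="", flush=True)
        if is_endpoint:
            print(cheetah.flush())  # final transcript for the utterance
finally:
    cheetah.delete()
```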

ON-DEVICE AI INFERENCE ARCHITECTURE

What makes picoInference different from other on-device inference engines

Generic runtimes have to support a wide range of operators, model architectures, and hardware targets. To extract real performance, deployments typically rely on a stack of per-platform delegates or execution providers, such as NNAPI, GPU, Hexagon, Core ML, and CUDA, each maintained alongside the model.

picoInference inverts that tradeoff. Parallelized kernels are written for each operator on each instruction set. The result is real-time AI inference with single-digit CPU utilization, the ability to run across platforms, including devices as small as a Cortex-M4 microcontroller, and consistent low-latency execution across every supported platform.

Why enterprises choose the picoInference on-device inference engine

picoInference is the on-device AI inference engine chosen by enterprises that need real-time, low-latency model execution on every platform their product ships to. Each runtime is optimized for the model class and specific hardware platforms, resulting in single-digit CPU utilization on commodity hardware while running entirely on-device with no cloud dependency.

01 Cross-Platform Support
picoInference runs natively on Linux, Windows, macOS, Raspberry Pi, Android, and iOS, plus all major browsers (Chrome, Edge, Firefox, Safari) via WebAssembly. Hardware support spans ARM, AMD, Intel, NVIDIA, and Qualcomm chipsets, with CUDA, Metal, and WebGPU acceleration where available.
02 Model-Optimized Runtime
Every Picovoice product ships with its own inference runtime, optimized for the model by the same team that builds the model. The runtime knows exactly which model it executes, so it can pre-compile the operator graph, fix memory layouts at build time, and skip the dispatch decisions generic engines make at runtime.
03 Lowest CPU Utilization
Most on-device runtimes, such as PyTorch Mobile and ONNX Runtime, are repurposed server runtimes. Even when adapted for edge deployment, they carry extra compute overhead. picoInference is a purpose-built inference engine optimized for specific models and hardware. The benchmarks below compare picoInference-powered engines with popular alternatives.
Core Hour Ratio, Streaming Speech-to-Text (lower is better)
Cheetah Streaming: 0.083x
Vosk Streaming Large: 0.12x
Whisper.cpp Streaming Base: 1.67x
Moonshine Streaming Medium: 3.36x

Translation Speed in words/sec (higher is better)
Zebra (DE → EN): 112
Opus (DE → EN): 45
Zebra (EN → FR): 105
Opus (EN → FR): 41
Zebra (ES → IT): 98
Opus (ES → IT): 36

TTS CPU Utilization (lower is better)
Orca Streaming TTS: 0.16x
Pocket TTS: 0.37x
Piper TTS: 0.54x

Core Hour Ratio, Speaker Diarization (lower is better)
Falcon: 0.02x
pyannote: 4.42x
04 End-to-End Optimization
Models, training pipelines powered by picoGym, compression algorithms powered by picoCompression, and inference runtimes are designed together at Picovoice. The benefits of each layer compound: smaller models that run fast even on resource-constrained hardware, with the highest accuracy in their class.
Word Accuracy vs. Core Hour
[Scatter chart: Word Error Rate (lower is better, logarithmic axis) vs. core-hours for Picovoice Cheetah, Vosk Small, Vosk Large, Moonshine Tiny, Moonshine Small, Moonshine Medium, Whisper.cpp Tiny, and Whisper.cpp Base. A dashed line marks the real-time threshold; engines above it cannot process audio as fast as it arrives on a single core.]

Model Size vs. Accuracy
[Scatter chart: Word Error Rate (lower is better, logarithmic axis) vs. model size, from 10 MB to 10 GB, for the same eight engines. Size buckets are evenly spaced; absolute distances are not proportional.]
05 Single-File SDK Integration
Each Picovoice product SDK ships with its inference runtime included. There is no separate runtime to install and no model server to configure. Developers pull SDKs from GitHub or package managers such as NuGet, npm, PyPI, Maven, and CocoaPods, and call the inference API directly, as in the sketch after this list.
06 No ML Team Required
picoInference comes as a fully packaged SDK: there is no model architecture to choose, no quantization to configure, no calibration data to prepare, and no kernel to write. Each runtime arrives ready to run with its built-in models, with no machine learning expertise required.
07 Enterprise Ready
picoInference is the runtime stack behind Picovoice deployments at Fortune 500 enterprises in regulated industries. Each runtime is versioned, signed, and distributed through the same package managers your security and procurement teams already approve. Long-term support and NDA-protected engagements are available through enterprise plans.
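The integration pattern item 05 describes is the same for every SDK: one package install, one import, one create() call. A minimal sketch using Porcupine Wake Word, assuming an AccessKey from Picovoice Console and a hypothetical next_frame() helper that supplies 16-bit PCM frames:

```python
# pip install pvporcupine  -- the runtime and built-in models come with it.
import pvporcupine

porcupine = pvporcupine.create(
    access_key="${ACCESS_KEY}",  # placeholder for your AccessKey
    keywords=["picovoice"],      # one of the built-in wake words
)

try:
    while True:
        # next_frame() is a hypothetical audio source yielding
        # porcupine.frame_length samples at porcupine.sample_rate Hz.
        keyword_index = porcupine.process(next_frame())
        if keyword_index >= 0:
            print("wake word detected")
finally:
    porcupine.delete()
```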
Get Started

Run AI inference on every platform your product ships to.

Picovoice distills years of research in model training and inference into production-grade on-device AI SDKs. Start Free to get up and running in hours, or talk to sales for enterprise support.

FAQ
What is on-device AI inference?

On-device AI inference is the process of executing a trained machine learning model directly on the user's hardware, without sending data to a cloud server. The runtime that performs this execution is called an on-device inference engine. picoInference is the technology behind Picovoice voice, language, and vision products.
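A minimal illustration with Cobra Voice Activity Detection: every frame is scored locally, and no audio ever leaves the device. The sketch assumes the pvcobra Python package and a hypothetical read_frame() helper returning 16-bit PCM frames.

```python
import pvcobra

cobra = pvcobra.create(access_key="${ACCESS_KEY}")  # placeholder key

try:
    while True:
        # read_frame() is a hypothetical helper returning
        # cobra.frame_length samples at cobra.sample_rate Hz.
        voice_probability = cobra.process(read_frame())  # runs on-device
        if voice_probability > 0.5:
            print("voice detected")
finally:
    cobra.delete()
```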

What is the difference between AI inference and AI training?

AI training is the process of building a machine learning model: feeding it data, adjusting its weights, and producing a final model file. Training is computationally expensive and typically requires cloud infrastructure or specialized hardware; done carelessly, it also yields models too large to run efficiently. AI inference is the process of using that finished model to make predictions on new inputs. It runs every time the application processes data and is the only step a deployed product actually executes at runtime. Inference is therefore the dominant memory cost, latency source, and engineering surface area of most production AI systems.
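A toy example of the split, fitting a one-parameter linear model y = w·x to data generated by y = 2x. Training iterates over the data and updates the weight; inference is a single forward pass with the weight frozen, and it is all a deployed product runs:

```python
# Data generated by y = 2x.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

# --- Training: adjust the weight with gradient descent (offline, costly) ---
w, lr = 0.0, 0.01
for _ in range(1000):
    for x, y in zip(xs, ys):
        grad = 2.0 * (w * x - y) * x  # derivative of squared error w.r.t. w
        w -= lr * grad

# --- Inference: a forward pass with the trained, frozen weight (runtime) ---
print(w * 10.0)  # ~20.0
```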

How does picoInference compare to PyTorch Mobile, TensorFlow Lite, ONNX Runtime, and llama.cpp?

PyTorch Mobile, ExecuTorch, TensorFlow Lite (LiteRT), and ONNX Runtime are all generic runtimes designed to execute any model. They require model conversion and per-platform tuning, and their performance is bound by what a generic implementation can do. llama.cpp is purpose-built but covers only LLM-class models. picoInference powers Picovoice voice, language, and vision models and is optimized for each model class and for every hardware and software platform it is expected to run on. With picoInference, voice, language, and vision models each get a runtime built for their architecture, with native execution on every supported platform and no conversion step.

Can picoInference run LLMs on edge devices?

Yes. picoLLM Inference Engine, the LLM-specific runtime in the picoInference family, runs open-weight large language models, including Gemma, Llama, Mistral, Mixtral, and Phi, on CPU and GPU across desktop, mobile, embedded, and web environments. It is purpose-built to execute models compressed by picoCompression, which uses variable bit-rate quantization (e.g., 2.56 bits per weight on average) that no standard inference engine can run.
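A minimal sketch of picoLLM usage, assuming the picollm Python package and a picoCompression-quantized .pllm model file downloaded from Picovoice Console (the path below is a placeholder):

```python
import picollm

pllm = picollm.create(
    access_key="${ACCESS_KEY}",  # placeholder for your AccessKey
    model_path="./phi-2.pllm",   # placeholder .pllm model file
)

try:
    res = pllm.generate(prompt="Explain on-device inference in one sentence.")
    print(res.completion)        # generation ran entirely on-device
finally:
    pllm.release()
```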

What platforms does picoInference run on?

Every picoInference runtime targets the same set of platforms natively: Linux, Windows, macOS, Raspberry Pi, ARM Cortex-A, ARM Cortex-M microcontrollers, Android, iOS, and all major browsers (Chrome, Safari, Firefox, Edge) via WebAssembly. GPU acceleration is supported through CUDA, Metal, and WebGPU where available. Check each product's documentation for the specific platforms it ships on.

Why does picoInference achieve such low CPU utilization?

picoInference is purpose-built for cross-platform on-device AI inference, and each runtime is optimized for its specific model and platform. Generic runtimes are usually repurposed server runtimes that make decisions at runtime via dispatch logic, which introduces overhead a purpose-built runtime avoids. The result is single-digit CPU utilization on commodity hardware for real-time models like wake word and VAD.