Purpose-built on-device AI inference engine that efficiently runs voice, language, and vision models in embedded, mobile, web, and desktop applications, outperforming server-repurposed runtimes like PyTorch Mobile and ONNX Runtime.
Generic on-device inference engines like TensorFlow Lite, PyTorch Mobile, ExecuTorch, and ONNX Runtime are designed to execute any model. That generality has a cost: every deployment requires conversion, calibration, and per-platform tuning to extract real performance, and even then, the result is rarely as fast as a runtime built specifically for the model in question.
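To make that cost concrete, here is a sketch of the conversion and quantization step a generic runtime typically requires before deployment, using TensorFlow Lite as a representative example (the paths and options are illustrative, not taken from any specific deployment):

```python
import tensorflow as tf

# Typical generic-runtime workflow: export the trained model, convert it to the
# runtime's format, apply post-training quantization, then tune per platform.
# Every step is a place to lose accuracy or performance.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```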
picoInference takes the opposite approach. Each Picovoice product ships with its own inference runtime, hand-optimized for that model architecture by the same team that designed the model. Cheetah Streaming Speech-to-Text has its own runtime; so do Cobra Voice Activity Detection and Orca Streaming Text-to-Speech. Every runtime targets server, desktop, mobile, embedded, and web environments natively. No conversion step. The runtime is distributed inside the SDK.
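By contrast, a Picovoice SDK loads its model and runtime directly from the package. A minimal sketch using the Cheetah Python SDK (assumes the `pvcheetah` package and a valid AccessKey; audio capture is omitted, and the exact signatures are in the Cheetah docs):

```python
import pvcheetah

# The model and its runtime ship inside the SDK; there is no conversion,
# calibration, or per-platform tuning step before this call.
cheetah = pvcheetah.create(access_key="${ACCESS_KEY}")

def transcribe(frames):
    # `frames` yields 16-bit PCM chunks of cheetah.frame_length samples at
    # cheetah.sample_rate, e.g. read from a microphone.
    for frame in frames:
        partial_transcript, is_endpoint = cheetah.process(frame)
        print(partial_transcript, end="", flush=True)
        if is_endpoint:
            print(cheetah.flush())
```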
Generic runtimes have to support a wide range of operators, model architectures, and hardware targets. To extract real performance, deployments typically rely on a stack of per-platform delegates or execution providers, such as NNAPI, GPU, Hexagon, Core ML, and CUDA, each maintained alongside the model.
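What that per-platform tuning looks like in practice: the same converted model is paired with a different delegate on each target, each configured and maintained separately. A hedged sketch with TensorFlow Lite (the delegate library name is illustrative):

```python
import tensorflow as tf

# One delegate per platform: GPU on Android, Core ML on iOS, NNAPI, Hexagon,
# CUDA elsewhere. Each is a separate integration to configure and maintain.
interpreter = tf.lite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[
        tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")
    ],
)
interpreter.allocate_tensors()
```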
picoInference inverts that tradeoff: parallelized kernels are hand-written for each operator on each instruction set. The result is real-time AI inference with single-digit CPU utilization and consistent low-latency execution on every supported platform, including devices as small as a Cortex-M4 microcontroller.
picoInference is the on-device AI inference engine chosen by enterprises that need real-time, low-latency model execution on every platform their product ships to. Each runtime is optimized for the model class and specific hardware platforms, resulting in single-digit CPU utilization on commodity hardware while running entirely on-device with no cloud dependency.
Picovoice distills years of research in model training and inference into production-grade on-device AI SDKs. Start Free to get up and running in hours, or talk to sales for enterprise support.
On-device AI inference is the process of executing a trained machine learning model directly on the user's hardware, without sending data to a cloud server. The runtime that performs this execution is called an on-device inference engine. picoInference is the technology behind Picovoice voice, language and vision products.
AI training is the process of building a machine learning model: feeding it data, adjusting its weights, and producing a final model file. Done carelessly, training yields large models that are computationally expensive and require cloud infrastructure or specialized hardware to run. AI inference is the process of using that finished model to make predictions on new inputs. It runs every time the application processes data and is the only step a deployed product actually executes at runtime. Inference is therefore the dominant memory cost, latency source, and engineering surface area of most production AI systems.
PyTorch Mobile, ExecuTorch, TensorFlow Lite (LiteRT), and ONNX Runtime are all generic runtimes designed to execute any model. They require model conversion and per-platform tuning, and their performance is bound by what a generic implementation can do. llama.cpp is purpose-built but covers only LLM-class models. picoInference powers Picovoice voice, language, and vision models, with each runtime optimized for its model class and for every hardware and software platform it is expected to run on. With picoInference, voice models, language models, and vision models each get a runtime built for their architecture, with native execution on every supported platform and no conversion step.
picoInference executes on-device voice, language and vision models:
Yes. picoLLM Inference Engine, the LLM-specific runtime in the picoInference family, runs open-weight large language models, including Gemma, Llama, Mistral, Mixtral, and Phi, on CPU and GPU across desktop, mobile, embedded, and web environments. It is purpose-built to execute models compressed by picoCompression, which uses variable bit-rate quantization (e.g., 2.56 bits per weight on average) that no standard inference engine can run.
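A minimal sketch of loading and prompting a compressed model with the picoLLM Python SDK (assumes the `picollm` package, a valid AccessKey, and a downloaded `.pllm` model file; the file name below is hypothetical, and the full parameter list is in the picoLLM docs):

```python
import picollm

# Load a picoCompression-quantized model; the same .pllm file runs on CPU or
# GPU across supported platforms without a conversion step.
pllm = picollm.create(
    access_key="${ACCESS_KEY}",
    model_path="phi2-290.pllm",  # hypothetical model file name
)

res = pllm.generate(prompt="Summarize on-device inference in one sentence.")
print(res.completion)
```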
Every picoInference runtime targets the same set of platforms natively: Linux, Windows, macOS, Raspberry Pi, ARM Cortex-A, ARM Cortex-M microcontrollers, Android, iOS, and all major browsers (Chrome, Safari, Firefox, Edge) via WebAssembly. GPU acceleration is supported through CUDA, Metal, and WebGPU, where available. Please check each product's documentation for the specific platforms it ships on:
picoInference is purpose-built for cross-platform on-device AI inference. Each runtime is optimized for its specific model and platform. Generic runtimes are usually repurposed from server runtimes and make decisions at runtime via dispatch logic, which introduces overhead that an optimized runtime can avoid. The result is single-digit CPU utilization on commodity hardware for real-time models like wake word and VAD.
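As an illustration of that kind of real-time workload, a minimal voice activity detection loop with the Cobra Python SDK (assumes the `pvcobra` package and a valid AccessKey; audio capture and the 0.5 threshold are illustrative):

```python
import pvcobra

cobra = pvcobra.create(access_key="${ACCESS_KEY}")

def detect_voice(frames):
    # `frames` yields 16-bit PCM chunks of cobra.frame_length samples at
    # cobra.sample_rate. Each call returns a voice probability and is light
    # enough to keep CPU utilization in the single digits on commodity hardware.
    for frame in frames:
        if cobra.process(frame) > 0.5:
            print("voice detected")
```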