Purpose-built on-device AI inference engine that efficiently runs voice, language, and vision models in embedded, mobile, web, and desktop applications, outperforming server-repurposed runtimes like PyTorch Mobile and ONNX Runtime.
Generic on-device inference engines like TensorFlow Lite, PyTorch Mobile, ExecuTorch, and ONNX Runtime are designed to execute any model. That generality has a cost: every deployment requires conversion, calibration, and per-platform tuning to extract real performance, and even then, the result is rarely as fast as a runtime built specifically for the model in question.
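To make that cost concrete, here is a sketch of the conversion and quantization step a generic runtime typically requires before deployment, using TensorFlow Lite as a representative example (the paths and options are illustrative, not taken from any specific deployment):

```python
import tensorflow as tf

# Typical generic-runtime workflow: export the trained model, convert it to the
# runtime's format, apply post-training quantization, then tune per platform.
# Every step is a place to lose accuracy or performance.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```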
picoInference takes the opposite approach. Each Picovoice product ships with its own inference runtime, hand-optimized for that model architecture by the same team that designed the model. Cheetah Streaming Speech-to-Text has its own runtime; so do Cobra Voice Activity Detection and Orca Streaming Text-to-Speech. Every runtime targets server, desktop, mobile, embedded, and web environments natively. No conversion step. The runtime is distributed inside the SDK.
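By contrast, a Picovoice SDK loads its model and runtime directly from the package. A minimal sketch using the Cheetah Python SDK (assumes the `pvcheetah` package and a valid AccessKey; audio capture is omitted, and the exact signatures are in the Cheetah docs):

```python
import pvcheetah

# The model and its runtime ship inside the SDK; there is no conversion,
# calibration, or per-platform tuning step before this call.
cheetah = pvcheetah.create(access_key="${ACCESS_KEY}")

def transcribe(frames):
    # `frames` yields 16-bit PCM chunks of cheetah.frame_length samples at
    # cheetah.sample_rate, e.g. read from a microphone.
    for frame in frames:
        partial_transcript, is_endpoint = cheetah.process(frame)
        print(partial_transcript, end="", flush=True)
        if is_endpoint:
            print(cheetah.flush())
```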
Generic runtimes have to support a wide range of operators, model architectures, and hardware targets. To extract real performance, deployments typically rely on a stack of per-platform delegates or execution providers, such as NNAPI, GPU, Hexagon, Core ML, and CUDA, each maintained alongside the model.
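What that per-platform tuning looks like in practice: the same converted model is paired with a different delegate on each target, each configured and maintained separately. A hedged sketch with TensorFlow Lite (the delegate library name is illustrative):

```python
import tensorflow as tf

# One delegate per platform: GPU on Android, Core ML on iOS, NNAPI, Hexagon,
# CUDA elsewhere. Each is a separate integration to configure and maintain.
interpreter = tf.lite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[
        tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")
    ],
)
interpreter.allocate_tensors()
```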
picoInference inverts that tradeoff: parallelized kernels are hand-written for each operator on each instruction set. The result is real-time AI inference with single-digit CPU utilization and consistent low-latency execution on every supported platform, including devices as small as a Cortex-M4 microcontroller.
picoInference is the on-device AI inference engine chosen by enterprises that need real-time, low-latency model execution on every platform their product ships to. Each runtime is optimized for the model class and specific hardware platforms, resulting in single-digit CPU utilization on commodity hardware while running entirely on-device with no cloud dependency.
Picovoice distills years of research in model training and inference into production-grade on-device AI SDKs. Start Free to get up and running in hours, or talk to sales for enterprise support.
On-device AI inference is the process of executing a trained machine learning model directly on the user's hardware, without sending data to a cloud server. The runtime that performs this execution is called an on-device inference engine. picoInference is the technology behind Picovoice voice, language and vision products.
AI training is the process of building a machine learning model: feeding it data, adjusting its weights, and producing a final model file. Done carelessly, training yields large models that are computationally expensive and require cloud infrastructure or specialized hardware to run. AI inference is the process of using that finished model to make predictions on new inputs. It runs every time the application processes data and is the only step a deployed product actually executes at runtime. Inference is therefore the dominant memory cost, latency source, and engineering surface area of most production AI systems.
PyTorch Mobile, ExecuTorch, TensorFlow Lite (LiteRT), and ONNX Runtime are all generic runtimes designed to execute any model. They require model conversion and per-platform tuning, and their performance is bound by what a generic implementation can do. llama.cpp is purpose-built but covers only LLM-class models. picoInference powers Picovoice voice, language, and vision models, with each runtime optimized for its model class and for every hardware and software platform it is expected to run on. With picoInference, voice models, language models, and vision models each get a runtime built for their architecture, with native execution on every supported platform and no conversion step.
picoInference executes on-device voice, language and vision models:
Yes. picoLLM Inference Engine, the LLM-specific runtime in the picoInference family, runs open-weight large language models, including Gemma, Llama, Mistral, Mixtral, and Phi, on CPU and GPU across desktop, mobile, embedded, and web environments. It is purpose-built to execute models compressed by picoCompression, which uses variable bit-rate quantization (e.g., 2.56 bits per weight on average) that no standard inference engine can run.
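A minimal sketch of loading and prompting a compressed model with the picoLLM Python SDK (assumes the `picollm` package, a valid AccessKey, and a downloaded `.pllm` model file; the file name below is hypothetical, and the full parameter list is in the picoLLM docs):

```python
import picollm

# Load a picoCompression-quantized model; the same .pllm file runs on CPU or
# GPU across supported platforms without a conversion step.
pllm = picollm.create(
    access_key="${ACCESS_KEY}",
    model_path="phi2-290.pllm",  # hypothetical model file name
)

res = pllm.generate(prompt="Summarize on-device inference in one sentence.")
print(res.completion)
```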
Every picoInference runtime targets the same set of platforms natively: Linux, Windows, macOS, Raspberry Pi, ARM Cortex-A, ARM Cortex-M microcontrollers, Android, iOS, and all major browsers (Chrome, Safari, Firefox, Edge) via WebAssembly. GPU acceleration is supported through CUDA, Metal, and WebGPU, where available. Please check each product's documentation for the specific platforms it ships on:
picoInference is purpose-built for cross-platform on-device AI inference. Each runtime is optimized for its specific model and platform. Generic runtimes are usually repurposed from server runtimes and make decisions at runtime via dispatch logic, which introduces overhead that an optimized runtime can avoid. The result is single-digit CPU utilization on commodity hardware for real-time models like wake word and VAD.
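As an illustration of that kind of real-time workload, a minimal voice activity detection loop with the Cobra Python SDK (assumes the `pvcobra` package and a valid AccessKey; audio capture and the 0.5 threshold are illustrative):

```python
import pvcobra

cobra = pvcobra.create(access_key="${ACCESS_KEY}")

def detect_voice(frames):
    # `frames` yields 16-bit PCM chunks of cobra.frame_length samples at
    # cobra.sample_rate. Each call returns a voice probability and is light
    # enough to keep CPU utilization in the single digits on commodity hardware.
    for frame in frames:
        if cobra.process(frame) > 0.5:
            print("voice detected")
```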