
Text-to-Speech Benchmark

TL;DR: Orca Streaming Text-to-Speech is the fastest text-to-speech engine for voice AI agents, with 127ms first-token-to-speech latency: 2.6× faster than the best cloud alternative (ElevenLabs Streaming at 335ms), ~11× faster than the fastest on-device open-source alternative (ESpeak-NG at 1,400ms), 22× faster than OpenAI TTS, and 346× faster than Chatterbox-TTS-Turbo.

Orca is also the most efficient on-device TTS engine: 29 MB peak memory (vs. 320 MB–7,500 MB for alternatives), a 0.16 core-hour ratio (119× lower than Chatterbox-TTS-Turbo), and a 7 MB model, 425× smaller than Chatterbox-TTS-Turbo's.

This open-source TTS benchmark evaluates 15 text-to-speech (TTS) engines on the metrics that matter most for production voice AI agents:

  • Latency — First-token-to-speech, voice assistant response time
  • Resource utilization — Core-hour ratio, memory footprint, model size
  • Audio quality — Subjective naturalness and human-like prosody

TTS Engines Compared

Four cloud APIs and ten on-device engines, with ElevenLabs measured in two configurations for 15 configurations in total, organized below by deployment model:

  • Cloud APIs:
    • Amazon Polly.
    • Azure Text-to-Speech.
    • ElevenLabs.
      • With streaming audio output only.
      • With streaming input using WebSocket API.
    • OpenAI TTS.
  • On-device TTS running on CPU:
    • Picovoice Orca.
    • Chatterbox-TTS-Turbo.
    • ESpeak-NG.
    • Kitten-TTS-Nano-0.8-INT8.
    • Kokoro-TTS.
    • Neu-TTS-Nano-Q4-GGUF.
    • Piper-TTS.
    • Pocket-TTS.
    • Soprano-TTS.
    • Supertonic-TTS-2.

Methodology

This benchmark simulates user–voice AI agent interactions by generating LLM responses to user questions and synthesizing the response to speech as soon as possible. User queries are sampled from a public dataset and fed to picoLLM (llama-3.2-1b-instruct-385). picoLLM generates responses token-by-token. Those tokens are passed to each engine to measure response time and resource utilization.

Engines

TTS engines differ in how they handle input and output streaming, which is a key factor in voice AI agent latency. Most engines in this benchmark support streaming audio output and play audio incrementally. The exceptions are Chatterbox-TTS-Turbo and Kitten-TTS-Nano, which require the complete audio to be synthesized before playback can begin. (See single synthesis.)

Only ElevenLabs and Orca support both streaming input and streaming audio output. They start synthesis without waiting for the full text input and play audio incrementally as speech is synthesized. (See streaming synthesis.) ElevenLabs supports streaming input via WebSocket by chunking text at punctuation marks and sending pre-analyzed chunks to the engine. Picovoice Orca Streaming Text-to-Speech processes raw LLM tokens as soon as they are produced, without requiring special language markers.
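The difference between the two input-streaming strategies can be sketched as follows. This is a simplified illustration, not either vendor's actual API: the chunking rules and function names are invented for the example.

```python
# Simplified sketch of the two input-streaming strategies described
# above. The punctuation rules and function names are illustrative,
# not either vendor's actual API.

PUNCTUATION = {".", "!", "?", ",", ";", ":"}

def punctuation_chunker(tokens):
    """Buffer LLM tokens and emit a text chunk at each punctuation
    mark (the ElevenLabs-style WebSocket input strategy)."""
    buffer = []
    for token in tokens:
        buffer.append(token)
        if token.rstrip() and token.rstrip()[-1] in PUNCTUATION:
            yield "".join(buffer)
            buffer = []
    if buffer:  # flush trailing text that never hit punctuation
        yield "".join(buffer)

def token_streamer(tokens):
    """Hand every token to the engine the moment it is produced
    (the Orca-style strategy: no buffering, no special markers)."""
    yield from tokens

tokens = ["Your", " flight", " departs", " at", " 9am.", " Gate", " B4."]
# The chunker cannot emit anything until " 9am." arrives; the token
# streamer makes "Your" available to the TTS engine immediately.
assert list(punctuation_chunker(tokens)) == ["Your flight departs at 9am.", " Gate B4."]
assert list(token_streamer(tokens)) == tokens
```

The chunker's buffering delay is exactly the latency gap the FTTS metric below captures: synthesis cannot begin until a full chunk exists.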

Dataset

User queries are drawn from the public taskmaster2 dataset, which contains goal-oriented conversations between users and AI agents. Queries cover flight booking, food ordering, hotel booking, movies, music recommendations, restaurant search, and sports. The LLM is prompted to respond like a helpful voice AI agent, with responses varying from a few words to several sentences to reflect a realistic range of voice assistant interactions.

Latency Metrics

All latency metrics measure time-to-first-byte: the time from when a request is sent until the first byte of the response is received. The three metrics below isolate different parts of the voice agent pipeline. Lower is better.

  • Voice Assistant Response Time (VART): The time from the moment the user's request is sent to the LLM until the TTS engine produces the first byte of speech. This is the end-to-end latency users perceive in a live voice assistant interaction. VART = TTFT + FTTS.
  • First Token To Speech (FTTS): The time from the LLM's first output token to the TTS engine's first byte of speech audio. FTTS is the most direct measure of TTS engine responsiveness, holding LLM behavior constant across all experiments to isolate the TTS contribution.
  • Time to First Token (TTFT): The time from the user's request to the LLM's first output token. TTFT depends on the inference engine, language model, network latency, and prompt. These variables exist outside the TTS engine, so this benchmark uses picoLLM, which provides a guaranteed response time, to minimize TTFT variance across experiments and isolate TTS performance.
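The relationship between the three metrics can be written down directly. Below is a minimal sketch; the timestamps are illustrative, chosen to match Orca's reported ~127 ms FTTS and ~204 ms VART.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    """Timestamps (in seconds) of the three measured events."""
    request_sent: float       # user's request is sent to the LLM
    first_llm_token: float    # LLM emits its first output token
    first_speech_byte: float  # TTS engine emits its first audio byte

    @property
    def ttft(self) -> float:
        """Time to First Token: LLM-side latency."""
        return self.first_llm_token - self.request_sent

    @property
    def ftts(self) -> float:
        """First Token to Speech: TTS-side latency."""
        return self.first_speech_byte - self.first_llm_token

    @property
    def vart(self) -> float:
        """Voice Assistant Response Time: what the user perceives."""
        return self.first_speech_byte - self.request_sent

# Illustrative timestamps: 77 ms TTFT + 127 ms FTTS = 204 ms VART.
x = Interaction(request_sent=0.0, first_llm_token=0.077, first_speech_byte=0.204)
assert abs(x.vart - (x.ttft + x.ftts)) < 1e-9
```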

Efficiency Metrics

Cloud TTS engines are excluded from efficiency metrics. Unlike on-device alternatives, cloud TTS APIs have no downloadable model and their compute and memory run on vendor infrastructure, preventing independent measurement. Lower is better.

  • CPU Core Hour Ratio: The number of CPU core-hours required to synthesize one hour of audio. A ratio of 1.0 means the engine fully occupies one CPU core to synthesize in real-time. A ratio below 1.0 means the engine can synthesize faster than real-time on a single core; above 1.0 means it cannot.
  • Peak Memory (RAM) Usage: Maximum RAM consumed by the TTS engine during synthesis, excluding LLM inference and Python setup overhead.
    • Memory Usage vs. Available Memory: Memory availability is not the same as total device RAM. Background services (SSH, networking, logging) consume memory before any application starts. As a practical guideline, a TTS engine used in a real-time voice AI application should be treated as part of the total app memory budget, which on mobile typically should not exceed 150–200 MB on low-end devices to avoid out-of-memory (OOM) termination. Both Android and iOS use low-memory killers that terminate processes when free memory falls below a threshold, and apps consuming more memory are killed first.
  • Model Size: Total binary file size required to run the engine, excluding common Python packages such as PyTorch. Includes the grapheme-to-phoneme (G2P) component where required (e.g., espeak-ng, misaki). Model size affects application download size and over-the-air update feasibility, which is critical for mobile and web deployments.
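As a sketch of how the core-hour ratio works in practice (a hypothetical helper; the benchmark repository may compute it differently):

```python
def core_hour_ratio(cpu_seconds: float, audio_seconds: float) -> float:
    """CPU core-hours consumed per hour of synthesized audio.
    Below 1.0, the engine synthesizes faster than real-time on one core."""
    return cpu_seconds / audio_seconds

# Orca's reported figure: 9.6 CPU-minutes per hour of audio -> 0.16.
assert core_hour_ratio(cpu_seconds=9.6 * 60, audio_seconds=3600) == 0.16

# A ratio of 19 means 19 fully occupied cores are needed just to
# keep up with real-time playback (Chatterbox-TTS-Turbo's figure).
assert core_hour_ratio(cpu_seconds=19 * 3600, audio_seconds=3600) == 19.0
```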

Results

The Open-Source TTS Benchmark figures below show the response times and efficiency of each engine, averaged over roughly 200 simulated user–voice AI agent interactions.

Latency

These two charts isolate the two time periods of voice AI latency: end-to-end response time (VART), and the TTS engine's portion of the pipeline (FTTS).

Voice Assistant Response Time

Voice Assistant Response Time (VART) measures the end-to-end latency from user request to first audio output. VART is the latency users actually experience, as they perceive the total delay until the AI agent starts responding to them. According to turn-taking research in spoken dialogue systems, 200ms feels instantaneous; above 1,000ms (1 second), the interaction feels broken.

Only Picovoice Orca responds at the ~200ms "instantaneous" threshold. ElevenLabs Streaming is the only other engine under the 1,000ms natural-interaction threshold. ESpeak-NG is the second-best on-device engine after Orca at ~1,500ms, roughly matching the second-best cloud engine, ElevenLabs TTS.

Chatterbox-TTS-Turbo (44.8s) and Kitten TTS Nano (10.7s) introduce delays of 10–45 seconds because they lack streaming output, making them effectively unusable for real-time voice.

Horizontal bar chart titled 'Voice Assistant Response Time' comparing end-to-end latency (lower is better). Picovoice Orca leads at 204ms, followed by ElevenLabs Streaming (504ms), ESPEAK-NG (1,504ms), ElevenLabs TTS (1,548ms), Piper TTS (1,587ms), Amazon Polly (1,614ms), Azure TTS (1,656ms), Soprano TTS (1,665ms), Pocket TTS (1,744ms), Supertonic TTS 2 (2,526ms), Neu TTS Nano Q4 GGUF (2,629ms), OpenAI TTS (2,925ms), Kokoro TTS (3,000ms), Kitten TTS Nano 0.8 INT8 (*10,741ms), and Chatterbox TTS Turbo (*44,804ms). Asterisks indicate values capped at 4000ms on the visual scale.

Lowest Voice Assistant Response Time: Picovoice Orca (200 ms)
Highest Voice Assistant Response Time (Cloud): OpenAI Text-to-Speech (2930 ms)
Highest Voice Assistant Response Time (On-device): Chatterbox-TTS-Turbo (44,800 ms)

First Token to Speech

First Token to Speech Latency (FTTS) measures the time from the LLM's first output token to the first byte of speech audio. It’s the cleanest comparison of TTS engine responsiveness, removing the LLM factor from VART.

Picovoice Orca leads with 127ms, 2.6× faster than the nearest competitor (ElevenLabs Streaming at 335ms) and ~11× faster than both standard cloud APIs and ESpeak-NG (1,400ms), the fastest on-device open-source alternative.

Horizontal bar chart titled 'First Token to Speech' comparing TTS latency only (lower is better). Picovoice Orca leads at 127ms, followed by ElevenLabs Streaming (335ms), ESPEAK-NG (1,430ms), ElevenLabs TTS (1,473ms), Piper TTS (1,511ms), Amazon Polly (1,538ms), Azure TTS (1,576ms), Soprano TTS (1,586ms), Pocket TTS (1,666ms), Supertonic TTS 2 (2,450ms), Neu TTS Nano Q4 GGUF (2,550ms), OpenAI TTS (2,848ms), Kokoro TTS (2,923ms), Kitten TTS Nano 0.8 INT8 (*10,667ms), and Chatterbox TTS Turbo (*44,707ms). Asterisks indicate values capped at 4000ms on the visual scale.

Lowest First Token to Speech Latency: Picovoice Orca (130 ms)
Highest First Token to Speech Latency (Cloud): OpenAI Text-to-Speech (2,850 ms)
Highest First Token to Speech Latency (On-device): Chatterbox-TTS-Turbo (44,710 ms)

FTTS and VART differ only by the constant TTFT (time to first LLM token) across all engines. The near-identical rankings confirm the TTS engine, not the LLM, is the dominant contributor to voice agent latency.

Efficiency

All efficiency metrics are measured on AMD Ryzen 7 5700X with 64GB RAM.

CPU Core Hour Ratio

CPU Core Hour Ratio measures the compute cost required to synthesize one hour of audio. Picovoice Orca demonstrates unmatched efficiency at 0.16 core-hours: the closest alternative, Pocket-TTS, requires more than 2.3× the resources, and Chatterbox-TTS-Turbo requires nearly 120× the resources Orca does.

Put concretely: Orca synthesizes an hour of speech using just 9.6 CPU-minutes, 16% of one core for real-time playback. Chatterbox-TTS-Turbo needs 19 full CPU cores at 100% utilization for the same output.

Horizontal bar chart titled 'Core Hour Ratio' comparing CPU efficiency (lower is better). Picovoice Orca leads at 0.16×, followed by Pocket TTS (0.37×), Piper TTS (0.54×), Supertonic TTS 2 (0.84×), Kokoro TTS (1.40×), Kitten TTS Nano 0.8 INT8 (5.13×), Soprano TTS (5.71×), Neu TTS Nano Q4 GGUF (9.84×), and Chatterbox TTS Turbo (19.0×).

Lowest CPU Utilization: Picovoice Orca (0.16x)
Highest CPU Utilization: Chatterbox-TTS-Turbo (19x)

Peak Memory Usage

Peak memory usage shows the maximum memory consumption during audio synthesis. 

Picovoice Orca uses 29MB peak memory, suitable for any deployment environment. Kitten TTS Nano, the second most efficient alternative, requires 320MB (~11× Orca) and is viable for certain mobile devices, whereas Chatterbox-TTS-Turbo at 7,500MB (7.5GB, ~260× Orca) is not suitable for resource-constrained environments.

Horizontal bar chart titled 'Peak Memory Usage' comparing RAM consumption (lower is better). Picovoice Orca uses just 28 MB, followed by Kitten TTS Nano 0.8 INT8 (323 MB), Supertonic TTS 2 (515 MB), Pocket TTS (617 MB), Soprano TTS (710 MB), Kokoro TTS (1,884 MB), Neu TTS Nano Q4 GGUF (2,064 MB), Piper TTS (2,579 MB), and Chatterbox TTS Turbo (7,505 MB).

Lowest Peak Memory: Picovoice Orca (29 MB)
Highest Peak Memory: Chatterbox-TTS-Turbo (7500 MB)

Engines fall into four viable deployment buckets based on peak memory: <30 MB (suitable for all platforms including embedded and web), 300–600 MB (mid-range mobile and desktop), 600–750 MB (high-end mobile only), and >1 GB (desktop/server only). Sample audio by bucket is compared in the Audio Quality section below.
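The buckets above can be expressed as a simple classifier. This is a hypothetical helper using this benchmark's thresholds; the gap between 750 MB and 1 GB, which no measured engine falls into, is bridged at 1 GB here.

```python
def deployment_bucket(peak_memory_mb: float) -> str:
    """Map a TTS engine's peak memory usage to the deployment tiers
    used for the audio-quality groups in this benchmark."""
    if peak_memory_mb < 30:
        return "all platforms, including embedded and web"
    if peak_memory_mb < 600:
        return "mid-range mobile and desktop"
    if peak_memory_mb < 1000:  # 600-750 MB band in the measured data
        return "high-end mobile only"
    return "desktop/server only"

assert deployment_bucket(29) == "all platforms, including embedded and web"  # Orca
assert deployment_bucket(323) == "mid-range mobile and desktop"              # Kitten TTS Nano
assert deployment_bucket(710) == "high-end mobile only"                      # Soprano TTS
assert deployment_bucket(7505) == "desktop/server only"                      # Chatterbox
```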

Model Size

Model size affects application download size and storage requirements — particularly important for mobile and web applications, where lean binaries improve the user's first experience.

Orca's 7 MB model is 425× smaller than Chatterbox-TTS-Turbo's 2,980 MB, the largest gap of any metric in this benchmark. Small model size enables over-the-air updates, reduces first-launch download size, and fits easily on storage-constrained hardware such as Raspberry Pi and older mobile devices.

Horizontal bar chart titled 'Model Size' comparing the on-disk footprint of TTS alternatives. Picovoice Orca is smallest at just 7 MB, followed by Kitten TTS Nano 0.8 INT8 (42 MB), Piper TTS (61 MB), Pocket TTS (242 MB), Supertonic TTS 2 (262 MB), Soprano TTS (280 MB), Kokoro TTS (341 MB), Neu TTS Nano Q4 GGUF (507 MB), and Chatterbox TTS Turbo (2,980 MB).

Smallest Model Size: Picovoice Orca (7 MB)
Largest Model Size: Chatterbox-TTS-Turbo (2980 MB)

Across all three efficiency metrics, Picovoice Orca places first with a 2–11× lead over the next-best alternative. No other engine leads on more than one efficiency metric.

Audio Quality

Audio quality can be evaluated through direct listening comparison using the same test utterances across all engines. Since naturalness and audio quality are subjective, no automated metric is used. Sample audio files are grouped by peak memory usage, which is the primary constraint determining which engines are deployable on a given device. An engine with excellent audio quality but a 2 GB memory footprint is not a viable choice for mobile or embedded applications, regardless of how natural it sounds.

Group 1: Peak Memory Usage < 30 MB

Picovoice Orca and ESpeak-NG require less than 30 MB of memory while running, making them suitable for all types of applications and platforms, including resource-constrained embedded systems and web browsers.

Picovoice Orca
ESpeak-NG
Group 2: Peak Memory Usage 300 MB – 600 MB

Kitten-TTS-Nano-0.8-INT8 and Supertonic-TTS-2 fall in this category. They're suitable for desktop applications and mid-range or better mobile devices, but they exceed the memory headroom of typical embedded systems.

Kitten-TTS-Nano-0.8-INT8
Supertonic-TTS-2
Group 3: Peak Memory Usage 600 MB – 750 MB  

The engines in this group, Pocket-TTS and Soprano-TTS, can run on high-end mobile devices and standard desktop environments. They are not a fit for legacy or mid-range mobile devices, or for embedded systems.

Pocket-TTS
Soprano-TTS
Group 4: Peak Memory Usage > 1 GB

This group of TTS engines, Kokoro-TTS, Neu-TTS-Nano-Q4-GGUF, Piper-TTS, and Chatterbox-TTS-Turbo, requires powerful devices. When used in real-time applications on average consumer devices, they may hinder the user experience.

Kokoro-TTS
Neu-TTS-Nano-Q4-GGUF
Piper-TTS
Chatterbox-TTS-Turbo

Platform Support

Platform support covers the operating systems, runtimes, and hardware targets officially supported by each engine. Picovoice Orca is the only engine in this benchmark with native production SDKs across Linux, macOS, Windows, Android, iOS, Raspberry Pi 3/4/5, and all major browsers simultaneously.

Cloud TTS APIs (Amazon Polly, Azure TTS, ElevenLabs TTS, ElevenLabs Streaming, OpenAI TTS) are excluded — they run on vendor infrastructure and have no on-device deployment.

| Platform | Picovoice Orca | Kitten TTS Nano INT8 | Supertonic TTS 2 | Neu TTS Nano Q4 | Kokoro TTS | Chatterbox Turbo | ESpeak NG | Pocket TTS | Piper TTS | Soprano TTS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Desktop** | | | | | | | | | | |
| Linux (x86_64) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| macOS (x86_64) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| macOS (arm64) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Windows (x86_64) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Windows (arm64) | ✓ | ✓ | ✓ | ✓ | – | ✓ | – | – | – | – |
| **Mobile** | | | | | | | | | | |
| Android | ✓ | ✓ | ✓ | ✓ | – | – | ✓ | – | – | – |
| iOS | ✓ | ✓ | ✓ | ✓ | – | – | – | – | – | – |
| **Embedded** | | | | | | | | | | |
| Raspberry Pi 3 | ✓ | ✓ | ✓ | – | – | – | – | – | – | – |
| Raspberry Pi 4 | ✓ | ✓ | ✓ | – | – | – | – | – | – | – |
| Raspberry Pi 5 | ✓ | ✓ | ✓ | – | – | – | – | – | – | – |
| **Browsers** | | | | | | | | | | |
| Chrome | ✓ | ✓ | ✓ | – | ✓ | – | – | – | – | – |
| Safari | ✓ | ✓ | ✓ | – | ✓ | – | – | – | – | – |
| Firefox | ✓ | ✓ | ✓ | – | ✓ | – | – | – | – | – |
| Edge | ✓ | ✓ | ✓ | – | ✓ | – | – | – | – | – |

✓ Supported
– Not supported

Usage

The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents:

  • Amazon Polly
  • Azure TTS
  • ElevenLabs
  • OpenAI TTS
  • Picovoice Orca
  • Chatterbox-TTS-Turbo
  • Kokoro-TTS
  • Kitten-TTS-Nano-0.8-INT8
  • Pocket-TTS
  • Neu-TTS-Nano-Q4-GGUF
  • Piper-TTS
  • Soprano-TTS
  • Supertonic-TTS-2
  • ESpeak-NG
