Text-to-Speech Benchmark
TLDR: Orca Streaming Text-to-Speech is the fastest text-to-speech engine for voice AI agents, with 127ms first-token-to-speech latency, 2.6× faster than the best cloud alternative (ElevenLabs Streaming at 335ms), ~11× faster than the fastest on-device open-source alternative (ESpeak-NG at 1,400ms), 22× faster than OpenAI TTS, and 346× faster than Chatterbox-TTS-Turbo.
Orca is also the most efficient on-device TTS engine: 29MB peak memory (320MB–7,500MB for alternatives), 0.16 core-hour ratio (119× lower than Chatterbox-TTS-Turbo), and a 7 MB model, 425× smaller than Chatterbox-TTS-Turbo.
This open-source TTS benchmark evaluates 15 text-to-speech (TTS) engines on the metrics that matter most for production voice AI agents:
- Latency — First-token-to-speech, voice assistant response time
- Resource utilization — Core-hour ratio, memory footprint, model size
- Audio quality — Subjective naturalness and human-like prosody
TTS Engines Compared
Five cloud API configurations (four providers; ElevenLabs is tested in two modes) and ten on-device engines, organized below by deployment model:
- Cloud APIs:
  - Amazon Polly
  - Azure Text-to-Speech
  - ElevenLabs, tested in two configurations:
    - With streaming audio output only
    - With streaming input using the WebSocket API
  - OpenAI TTS
- On-device TTS engines running on CPU:
  - Picovoice Orca Streaming Text-to-Speech
  - Chatterbox-TTS-Turbo
  - ESpeak-NG
  - Kitten-TTS-Nano-0.8-INT8
  - Kokoro-TTS
  - Neu-TTS-Nano-Q4-GGUF
  - Piper-TTS
  - Pocket-TTS
  - Soprano-TTS
  - Supertonic-TTS-2
Methodology
This benchmark simulates user–voice AI agent interactions by generating LLM responses to user questions and synthesizing each response to speech as soon as possible. User queries are sampled from a public dataset and fed to picoLLM (llama-3.2-1b-instruct-385). picoLLM generates responses token by token; those tokens are passed to each engine to measure response time and resource utilization.
Engines
TTS engines differ in how they handle input and output streaming, a key factor in voice AI agent latency. Most engines in this benchmark support streaming audio output and can play audio incrementally. The exceptions are Chatterbox-TTS-Turbo and Kitten-TTS-Nano, which must synthesize the entire audio before playback can begin. (See single synthesis.)
Only ElevenLabs and Orca support both streaming input and streaming audio output. They start synthesis without waiting for the full text input and play audio incrementally as speech is synthesized. (See streaming synthesis.) ElevenLabs supports streaming input via WebSocket by chunking text at punctuation marks and sending pre-analyzed chunks to the engine. Picovoice Orca Streaming Text-to-Speech processes raw LLM tokens as soon as they are produced, without requiring special language markers.
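The impact of the two synthesis models on first-audio latency can be sketched with a toy simulation. Everything below is a made-up stand-in, not any engine's real API: the token list, the per-character synthesis cost, and the function names are all hypothetical.

```python
import time

TOKENS = ["Your", " flight", " departs", " at", " 9", " a.m."]
COST_PER_CHAR = 0.002  # hypothetical synthesis cost in seconds per character


def single_synthesis_first_audio(tokens):
    # Single synthesis: the engine needs the full text before any audio exists,
    # so the first audio byte appears only after the whole utterance is processed.
    text = "".join(tokens)
    time.sleep(COST_PER_CHAR * len(text))
    # first audio byte is available only now


def streaming_synthesis_first_audio(tokens):
    # Streaming synthesis: audio for the first token is playable immediately;
    # the remaining tokens are synthesized while earlier audio plays back.
    time.sleep(COST_PER_CHAR * len(tokens[0]))
    # first audio byte is available here


t0 = time.perf_counter()
single_synthesis_first_audio(TOKENS)
single_ms = (time.perf_counter() - t0) * 1000

t0 = time.perf_counter()
streaming_synthesis_first_audio(TOKENS)
streaming_ms = (time.perf_counter() - t0) * 1000

print(f"single: {single_ms:.0f} ms, streaming: {streaming_ms:.0f} ms")
```

The gap grows with utterance length: single synthesis pays for every character up front, while streaming synthesis pays only for the first token before audio starts.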
Dataset
User queries are drawn from the public taskmaster2 dataset, which contains goal-oriented conversations between users and AI agents. Queries cover flight booking, food ordering, hotel booking, movies, music recommendations, restaurant search, and sports. The LLM is prompted to respond like a helpful voice AI agent, with responses varying from a few words to several sentences to reflect a realistic range of voice assistant interactions.
Latency Metrics
All latency metrics are time-to-first-output measurements: the time from a triggering event until the first byte of the resulting output is received. The three metrics below isolate different parts of the voice agent pipeline. Lower is better.
- Voice Assistant Response Time (VART): The time from the moment the user's request is sent to the LLM until the TTS engine produces the first byte of speech. This is the end-to-end latency users perceive in a live voice assistant interaction. VART = TTFT + FTTS.
- First Token To Speech (FTTS): The time from the LLM's first output token to the TTS engine's first byte of speech audio. FTTS is the most direct measure of TTS engine responsiveness, holding LLM behavior constant across all experiments to isolate the TTS contribution.
- Time to First Token (TTFT): The time from the user's request to the LLM's first output token. TTFT depends on the inference engine, language model, network latency, and prompt. These variables exist outside the TTS engine, so this benchmark uses picoLLM, which provides a guaranteed response time, to minimize TTFT variance across experiments and isolate TTS performance.
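The relationship between the three metrics can be checked with a toy calculation. The engine names and millisecond values here are illustrative placeholders, not measured benchmark results:

```python
def vart(ttft_ms: float, ftts_ms: float) -> float:
    # End-to-end response time is the LLM's time to first token
    # plus the TTS engine's first-token-to-speech latency.
    return ttft_ms + ftts_ms


# With TTFT held constant across experiments, differences in VART
# come from FTTS alone, isolating the TTS engine's contribution.
ttft = 70.0  # hypothetical constant LLM latency (ms)
engine_ftts = {"engine_a": 130.0, "engine_b": 335.0}
engine_vart = {name: vart(ttft, f) for name, f in engine_ftts.items()}
print(engine_vart)  # {'engine_a': 200.0, 'engine_b': 405.0}
```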
Efficiency Metrics
Cloud TTS engines are excluded from efficiency metrics. Unlike on-device alternatives, cloud TTS APIs have no downloadable model and their compute and memory run on vendor infrastructure, preventing independent measurement. Lower is better.
- CPU Core Hour Ratio: The number of CPU core-hours required to synthesize one hour of audio. A ratio of 1.0 means the engine fully occupies one CPU core to synthesize in real-time. A ratio below 1.0 means the engine can synthesize faster than real-time on a single core; above 1.0 means it cannot.
- Peak Memory (RAM) Usage: Maximum RAM consumed by the TTS engine during synthesis, excluding LLM inference and Python setup overhead.
- Memory Usage vs. Available Memory: Memory availability is not the same as total device RAM. Background services (SSH, networking, logging) consume memory before any application starts. As a practical guideline, a TTS engine used in a real-time voice AI application should be treated as part of the total app memory budget, which on mobile typically should not exceed 150–200 MB on low-end devices to avoid out-of-memory (OOM) termination. Both Android and iOS use low-memory killers that terminate processes when free memory falls below a threshold, and apps consuming more memory are killed first.
- Model Size: Total binary file size required to run the engine, excluding common Python packages such as PyTorch. Includes the grapheme-to-phoneme (G2P) component where required (e.g., espeak-ng, misaki). Model size affects application download size and over-the-air update feasibility, which is critical for mobile and web deployments.
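A minimal sketch of how the core-hour ratio is computed; the timing numbers below are made up for illustration and are not measurements from this benchmark:

```python
def core_hour_ratio(cpu_seconds: float, audio_seconds: float) -> float:
    # CPU core-hours consumed per hour of synthesized audio.
    # 1.0  -> one fully occupied core synthesizes exactly in real time
    # <1.0 -> faster than real time on a single core
    # >1.0 -> cannot keep up with real time on a single core
    return cpu_seconds / audio_seconds


# Illustrative: 96 CPU-seconds to synthesize 10 minutes (600 s) of audio.
ratio = core_hour_ratio(cpu_seconds=96.0, audio_seconds=600.0)
print(ratio)  # 0.16 -> faster than real time
```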
Results
The Open-Source TTS Benchmark figures below report each engine's response time and efficiency, averaged over roughly 200 simulated user–voice AI agent interactions.
Latency
The two charts below present voice AI latency from two angles: end-to-end response time (VART) and the TTS engine's portion of the pipeline (FTTS).
Voice Assistant Response Time
Voice Assistant Response Time (VART) measures the end-to-end latency from user request to first audio output. VART is the latency users actually experience, as they perceive the total delay until the AI agent starts responding to them. According to turn-taking research in spoken dialogue systems, 200ms feels instantaneous; above 1,000ms (1 second), the interaction feels broken.
Only Picovoice Orca achieves an instantaneous-feeling response at the 200ms threshold. ElevenLabs Streaming is the only other engine under the 1,000ms "natural interaction" threshold. ESpeak-NG, at 1,500ms, is the second-best on-device engine after Orca, matching the second-best cloud engine, ElevenLabs TTS.
Chatterbox-TTS-Turbo (44.8s) and Kitten TTS Nano (10.7s) introduce delays of 10–45 seconds because they lack streaming output, making them effectively unusable for real-time voice.

Lowest Voice Assistant Response Time: Picovoice Orca (200 ms)
Highest Voice Assistant Response Time (Cloud): OpenAI Text-to-Speech (2,930 ms)
Highest Voice Assistant Response Time (On-device): Chatterbox-TTS-Turbo (44,800 ms)
First Token to Speech
First Token to Speech Latency (FTTS) measures the time from the LLM's first output token to the first byte of speech audio. It’s the cleanest comparison of TTS engine responsiveness, removing the LLM factor from VART.
Picovoice Orca leads with 127ms, 2.6× faster than the nearest competitor (ElevenLabs Streaming at 335ms) and ~11× faster than both standard cloud APIs and ESpeak-NG (1,400ms), the fastest on-device open-source alternative.

Lowest First Token to Speech Latency: Picovoice Orca (130 ms)
Highest First Token to Speech Latency (Cloud): OpenAI Text-to-Speech (2,850 ms)
Highest First Token to Speech Latency (On-device): Chatterbox-TTS-Turbo (44,710 ms)
FTTS and VART differ only by TTFT (time to first LLM token), which is held constant across all engines. The near-identical rankings confirm that the TTS engine, not the LLM, is the dominant contributor to voice agent latency.
Efficiency
All efficiency metrics are measured on AMD Ryzen 7 5700X with 64GB RAM.
CPU Core Hour Ratio
CPU Core Hour Ratio measures the compute cost required to synthesize one hour of audio. Picovoice Orca demonstrates unmatched efficiency at 0.16 core-hours: the closest alternative, Pocket-TTS, requires more than 2.3× the resources, and Chatterbox-TTS-Turbo requires nearly 120× the resources Orca does.
Put concretely: Orca synthesizes an hour of speech using just 9.6 CPU-minutes, occupying 16% of one core for real-time playback. Chatterbox-TTS-Turbo needs 19 full CPU cores at 100% utilization to produce the same output.
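The arithmetic follows directly from the reported ratios:

```python
# Core-hour ratio x 60 gives CPU-minutes per hour of synthesized audio.
orca_ratio = 0.16
chatterbox_ratio = 19.0

orca_cpu_minutes = orca_ratio * 60        # 9.6 CPU-minutes per audio hour
chatterbox_cores = chatterbox_ratio       # 19 cores fully occupied for real time
speedup = chatterbox_ratio / orca_ratio   # ~119x difference

print(round(orca_cpu_minutes, 1), round(speedup))  # 9.6 119
```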

Lowest CPU Utilization: Picovoice Orca (0.16x)
Highest CPU Utilization: Chatterbox-TTS-Turbo (19x)
Peak Memory Usage
Peak memory usage shows the maximum memory consumption during audio synthesis.
Picovoice Orca uses 29MB peak memory, suitable for any deployment environment. Kitten TTS Nano, the second most efficient alternative, requires 320MB (~11× Orca) and is viable for certain mobile devices, whereas Chatterbox-TTS-Turbo at 7,500MB (7.5GB, ~260× Orca) is not suitable for resource-constrained environments.

Lowest Peak Memory: Picovoice Orca (29 MB)
Highest Peak Memory: Chatterbox-TTS-Turbo (7,500 MB)
Engines fall into four deployment buckets based on peak memory: <30 MB (suitable for all platforms, including embedded and web), 300–600 MB (mid-range mobile and desktop), 600–750 MB (high-end mobile only), and >1 GB (desktop/server only). Sample audio for each bucket is compared in the Audio Quality section below.
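The bucketing can be expressed as a small helper. The thresholds mirror the four groups used in this benchmark, and the example peak-memory figures are the ones reported above; the function itself is an illustrative sketch, not benchmark code.

```python
def deployment_bucket(peak_mb: float) -> str:
    # Map a TTS engine's peak memory usage (MB) to the deployment
    # buckets used in this benchmark.
    if peak_mb < 30:
        return "all platforms (embedded, web, mobile, desktop)"
    if peak_mb <= 600:
        return "mid-range mobile and desktop"
    if peak_mb <= 750:
        return "high-end mobile and desktop"
    return "desktop/server only"


# Peak-memory figures reported in this benchmark (MB):
print(deployment_bucket(29))    # Picovoice Orca
print(deployment_bucket(320))   # Kitten TTS Nano
print(deployment_bucket(7500))  # Chatterbox-TTS-Turbo
```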
Model Size
Model size affects application download size and storage requirements — particularly important for mobile and web applications, where lean binaries improve the user's first experience.
Orca's 7 MB model is 425× smaller than Chatterbox-TTS-Turbo's 2,980 MB, the largest gap of any metric in this benchmark. Small model size enables over-the-air updates, reduces first-launch download size, and fits easily on storage-constrained hardware such as Raspberry Pi and older mobile devices.

Smallest Model Size: Picovoice Orca (7 MB)
Largest Model Size: Chatterbox-TTS-Turbo (2,980 MB)
Across all three efficiency metrics, Picovoice Orca places first with a 2–11× lead over the next-best alternative. No other engine leads on more than one efficiency metric.
Audio Quality
Audio quality can be evaluated through direct listening comparison using the same test utterances across all engines. Since naturalness and audio quality are subjective, no automated metric is used. Sample audio files are grouped by peak memory usage, which is the primary constraint determining which engines are deployable on a given device. An engine with excellent audio quality but a 2 GB memory footprint is not a viable choice for mobile or embedded applications, regardless of how natural it sounds.
Group 1: Peak Memory Usage < 30 MB
Picovoice Orca and ESpeak-NG require less than 30 MB of memory while running, making them suitable for all types of applications and platforms, including resource-constrained embedded systems and web browsers.
Picovoice Orca
ESpeak-NG
Group 2: Peak Memory Usage 300 MB – 600 MB
Kitten-TTS-Nano-0.8-INT8 and Supertonic-TTS-2 fall into this category. They are suitable for mid-range (and better) mobile devices and desktop applications, but they exceed the memory headroom of typical embedded targets.
Kitten-TTS-Nano-0.8-INT8
Supertonic-TTS-2
Group 3: Peak Memory Usage 600 MB – 750 MB
The TTS engines in this group, Pocket-TTS and Soprano-TTS, can run on high-end mobile devices and standard desktop environments. They are not a fit for legacy or mid-range mobile devices or for embedded systems.
Pocket-TTS
Soprano-TTS
Group 4: Peak Memory Usage > 1 GB
This group of TTS engines, Kokoro-TTS, Neu-TTS-Nano-Q4-GGUF, Piper-TTS, and Chatterbox-TTS-Turbo, requires powerful hardware; in real-time applications running on average consumer devices, they may degrade the UX.
Kokoro-TTS
Neu-TTS-Nano-Q4-GGUF
Piper-TTS
Chatterbox-TTS-Turbo
Platform Support
Platform support covers the operating systems, runtimes, and hardware targets officially supported by each engine. Picovoice Orca is the only engine in this benchmark with native production SDKs across Linux, macOS, Windows, Android, iOS, Raspberry Pi 3/4/5, and all major browsers simultaneously.
Cloud TTS APIs (Amazon Polly, Azure TTS, ElevenLabs TTS, ElevenLabs Streaming, OpenAI TTS) are excluded — they run on vendor infrastructure and have no on-device deployment.
| Platform | Picovoice Orca | Kitten TTS Nano INT8 | Supertonic TTS 2 | Neu TTS Nano Q4 | Kokoro TTS | Chatterbox Turbo | ESpeak NG | Pocket TTS | Piper TTS | Soprano TTS |
|---|---|---|---|---|---|---|---|---|---|---|
| Desktop | ||||||||||
| Linux (x86_64) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| macOS (x86_64) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| macOS (arm64) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Windows (x86_64) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Windows (arm64) | ✓ | ✓ | ✓ | ✓ | – | ✓ | – | – | – | – |
| Mobile | ||||||||||
| Android | ✓ | ✓ | ✓ | ✓ | – | – | ✓ | – | – | – |
| iOS | ✓ | ✓ | ✓ | ✓ | – | – | – | – | – | – |
| Embedded | ||||||||||
| Raspberry Pi 3 | ✓ | ✓ | ✓ | – | – | – | – | – | – | – |
| Raspberry Pi 4 | ✓ | ✓ | ✓ | – | – | – | – | – | – | – |
| Raspberry Pi 5 | ✓ | ✓ | ✓ | – | – | – | – | – | – | – |
| Browsers | ||||||||||
| Chrome | ✓ | ✓ | ✓ | – | ✓ | – | – | – | – | – |
| Safari | ✓ | ✓ | ✓ | – | ✓ | – | – | – | – | – |
| Firefox | ✓ | ✓ | ✓ | – | ✓ | – | – | – | – | – |
| Edge | ✓ | ✓ | ✓ | – | ✓ | – | – | – | – | – |
Usage
The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents: