Text-to-Speech Benchmark
TLDR: Orca Streaming Text-to-Speech is the fastest text-to-speech engine for voice AI agents, with 127ms first-token-to-speech latency, 2.6× faster than the best cloud alternative (ElevenLabs Streaming at 335ms), ~11× faster than the fastest on-device open-source alternative (ESpeak-NG at 1,400ms), 22× faster than OpenAI TTS, and 346× faster than Chatterbox-TTS-Turbo.
Orca is also the most efficient on-device TTS engine: 29MB peak memory (320MB–7,500MB for alternatives), 0.16 core-hour ratio (119× lower than Chatterbox-TTS-Turbo), and a 7 MB model, 425× smaller than Chatterbox-TTS-Turbo.
This open-source TTS benchmark evaluates 15 text-to-speech (TTS) engines on the metrics that matter most for production voice AI agents:
- Latency — First-token-to-speech, voice assistant response time
- Resource utilization — Core-hour ratio, memory footprint, model size
- Audio quality — Subjective naturalness and human-like prosody
TTS Engines Compared
Five cloud API configurations (four providers; ElevenLabs is tested in two modes) and ten on-device engines, organized below by deployment model:
- Cloud APIs:
  - Amazon Polly
  - Azure Text-to-Speech
  - ElevenLabs, tested in two configurations:
    - With streaming audio output only
    - With streaming input using the WebSocket API
  - OpenAI TTS
- On-device TTS engines running on CPU:
  - Picovoice Orca Streaming Text-to-Speech
  - Chatterbox-TTS-Turbo
  - ESpeak-NG
  - Kitten-TTS-Nano-0.8-INT8
  - Kokoro-TTS
  - Neu-TTS-Nano-Q4-GGUF
  - Piper-TTS
  - Pocket-TTS
  - Soprano-TTS
  - Supertonic-TTS-2
Methodology
This benchmark simulates user–voice AI agent interactions by generating LLM responses to user questions and synthesizing each response to speech as soon as possible. User queries are sampled from a public dataset and fed to picoLLM (llama-3.2-1b-instruct-385). picoLLM generates responses token by token; those tokens are passed to each engine to measure response time and resource utilization.
Engines
TTS engines differ in how they handle input and output streaming, a key factor in voice AI agent latency. Most engines in this benchmark support streaming audio output and can play audio incrementally. The exceptions are Chatterbox-TTS-Turbo and Kitten-TTS-Nano, which must synthesize the entire audio before playback can begin. (See single synthesis.)
Only ElevenLabs and Orca support both streaming input and streaming audio output. They start synthesis without waiting for the full text input and play audio incrementally as speech is synthesized. (See streaming synthesis.) ElevenLabs supports streaming input via WebSocket by chunking text at punctuation marks and sending pre-analyzed chunks to the engine. Picovoice Orca Streaming Text-to-Speech processes raw LLM tokens as soon as they are produced, without requiring special language markers.
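The impact of the two synthesis models on first-audio latency can be sketched with a toy simulation. Everything below is a made-up stand-in, not any engine's real API: the token list, the per-character synthesis cost, and the function names are all hypothetical.

```python
import time

TOKENS = ["Your", " flight", " departs", " at", " 9", " a.m."]
COST_PER_CHAR = 0.002  # hypothetical synthesis cost in seconds per character


def single_synthesis_first_audio(tokens):
    # Single synthesis: the engine needs the full text before any audio exists,
    # so the first audio byte appears only after the whole utterance is processed.
    text = "".join(tokens)
    time.sleep(COST_PER_CHAR * len(text))
    # first audio byte is available only now


def streaming_synthesis_first_audio(tokens):
    # Streaming synthesis: audio for the first token is playable immediately;
    # the remaining tokens are synthesized while earlier audio plays back.
    time.sleep(COST_PER_CHAR * len(tokens[0]))
    # first audio byte is available here


t0 = time.perf_counter()
single_synthesis_first_audio(TOKENS)
single_ms = (time.perf_counter() - t0) * 1000

t0 = time.perf_counter()
streaming_synthesis_first_audio(TOKENS)
streaming_ms = (time.perf_counter() - t0) * 1000

print(f"single: {single_ms:.0f} ms, streaming: {streaming_ms:.0f} ms")
```

The gap grows with utterance length: single synthesis pays for every character up front, while streaming synthesis pays only for the first token before audio starts.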
Dataset
User queries are drawn from the public taskmaster2 dataset, which contains goal-oriented conversations between users and AI agents. Queries cover flight booking, food ordering, hotel booking, movies, music recommendations, restaurant search, and sports. The LLM is prompted to respond like a helpful voice AI agent, with responses varying from a few words to several sentences to reflect a realistic range of voice assistant interactions.
Latency Metrics
All latency metrics are time-to-first-output measurements: the time from a triggering event until the first byte of the resulting output is received. The three metrics below isolate different parts of the voice agent pipeline. Lower is better.
- Voice Assistant Response Time (VART): The time from the moment the user's request is sent to the LLM until the TTS engine produces the first byte of speech. This is the end-to-end latency users perceive in a live voice assistant interaction. VART = TTFT + FTTS.
- First Token To Speech (FTTS): The time from the LLM's first output token to the TTS engine's first byte of speech audio. FTTS is the most direct measure of TTS engine responsiveness, holding LLM behavior constant across all experiments to isolate the TTS contribution.
- Time to First Token (TTFT): The time from the user's request to the LLM's first output token. TTFT depends on the inference engine, language model, network latency, and prompt. These variables exist outside the TTS engine, so this benchmark uses picoLLM, which provides a guaranteed response time, to minimize TTFT variance across experiments and isolate TTS performance.
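The relationship between the three metrics can be checked with a toy calculation. The engine names and millisecond values here are illustrative placeholders, not measured benchmark results:

```python
def vart(ttft_ms: float, ftts_ms: float) -> float:
    # End-to-end response time is the LLM's time to first token
    # plus the TTS engine's first-token-to-speech latency.
    return ttft_ms + ftts_ms


# With TTFT held constant across experiments, differences in VART
# come from FTTS alone, isolating the TTS engine's contribution.
ttft = 70.0  # hypothetical constant LLM latency (ms)
engine_ftts = {"engine_a": 130.0, "engine_b": 335.0}
engine_vart = {name: vart(ttft, f) for name, f in engine_ftts.items()}
print(engine_vart)  # {'engine_a': 200.0, 'engine_b': 405.0}
```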
Efficiency Metrics
Cloud TTS engines are excluded from efficiency metrics. Unlike on-device alternatives, cloud TTS APIs have no downloadable model and their compute and memory run on vendor infrastructure, preventing independent measurement. Lower is better.
- CPU Core Hour Ratio: The number of CPU core-hours required to synthesize one hour of audio. A ratio of 1.0 means the engine fully occupies one CPU core to synthesize in real-time. A ratio below 1.0 means the engine can synthesize faster than real-time on a single core; above 1.0 means it cannot.
- Peak Memory (RAM) Usage: Maximum RAM consumed by the TTS engine during synthesis, excluding LLM inference and Python setup overhead.
- Memory Usage vs. Available Memory: Memory availability is not the same as total device RAM. Background services (SSH, networking, logging) consume memory before any application starts. As a practical guideline, a TTS engine used in a real-time voice AI application should be treated as part of the total app memory budget, which on mobile typically should not exceed 150–200 MB on low-end devices to avoid out-of-memory (OOM) termination. Both Android and iOS use low-memory killers that terminate processes when free memory falls below a threshold, and apps consuming more memory are killed first.
- Model Size: Total binary file size required to run the engine, excluding common Python packages such as PyTorch. Includes the grapheme-to-phoneme (G2P) component where required (e.g., espeak-ng, misaki). Model size affects application download size and over-the-air update feasibility, which is critical for mobile and web deployments.
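A minimal sketch of how the core-hour ratio is computed; the timing numbers below are made up for illustration and are not measurements from this benchmark:

```python
def core_hour_ratio(cpu_seconds: float, audio_seconds: float) -> float:
    # CPU core-hours consumed per hour of synthesized audio.
    # 1.0  -> one fully occupied core synthesizes exactly in real time
    # <1.0 -> faster than real time on a single core
    # >1.0 -> cannot keep up with real time on a single core
    return cpu_seconds / audio_seconds


# Illustrative: 96 CPU-seconds to synthesize 10 minutes (600 s) of audio.
ratio = core_hour_ratio(cpu_seconds=96.0, audio_seconds=600.0)
print(ratio)  # 0.16 -> faster than real time
```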
Results
The Open-Source TTS Benchmark figures below report each engine's response time and efficiency, averaged over roughly 200 simulated user–voice AI agent interactions.
Latency
The two charts below present voice AI latency from two angles: end-to-end response time (VART) and the TTS engine's portion of the pipeline (FTTS).
Voice Assistant Response Time
Voice Assistant Response Time (VART) measures the end-to-end latency from user request to first audio output. VART is the latency users actually experience, as they perceive the total delay until the AI agent starts responding to them. According to turn-taking research in spoken dialogue systems, 200ms feels instantaneous; above 1,000ms (1 second), the interaction feels broken.
Only Picovoice Orca achieves an instantaneous-feeling response at the 200ms threshold. ElevenLabs Streaming is the only other engine under the 1,000ms "natural interaction" threshold. ESpeak-NG, at 1,500ms, is the second-best on-device engine after Orca, matching the second-best cloud engine, ElevenLabs TTS.
Chatterbox-TTS-Turbo (44.8s) and Kitten TTS Nano (10.7s) introduce delays of 10–45 seconds because they lack streaming output, making them effectively unusable for real-time voice.

Lowest Voice Assistant Response Time: Picovoice Orca (200 ms)
Highest Voice Assistant Response Time (Cloud): OpenAI Text-to-Speech (2,930 ms)
Highest Voice Assistant Response Time (On-device): Chatterbox-TTS-Turbo (44,800 ms)
First Token to Speech
First Token to Speech Latency (FTTS) measures the time from the LLM's first output token to the first byte of speech audio. It’s the cleanest comparison of TTS engine responsiveness, removing the LLM factor from VART.
Picovoice Orca leads with 127ms, 2.6× faster than the nearest competitor (ElevenLabs Streaming at 335ms) and ~11× faster than both standard cloud APIs and ESpeak-NG (1,400ms), the fastest on-device open-source alternative.

Lowest First Token to Speech Latency: Picovoice Orca (130 ms)
Highest First Token to Speech Latency (Cloud): OpenAI Text-to-Speech (2,850 ms)
Highest First Token to Speech Latency (On-device): Chatterbox-TTS-Turbo (44,710 ms)
FTTS and VART differ only by TTFT (time to first LLM token), which is held constant across all engines. The near-identical rankings confirm that the TTS engine, not the LLM, is the dominant contributor to voice agent latency.
Efficiency
All efficiency metrics are measured on AMD Ryzen 7 5700X with 64GB RAM.
CPU Core Hour Ratio
CPU Core Hour Ratio measures the compute cost required to synthesize one hour of audio. Picovoice Orca demonstrates unmatched efficiency at 0.16 core-hours: the closest alternative, Pocket-TTS, requires more than 2.3× the resources, and Chatterbox-TTS-Turbo requires nearly 120× the resources Orca does.
Put concretely: Orca synthesizes an hour of speech using just 9.6 CPU-minutes, occupying 16% of one core for real-time playback. Chatterbox-TTS-Turbo needs 19 full CPU cores at 100% utilization to produce the same output.
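The arithmetic follows directly from the reported ratios:

```python
# Core-hour ratio x 60 gives CPU-minutes per hour of synthesized audio.
orca_ratio = 0.16
chatterbox_ratio = 19.0

orca_cpu_minutes = orca_ratio * 60        # 9.6 CPU-minutes per audio hour
chatterbox_cores = chatterbox_ratio       # 19 cores fully occupied for real time
speedup = chatterbox_ratio / orca_ratio   # ~119x difference

print(round(orca_cpu_minutes, 1), round(speedup))  # 9.6 119
```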

Lowest CPU Utilization: Picovoice Orca (0.16x)
Highest CPU Utilization: Chatterbox-TTS-Turbo (19x)
Peak Memory Usage
Peak memory usage shows the maximum memory consumption during audio synthesis.
Picovoice Orca uses 29MB peak memory, suitable for any deployment environment. Kitten TTS Nano, the second most efficient alternative, requires 320MB (~11× Orca) and is viable for certain mobile devices, whereas Chatterbox-TTS-Turbo at 7,500MB (7.5GB, ~260× Orca) is not suitable for resource-constrained environments.

Lowest Peak Memory: Picovoice Orca (29 MB)
Highest Peak Memory: Chatterbox-TTS-Turbo (7,500 MB)
Engines fall into four deployment buckets based on peak memory: <30 MB (suitable for all platforms, including embedded and web), 300–600 MB (mid-range mobile and desktop), 600–750 MB (high-end mobile only), and >1 GB (desktop/server only). Sample audio for each bucket is compared in the Audio Quality section below.
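The bucketing can be expressed as a small helper. The thresholds mirror the four groups used in this benchmark, and the example peak-memory figures are the ones reported above; the function itself is an illustrative sketch, not benchmark code.

```python
def deployment_bucket(peak_mb: float) -> str:
    # Map a TTS engine's peak memory usage (MB) to the deployment
    # buckets used in this benchmark.
    if peak_mb < 30:
        return "all platforms (embedded, web, mobile, desktop)"
    if peak_mb <= 600:
        return "mid-range mobile and desktop"
    if peak_mb <= 750:
        return "high-end mobile and desktop"
    return "desktop/server only"


# Peak-memory figures reported in this benchmark (MB):
print(deployment_bucket(29))    # Picovoice Orca
print(deployment_bucket(320))   # Kitten TTS Nano
print(deployment_bucket(7500))  # Chatterbox-TTS-Turbo
```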
Model Size
Model size affects application download size and storage requirements — particularly important for mobile and web applications, where lean binaries improve the user's first experience.
Orca's 7 MB model is 425× smaller than Chatterbox-TTS-Turbo's 2,980 MB, the largest gap of any metric in this benchmark. Small model size enables over-the-air updates, reduces first-launch download size, and fits easily on storage-constrained hardware such as Raspberry Pi and older mobile devices.

Smallest Model Size: Picovoice Orca (7 MB)
Largest Model Size: Chatterbox-TTS-Turbo (2,980 MB)
Across all three efficiency metrics, Picovoice Orca places first with a 2–11× lead over the next-best alternative. No other engine leads on more than one efficiency metric.
Audio Quality
Audio quality can be evaluated through direct listening comparison using the same test utterances across all engines. Since naturalness and audio quality are subjective, no automated metric is used. Sample audio files are grouped by peak memory usage, which is the primary constraint determining which engines are deployable on a given device. An engine with excellent audio quality but a 2 GB memory footprint is not a viable choice for mobile or embedded applications, regardless of how natural it sounds.
Group 1: Peak Memory Usage < 30 MB
Picovoice Orca and ESpeak-NG require less than 30 MB of memory while running, making them suitable for all types of applications and platforms, including resource-constrained embedded systems and web browsers.
Picovoice Orca
ESpeak-NG
Group 2: Peak Memory Usage 300 MB – 600 MB
Kitten-TTS-Nano-0.8-INT8 and Supertonic-TTS-2 fall into this category. They are suitable for mid-range (and better) mobile devices and desktop applications, but they exceed the memory headroom of typical embedded targets.
Kitten-TTS-Nano-0.8-INT8
Supertonic-TTS-2
Group 3: Peak Memory Usage 600 MB – 750 MB
The TTS engines in this group, Pocket-TTS and Soprano-TTS, can run on high-end mobile devices and standard desktop environments. They are not a fit for legacy or mid-range mobile devices or for embedded systems.
Pocket-TTS
Soprano-TTS
Group 4: Peak Memory Usage > 1 GB
This group of TTS engines, Kokoro-TTS, Neu-TTS-Nano-Q4-GGUF, Piper-TTS, and Chatterbox-TTS-Turbo, requires powerful hardware; in real-time applications running on average consumer devices, they may degrade the UX.
Kokoro-TTS
Neu-TTS-Nano-Q4-GGUF
Piper-TTS
Chatterbox-TTS-Turbo
Platform Support
Platform support covers the operating systems, runtimes, and hardware targets officially supported by each engine. Picovoice Orca is the only engine in this benchmark with native production SDKs across Linux, macOS, Windows, Android, iOS, Raspberry Pi 3/4/5, and all major browsers simultaneously.
Cloud TTS APIs (Amazon Polly, Azure TTS, ElevenLabs TTS, ElevenLabs Streaming, OpenAI TTS) are excluded — they run on vendor infrastructure and have no on-device deployment.
| Platform | Picovoice Orca | Kitten TTS Nano INT8 | Supertonic TTS 2 | Neu TTS Nano Q4 | Kokoro TTS | Chatterbox Turbo | ESpeak NG | Pocket TTS | Piper TTS | Soprano TTS |
|---|---|---|---|---|---|---|---|---|---|---|
| Desktop | ||||||||||
| Linux (x86_64) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| macOS (x86_64) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| macOS (arm64) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Windows (x86_64) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Windows (arm64) | ✓ | ✓ | ✓ | ✓ | – | ✓ | – | – | – | – |
| Mobile | ||||||||||
| Android | ✓ | ✓ | ✓ | ✓ | – | – | ✓ | – | – | – |
| iOS | ✓ | ✓ | ✓ | ✓ | – | – | – | – | – | – |
| Embedded | ||||||||||
| Raspberry Pi 3 | ✓ | ✓ | ✓ | – | – | – | – | – | – | – |
| Raspberry Pi 4 | ✓ | ✓ | ✓ | – | – | – | – | – | – | – |
| Raspberry Pi 5 | ✓ | ✓ | ✓ | – | – | – | – | – | – | – |
| Browsers | ||||||||||
| Chrome | ✓ | ✓ | ✓ | – | ✓ | – | – | – | – | – |
| Safari | ✓ | ✓ | ✓ | – | ✓ | – | – | – | – | – |
| Firefox | ✓ | ✓ | ✓ | – | ✓ | – | – | – | – | – |
| Edge | ✓ | ✓ | ✓ | – | ✓ | – | – | – | – | – |
Usage
The data and code used to create this benchmark are available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents: