🚀 Best-in-class Voice AI!
Build compliant and low-latency AI applications running entirely on mobile without sharing user data with 3rd parties.
Start Free

Building real-time voice interfaces on iOS presents a significant challenge for enterprise developers. A major limitation on iOS is that Apple's native AVSpeechSynthesizer cannot process streaming text input—it requires the full text before generating any speech. This makes it incapable of properly handling token-by-token or partial LLM outputs, which are essential for dual-streaming scenarios where text and audio are generated concurrently.

Cloud-based Text-to-Speech (TTS) services like Amazon Polly, Azure TTS, ElevenLabs TTS, and OpenAI TTS partially address this, but they often introduce delays approaching 2,000 ms that break conversational flow and make reading live outputs from large language models (LLMs) feel sluggish. Cloud dependency also rules out applications that require strict data privacy or must keep working without a reliable internet connection.

The solution is on-device, streaming speech synthesis with Orca Streaming Text-to-Speech. Orca is capable of processing streaming input—producing incremental audio as text arrives, enabling near-instant responses and natural conversational latency. On-device, streaming voice generation is ideal for real-time assistants, accessibility tools, live translation, and any application that needs to narrate LLM output as it streams—such as from picoLLM On-device LLM Inference.

What you'll build:

This tutorial demonstrates how to implement an iOS Streaming Text-to-Speech system using the Orca Streaming Text-to-Speech iOS SDK for voice generation combined with AVAudioEngine for real-time audio playback.

Key benefits for enterprise developers:

  • Ultra-low latency: Audio plays immediately as text is processed, rather than waiting for the full input
  • On-device processing: Sensitive data stays on-device, and applications maintain performance even in environments with spotty internet
  • LLM-ready: Stream real-time voice directly from language model outputs

How to Implement Streaming TTS on iOS

Prerequisites

Before starting, ensure you have:

  • A Picovoice AccessKey (create one for free on the Picovoice Console)
  • Xcode and an iOS device or simulator
  • Basic familiarity with Swift and SwiftUI

1. Add Orca Library and Model File

Create a SwiftUI project in Xcode. This tutorial uses ContentView.swift as the main interface for the application.

1a. Add Orca to Your Project

Use Swift Package Manager:
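A sketch of the Xcode flow, assuming Orca is distributed as a Swift package from its GitHub repository: select File → Add Package Dependencies, enter the Orca repository URL, and add the Orca library to your app target.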

Or use CocoaPods:
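A minimal Podfile sketch; the pod name Orca-iOS follows Picovoice's SDK naming, but verify it against the current documentation:

```ruby
# Podfile
target 'YourApp' do
  pod 'Orca-iOS'
end
```

Run pod install afterwards and open the generated .xcworkspace.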

1b. Add Your Orca Model File

Orca uses model files (.pv) for different languages and voices.

  1. Download the desired model from the Orca GitHub repository. Filenames indicate language and speaker gender.
  2. Add the file as a bundled resource: Build Phases → Copy Bundle Resources.

2. Implement Voice Generation with Orca

2a. Initialize Orca

Initialize an instance of Orca with your AccessKey and model file:
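A minimal sketch, assuming the Orca(accessKey:modelPath:) initializer from the SDK; ${ACCESS_KEY} and ${ORCA_MODEL_FILE} are placeholders for your own values:

```swift
import Orca

do {
    // Resolve the model file bundled in step 1b (resource name without the .pv extension).
    guard let modelPath = Bundle.main.path(forResource: "${ORCA_MODEL_FILE}", ofType: "pv") else {
        fatalError("Orca model file not found in the app bundle")
    }
    let orca = try Orca(accessKey: "${ACCESS_KEY}", modelPath: modelPath)
} catch {
    print("Failed to initialize Orca: \(error)")
}
```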

2b. Open a Streaming Instance

Create an OrcaStream object to prepare for streaming synthesis:
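A one-line sketch, assuming the streamOpen() method from the Orca SDK:

```swift
// Opens a stream that accepts incremental text and emits PCM chunks.
let orcaStream = try orca.streamOpen()
```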

2c. Set up Thread-safe PCM Queue

In later steps, we'll set up our audio pipeline so that speech synthesis runs on one thread while audio playback runs on another. This allows playback to start as soon as PCM data generated by Orca becomes available.

To prepare for this, create a thread-safe queue to safely pass PCM data between these threads:
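One way to do this is a small FIFO guarded by a lock; a minimal sketch (any thread-safe queue works):

```swift
import Foundation

/// Thread-safe FIFO passing PCM chunks from the synthesis thread to the playback thread.
final class PCMQueue {
    private var chunks: [[Int16]] = []
    private let lock = NSLock()

    func enqueue(_ pcm: [Int16]) {
        lock.lock()
        defer { lock.unlock() }
        chunks.append(pcm)
    }

    func dequeue() -> [Int16]? {
        lock.lock()
        defer { lock.unlock() }
        return chunks.isEmpty ? nil : chunks.removeFirst()
    }
}
```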

2d. Synthesize Text in Chunks

Pass text to stream.synthesize() incrementally, as chunks become available:
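A sketch, assuming a textChunks sequence (for example, tokens arriving from an LLM) and the pcmQueue from step 2c:

```swift
// Feed text to Orca as it arrives; enqueue whatever PCM it returns.
for chunk in textChunks {
    if let pcm = try orcaStream.synthesize(text: chunk) {
        pcmQueue.enqueue(pcm)
    }
}

// Flush once the input ends so any remaining buffered text is synthesized.
if let pcm = try orcaStream.flush() {
    pcmQueue.enqueue(pcm)
}
```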

OrcaStream automatically buffers small chunks of text until it has enough context to synthesize speech audio.

  • synthesize() returns nil if Orca needs more text to generate audio.
  • Call flush() after passing all text to ensure that any remaining buffered text is synthesized.
  • PCM audio chunks are added to a queue for playback, allowing the audio to be played while more text is being synthesized.

3. Audio Playback with AVAudioEngine

Orca outputs mono, 16-bit PCM at a sample rate of 22050 Hz. On iOS, the following components enable real-time playback:

3a. Configure Audio Session

Set up AVAudioSession for audio playback.
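A minimal sketch using the shared AVAudioSession:

```swift
import AVFoundation

let session = AVAudioSession.sharedInstance()
try session.setCategory(.playback, mode: .default, options: [])
try session.setActive(true)
```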

3b. Schedule PCM Buffers

Incrementally feed PCM buffers into AVAudioPlayerNode for real-time audio playback.
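A sketch of one common approach: convert Orca's 16-bit samples to Float32, the format AVAudioEngine's mixer handles natively, and schedule them on an AVAudioPlayerNode:

```swift
import AVFoundation

let engine = AVAudioEngine()
let playerNode = AVAudioPlayerNode()

// Float32 format matching Orca's output: mono, 22050 Hz.
let format = AVAudioFormat(
    commonFormat: .pcmFormatFloat32, sampleRate: 22050, channels: 1, interleaved: false)!

engine.attach(playerNode)
engine.connect(playerNode, to: engine.mainMixerNode, format: format)
try engine.start()
playerNode.play()

/// Wraps one PCM chunk in an AVAudioPCMBuffer and schedules it for playback.
func schedule(_ pcm: [Int16]) {
    guard let buffer = AVAudioPCMBuffer(
        pcmFormat: format, frameCapacity: AVAudioFrameCount(pcm.count)) else { return }
    buffer.frameLength = AVAudioFrameCount(pcm.count)
    let samples = buffer.floatChannelData![0]
    for (i, sample) in pcm.enumerated() {
        samples[i] = Float(sample) / Float(Int16.max)  // normalize to [-1, 1]
    }
    playerNode.scheduleBuffer(buffer, completionHandler: nil)
}
```

On the playback thread, dequeue chunks from the PCM queue and pass them to schedule(_:) as they become available.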


4. Stop & Clean Up Resources

When done with audio streaming, clean up resources to prevent memory leaks:
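A sketch, assuming the close() and delete() teardown methods from the Orca SDK:

```swift
playerNode.stop()
engine.stop()

orcaStream.close()  // release the streaming instance
orca.delete()       // release native Orca resources
```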


Complete SwiftUI Example: On-device TTS

The following SwiftUI view demonstrates:

  • Initializing Orca
  • Streaming TTS from a text field
  • Incremental PCM playback
  • Thread-safe PCM queue management

Replace ${ORCA_MODEL_FILE} with your model file (.pv) and ${ACCESS_KEY} with your Picovoice AccessKey.
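Below is a condensed sketch of such a view, assembled from the earlier steps. The Orca calls (streamOpen, synthesize, flush, close, delete) are the ones used above; the word-by-word split merely simulates incremental input such as LLM tokens, and for brevity buffers are scheduled directly on the player node (which keeps its own internal queue) rather than routed through the PCMQueue from step 2c:

```swift
import SwiftUI
import AVFoundation
import Orca

final class StreamingTTS: ObservableObject {
    private var orca: Orca?
    private let engine = AVAudioEngine()
    private let playerNode = AVAudioPlayerNode()
    private let format = AVAudioFormat(
        commonFormat: .pcmFormatFloat32, sampleRate: 22050, channels: 1, interleaved: false)!

    init() {
        do {
            // Model file bundled in step 1b (resource name without the .pv extension).
            guard let modelPath = Bundle.main.path(
                forResource: "${ORCA_MODEL_FILE}", ofType: "pv") else { return }
            orca = try Orca(accessKey: "${ACCESS_KEY}", modelPath: modelPath)

            try AVAudioSession.sharedInstance().setCategory(.playback, mode: .default, options: [])
            try AVAudioSession.sharedInstance().setActive(true)

            engine.attach(playerNode)
            engine.connect(playerNode, to: engine.mainMixerNode, format: format)
            try engine.start()
            playerNode.play()
        } catch {
            print("Initialization failed: \(error)")
        }
    }

    func speak(_ text: String) {
        guard let orca = orca else { return }
        // Synthesize on a background thread; playback continues concurrently.
        DispatchQueue.global(qos: .userInitiated).async { [weak self] in
            guard let self = self else { return }
            do {
                let stream = try orca.streamOpen()
                // Simulate streaming input by feeding the text word by word.
                for word in text.split(separator: " ") {
                    if let pcm = try stream.synthesize(text: "\(word) ") {
                        self.schedule(pcm)
                    }
                }
                if let pcm = try stream.flush() {
                    self.schedule(pcm)
                }
                stream.close()
            } catch {
                print("Synthesis failed: \(error)")
            }
        }
    }

    private func schedule(_ pcm: [Int16]) {
        guard let buffer = AVAudioPCMBuffer(
            pcmFormat: format, frameCapacity: AVAudioFrameCount(pcm.count)) else { return }
        buffer.frameLength = AVAudioFrameCount(pcm.count)
        let samples = buffer.floatChannelData![0]
        for (i, sample) in pcm.enumerated() {
            samples[i] = Float(sample) / Float(Int16.max)
        }
        playerNode.scheduleBuffer(buffer, completionHandler: nil)
    }
}

struct ContentView: View {
    @StateObject private var tts = StreamingTTS()
    @State private var text = "On-device streaming text-to-speech keeps data private."

    var body: some View {
        VStack(spacing: 16) {
            TextField("Enter text to speak", text: $text)
                .textFieldStyle(.roundedBorder)
            Button("Speak") { tts.speak(text) }
        }
        .padding()
    }
}
```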

For a complete iOS application, see the Orca Streaming Text-to-Speech iOS demo on GitHub.

Explore our documentation for more details on the Orca Streaming Text-to-Speech API.

Troubleshooting

  • Initialization fails: Ensure the model file has been correctly bundled as a resource via Build Phases → Copy Bundle Resources.
  • No audio output: Verify your device's volume and audio routing, and confirm that the AVAudioFormat sample rate and channel configuration match Orca's output (mono, 16-bit PCM at 22050 Hz).
  • Latency or gaps in streaming: Keep the PCM queue drained promptly, pass text chunks to synthesize() as soon as they are available, and call flush() when the input stream completes.

Next Steps

Optimize Streaming TTS on iOS in Production

  • Audio focus: Ensure your app handles interruptions smoothly with AVAudioSession
  • Threading: Cancel synthesis tasks on view dismissal; clear audio queues to prevent playback after exit
  • Error handling: Display user-friendly errors and log failures for analytics
  • Multi-language support: Use multiple model files for different voices/languages
  • Custom pronunciations: Orca Streaming TTS supports custom pronunciations

Expand Your Application

With Orca Streaming Text-to-Speech and AVAudioEngine, iOS developers can implement secure, low-latency streaming TTS, suitable for enterprise apps, accessibility, and live LLM voice output.

Start Free

Frequently Asked Questions

Why is low latency important for TTS with LLMs?
Low-latency TTS ensures that speech playback starts almost immediately, creating a natural conversational experience when reading large language model outputs or interactive chat responses.
Can I use streaming TTS for multi-language iOS applications?
Yes, by loading different model files, you can support multiple languages and voices, enabling real-time TTS across diverse user bases.
How does incremental audio playback work?
Incremental playback streams small chunks of audio as text is processed, allowing the application to speak immediately without waiting for the full text input.
Is this approach suitable for accessibility features?
Absolutely. Streaming, low-latency TTS provides real-time audio feedback, which is ideal for accessibility tools such as screen readers or assistive voice interfaces.