Building real-time voice interfaces on iOS presents a significant challenge for enterprise developers. Apple's native AVSpeechSynthesizer cannot process streaming text input: it requires the full text before generating any speech. This makes it incapable of handling token-by-token or partial large language model (LLM) outputs, which are essential for dual-streaming scenarios where text and audio are generated concurrently.
Cloud-based Text-to-Speech (TTS) services like Amazon Polly, Azure TTS, ElevenLabs TTS, and OpenAI TTS partially address this, but they often introduce delays of up to 2000 ms that break conversational flow and make reading live LLM outputs aloud feel sluggish. Cloud dependency also rules out applications with strict data-privacy requirements and ties performance to a reliable internet connection.
The solution is on-device, streaming speech synthesis with Orca Streaming Text-to-Speech. Orca is capable of processing streaming input—producing incremental audio as text arrives, enabling near-instant responses and natural conversational latency. On-device, streaming voice generation is ideal for real-time assistants, accessibility tools, live translation, and any application that needs to narrate LLM output as it streams—such as from picoLLM On-device LLM Inference.
What you'll build:
This tutorial demonstrates how to implement an iOS Streaming Text-to-Speech system using the Orca Streaming Text-to-Speech iOS SDK for voice generation combined with AVAudioEngine for real-time audio playback.
Key benefits for enterprise developers:
- Ultra-low latency: Audio plays immediately as text is processed, rather than waiting for the full input
- On-device processing: Sensitive data stays on-device, and applications maintain performance even in environments with spotty internet
- LLM-ready: Stream real-time voice directly from language model outputs
How to Implement Streaming TTS on iOS
Prerequisites
Before starting, ensure you have:
- Xcode
- iOS device or simulator (iOS 16.0 or higher)
- Swift Package Manager or CocoaPods
- Picovoice Account and AccessKey
1. Add Orca Library and Model File
Create a SwiftUI project in Xcode. This tutorial uses ContentView.swift as the main interface for the application.
1a. Add Orca to Your Project
Use Swift Package Manager:
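In Xcode, go to File → Add Packages… (menu wording varies by Xcode version) and enter the repository URL. If your project uses a Package.swift manifest instead, a minimal sketch looks like this; the URL, version, and product name are assumptions, so verify them against Picovoice's documentation:

```swift
// swift-tools-version:5.7
import PackageDescription

let package = Package(
    name: "StreamingTTSDemo",
    platforms: [.iOS(.v16)],
    dependencies: [
        // Repository URL and version are assumptions; confirm against
        // Picovoice's current documentation before pinning.
        .package(url: "https://github.com/Picovoice/orca.git", from: "1.0.0")
    ],
    targets: [
        .target(
            name: "StreamingTTSDemo",
            dependencies: [.product(name: "Orca", package: "orca")]
        )
    ]
)
```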
Or use CocoaPods:
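With CocoaPods, the pod is expected to be named Orca-iOS (add pod 'Orca-iOS' to your Podfile and run pod install); verify the exact pod name against Picovoice's current documentation.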
1b. Add Your Orca Model File
Orca uses model files (.pv) for different languages and voices.
- Download the desired model from the Orca GitHub repository. Filenames indicate language and speaker gender.
- Add the file as a bundled resource via Build Phases → Copy Bundle Resources.
2. Implement Voice Generation with Orca
2a. Initialize Orca
Initialize an instance of Orca with your AccessKey and model file:
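A minimal sketch, assuming the Swift binding exposes an Orca(accessKey:modelPath:) initializer; replace the placeholders as described at the end of this tutorial:

```swift
import Orca

// Resolve the bundled model; the resource name is your .pv file without the extension.
guard let modelPath = Bundle.main.path(forResource: "${ORCA_MODEL_FILE}", ofType: "pv") else {
    fatalError("Orca model file not found in the app bundle")
}

// ${ACCESS_KEY} is the AccessKey from your Picovoice Console account.
let orca = try Orca(accessKey: "${ACCESS_KEY}", modelPath: modelPath)
```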
2b. Open a Streaming Instance
Create an OrcaStream object to prepare for streaming synthesis:
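Assuming the SDK exposes streamOpen() on the Orca instance:

```swift
// One stream can be reused across all text chunks of a single utterance.
let orcaStream = try orca.streamOpen()
```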
2c. Set up Thread-safe PCM Queue
In later steps, we'll set up our audio pipeline so that speech synthesis runs on one thread while audio playback runs on another. This will allow playback to start as soon as PCM data generated by Orca becomes available.
To prepare for this, create a thread-safe queue to safely pass PCM data between these threads:
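A minimal sketch of such a queue using NSLock; the class name PCMQueue is our own, and any thread-safe FIFO works:

```swift
import Foundation

/// Thread-safe FIFO for 16-bit PCM chunks passed from the synthesis
/// thread to the playback thread.
final class PCMQueue {
    private var chunks: [[Int16]] = []
    private let lock = NSLock()

    func enqueue(_ pcm: [Int16]) {
        lock.lock()
        defer { lock.unlock() }
        chunks.append(pcm)
    }

    /// Returns nil when the queue is empty.
    func dequeue() -> [Int16]? {
        lock.lock()
        defer { lock.unlock() }
        return chunks.isEmpty ? nil : chunks.removeFirst()
    }
}
```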
2d. Synthesize Text in Chunks
Pass text incrementally to stream.synthesize() as chunks become available:
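A sketch of the synthesis loop; the synthesize(text:) and flush() names follow the pattern of other Picovoice Swift SDKs, so check the Orca API docs for the exact signatures:

```swift
let pcmQueue = PCMQueue()

// Stand-in for an incremental text source such as streamed LLM tokens.
let textChunks = ["Streaming ", "text-to-speech ", "on iOS."]

for textChunk in textChunks {
    // synthesize() returns nil until Orca has buffered enough text.
    if let pcm = try orcaStream.synthesize(text: textChunk) {
        pcmQueue.enqueue(pcm)
    }
}

// Flush any text still buffered inside the stream.
if let pcm = try orcaStream.flush() {
    pcmQueue.enqueue(pcm)
}
```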
OrcaStream automatically buffers small chunks of text until it has enough context to synthesize speech audio.
- synthesize() returns nil if Orca needs more text to generate audio.
- Call flush() after passing all text to ensure that any remaining buffered text is synthesized.
- PCM audio chunks are added to a queue for playback, allowing audio to be played while more text is being synthesized.
3. Audio Playback with AVAudioEngine
Orca outputs mono, 16-bit PCM, with a sample rate of 22050 Hz. On iOS, the following components enable real-time playback:
- AVAudioEngine: Audio playback pipeline
- AVAudioPlayerNode: Streams PCM buffers incrementally
- PCMQueue: Thread-safe queue for synthesized audio chunks
3a. Configure Audio Session
Set up AVAudioSession for audio playback.
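For example:

```swift
import AVFoundation

// Configure the shared audio session for playback before starting the engine.
let session = AVAudioSession.sharedInstance()
try session.setCategory(.playback, mode: .default)
try session.setActive(true)
```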
3b. Schedule PCM Buffers
Incrementally feed PCM buffers into AVAudioPlayerNode for real-time audio playback.
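A sketch of the playback side; PCMQueue is the queue from step 2c, and converting Int16 samples to Float32 is our choice here, since AVAudioEngine's mixer processes deinterleaved float audio:

```swift
import AVFoundation

let engine = AVAudioEngine()
let playerNode = AVAudioPlayerNode()

// Match Orca's output: mono at 22050 Hz, scheduled as Float32 buffers.
let playbackFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                   sampleRate: 22050,
                                   channels: 1,
                                   interleaved: false)!

engine.attach(playerNode)
engine.connect(playerNode, to: engine.mainMixerNode, format: playbackFormat)
try engine.start()
playerNode.play()

/// Converts one Int16 PCM chunk to a Float32 buffer and schedules it on the player.
func schedule(_ pcm: [Int16]) {
    let frameCount = AVAudioFrameCount(pcm.count)
    guard let buffer = AVAudioPCMBuffer(pcmFormat: playbackFormat,
                                        frameCapacity: frameCount) else { return }
    buffer.frameLength = frameCount
    let samples = buffer.floatChannelData![0]
    for i in 0..<pcm.count {
        samples[i] = Float(pcm[i]) / Float(Int16.max)  // scale to [-1.0, 1.0]
    }
    playerNode.scheduleBuffer(buffer, completionHandler: nil)
}

// Drain the queue on a background thread. Simplified: a production loop
// would wait for new chunks instead of exiting when the queue is empty.
DispatchQueue.global(qos: .userInitiated).async {
    while let pcm = pcmQueue.dequeue() {
        schedule(pcm)
    }
}
```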
4. Stop & Clean Up Resources
When done with audio streaming, clean up resources to prevent memory leaks:
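A sketch, assuming the stream's close() and the engine instance's delete() follow the usual Picovoice Swift pattern for releasing native resources:

```swift
// Stop playback first, then release Orca's native resources.
playerNode.stop()
engine.stop()
orcaStream.close()
orca.delete()
try? AVAudioSession.sharedInstance().setActive(false)
```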
Complete SwiftUI Example: On-device TTS
The following SwiftUI view demonstrates:
- Initializing Orca
- Streaming TTS from a text field
- Incremental PCM playback
- Thread-safe PCM queue management
Replace ${ORCA_MODEL_FILE} with your model file (.pv) and ${ACCESS_KEY} with your Picovoice AccessKey.
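Below is a condensed sketch assembling the snippets above into one view. Orca method names (streamOpen, synthesize(text:), flush, close, delete) are the assumptions stated in the earlier steps; the GitHub demo linked below is the authoritative reference.

```swift
import SwiftUI
import AVFoundation
import Orca

struct ContentView: View {
    @State private var text = "Hello from Orca on iOS!"
    @State private var errorMessage: String?
    @StateObject private var synthesizer = StreamingSynthesizer()

    var body: some View {
        VStack(spacing: 16) {
            TextField("Enter text to speak", text: $text)
                .textFieldStyle(.roundedBorder)
            Button("Speak") {
                do { try synthesizer.speak(text) } catch { errorMessage = "\(error)" }
            }
            if let errorMessage {
                Text(errorMessage).foregroundColor(.red)
            }
        }
        .padding()
    }
}

/// Glues steps 1 to 4 together: Orca synthesis on a background thread,
/// playback through AVAudioEngine as PCM chunks arrive.
final class StreamingSynthesizer: ObservableObject {
    private let engine = AVAudioEngine()
    private let playerNode = AVAudioPlayerNode()
    private let format = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                       sampleRate: 22050,
                                       channels: 1,
                                       interleaved: false)!
    private var orca: Orca?

    init() {
        try? AVAudioSession.sharedInstance().setCategory(.playback)
        try? AVAudioSession.sharedInstance().setActive(true)
        engine.attach(playerNode)
        engine.connect(playerNode, to: engine.mainMixerNode, format: format)
        try? engine.start()
        if let modelPath = Bundle.main.path(forResource: "${ORCA_MODEL_FILE}", ofType: "pv") {
            orca = try? Orca(accessKey: "${ACCESS_KEY}", modelPath: modelPath)
        }
    }

    func speak(_ input: String) throws {
        guard let orca else { return }
        let stream = try orca.streamOpen()
        playerNode.play()
        DispatchQueue.global(qos: .userInitiated).async { [weak self] in
            // Word-sized chunks stand in for a true incremental source (e.g., LLM tokens).
            for word in input.split(separator: " ") {
                if let pcm = try? stream.synthesize(text: String(word) + " ") {
                    self?.schedule(pcm)
                }
            }
            if let pcm = try? stream.flush() {
                self?.schedule(pcm)
            }
            stream.close()
        }
    }

    private func schedule(_ pcm: [Int16]) {
        let frameCount = AVAudioFrameCount(pcm.count)
        guard frameCount > 0,
              let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: frameCount) else { return }
        buffer.frameLength = frameCount
        let samples = buffer.floatChannelData![0]
        for i in 0..<pcm.count {
            samples[i] = Float(pcm[i]) / Float(Int16.max)
        }
        playerNode.scheduleBuffer(buffer, completionHandler: nil)
    }
}
```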
For a complete iOS application, see the Orca Streaming Text-to-Speech iOS demo on GitHub.
Explore our documentation for more details.
Troubleshooting
- Initialization fails: Ensure the model file has been correctly bundled as a resource via Build Phases → Copy Bundle Resources.
- No audio output: Verify your device's volume, audio routing, and that the AVAudioFormat sample rate and channel configuration match Orca's output (mono, 16-bit PCM at 22050 Hz).
- Latency or gaps in streaming: Use proper queue management. Ensure text chunks are passed as soon as they are available and flush() is called when the stream completes.
Next Steps
Optimize Streaming TTS on iOS in Production
- Audio focus: Ensure your app handles interruptions smoothly with AVAudioSession
- Threading: Cancel synthesis tasks on view dismissal; clear audio queues to prevent playback after exit
- Error handling: Display user-friendly errors and log failures for analytics
- Multi-language support: Use multiple model files for different voices/languages
- Custom pronunciations: Orca Streaming Text-to-Speech supports custom pronunciations
Expand Your Application
- Pair streaming TTS with real-time transcription using Cheetah Streaming Speech-to-Text to enable conversational voice interfaces
- Add picoLLM On-device LLM Inference to build enterprise-grade voice assistants
- Integrate with other iOS speech recognition engines to build a complete, end-to-end voice AI application.
With Orca Streaming Text-to-Speech and AVAudioEngine, iOS developers can implement secure, low-latency streaming TTS, suitable for enterprise apps, accessibility, and live LLM voice output.