Streaming Text-to-Speech (TTS) enables Android apps to generate and play audio incrementally as text arrives, which is essential for real-time voice interfaces, accessibility features, and conversational assistants.
A major limitation on Android is that the native TextToSpeech API cannot produce streaming or token-by-token audio. It requires complete text before synthesis, making it unsuitable for real-time applications like handling partial LLM outputs.
Cloud-based TTS services such as Amazon Polly, Azure TTS, ElevenLabs, and OpenAI TTS introduce additional issues: network latency, dependency on connectivity, and privacy concerns. Even the fastest cloud engines can add hundreds to thousands of milliseconds of delay, whereas on-device TTS begins synthesizing immediately and delivers audio 6.5x faster than the closest competitor (ElevenLabs).
The solution is on-device, streaming speech synthesis with Orca Streaming Text-to-Speech. This tutorial demonstrates how to build real-time speech generation on Android using Orca for voice synthesis and Android's AudioTrack API for PCM audio streaming. The approach works with any streaming text source—including live LLM output (ChatGPT, Claude, or picoLLM On-device LLM Inference) or dynamically generated content.
What you'll learn:
- Initialize an on-device TTS engine in Android
- Stream text to the TTS engine to generate real-time speech (PCM data)
- Handle PCM audio playback with AudioTrack
Key benefits for enterprise developers:
- Low-latency streaming: Audio plays as text arrives; 130 ms first-word latency
- On-device processing: Runs in environments with unreliable network connectivity
- Flexible text sources: Works with LLMs, user input, or any streaming text source
How to Build Streaming TTS on Android
Prerequisites
Before you begin, make sure you have the following:
- Android Studio
- Android device or emulator running Android 7.0 (API 24) or higher
- USB debugging enabled on your Android device
- Picovoice Account and AccessKey
1. Project Setup
This tutorial demonstrates a project built with Kotlin and Jetpack Compose, targeting Android 15 (API level 35) with a minimum supported version of Android 7.0 (API level 24).
Add Internet Permission
Include this in your AndroidManifest.xml:
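The declaration is the standard Android INTERNET permission, placed inside the manifest's root element:

```xml
<uses-permission android:name="android.permission.INTERNET" />
```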
Orca Streaming Text-to-Speech requires internet connectivity only for authenticating your AccessKey. All speech synthesis runs entirely on-device.
2. Add Orca Library and Model File
2a. Add Orca Library via Maven Central
In app/build.gradle.kts, add orca-android to your dependencies:
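Assuming a version-catalog alias named orca-android (defined in the next step), the dependency declaration would look like:

```kotlin
dependencies {
    implementation(libs.orca.android)
}
```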
Then in gradle/libs.versions.toml (replace {LATEST_VERSION}, e.g. 1.2.0):
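A sketch of the catalog entry; the Maven coordinates (group ai.picovoice, artifact orca-android) follow Picovoice's published Android packages:

```toml
[versions]
orca = "{LATEST_VERSION}"

[libraries]
orca-android = { group = "ai.picovoice", name = "orca-android", version.ref = "orca" }
```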
Execute a Gradle sync.
2b. Add Orca Model File
Orca uses model files (.pv) for different languages and voices.
- Download your desired model from the Orca GitHub repository. The filename indicates the language and gender of the speaker.
- Place the model in your Android project under:
{ANDROID_APP}/src/main/assets
3. Implement Speech Synthesis with Orca
3a. Initialize Orca
Use Orca.Builder to create an Orca instance:
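A minimal initialization sketch. {ACCESS_KEY} and {ORCA_MODEL_FILE} are placeholders for your Picovoice AccessKey and the .pv file you placed under assets; the builder calls follow the Orca Android SDK:

```kotlin
import ai.picovoice.orca.Orca
import ai.picovoice.orca.OrcaException

val orca: Orca = try {
    Orca.Builder()
        .setAccessKey("{ACCESS_KEY}")      // your Picovoice AccessKey
        .setModelPath("{ORCA_MODEL_FILE}") // .pv file under src/main/assets
        .build(applicationContext)
} catch (e: OrcaException) {
    // Initialization failed (invalid AccessKey, missing model file, etc.)
    throw e
}
```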
3b. Create OrcaStream
Open an OrcaStream object:
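A sketch of opening a stream; the speech-rate setter is shown as an example of an optional OrcaSynthesizeParams setting:

```kotlin
import ai.picovoice.orca.OrcaSynthesizeParams

val params = OrcaSynthesizeParams.Builder()
    .setSpeechRate(1.0f) // optional: 1.0 is the default rate
    .build()
val orcaStream = orca.streamOpen(params)
```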
Optionally, OrcaSynthesizeParams.Builder can be used to configure settings such as speech rate.
3c. Streaming Text to Speech
We'll simulate text streaming by looping through an array of words and passing each word to Orca one at a time:
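A sketch of the word-by-word loop, assuming pcmQueue is a thread-safe queue (e.g. ConcurrentLinkedQueue<ShortArray>) shared with the playback thread set up in step 4:

```kotlin
val words = "Streaming text to speech lets audio start before the sentence is finished".split(" ")

for (word in words) {
    // Orca buffers text until it has enough context; synthesize() may return null
    val pcm: ShortArray? = orcaStream.synthesize("$word ")
    if (pcm != null && pcm.isNotEmpty()) {
        pcmQueue.add(pcm) // hand off to the playback thread
    }
}

// Force synthesis of any text still buffered inside Orca
val remainingPcm: ShortArray? = orcaStream.flush()
if (remainingPcm != null && remainingPcm.isNotEmpty()) {
    pcmQueue.add(remainingPcm)
}
```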
Orca synthesizes speech from text incrementally using a streaming interface. Orca buffers incoming text internally until it has enough context to generate speech.
- synthesize() returns null if Orca needs more text to generate audio.
- Call flush() after passing all text to ensure that any remaining buffered text is synthesized.
- PCM audio chunks are added to a queue for playback, allowing audio to play while more text is still being synthesized.
4. Playing Synthesized Speech with AudioTrack
Once you have PCM audio chunks in a queue, you can play them using AudioTrack, which streams raw PCM audio to the device's speakers.
4a. Configure AudioTrack
Orca outputs mono, 16-bit PCM, with a sample rate of 22050 Hz, which is a common format for speech synthesis. Using AudioTrack in streaming mode allows you to play audio chunks incrementally, keeping latency low.
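A configuration sketch; the sample rate is read from the engine (22050 Hz for current Orca models), and the doubled buffer size is an assumption to give extra headroom against underruns:

```kotlin
import android.media.AudioAttributes
import android.media.AudioFormat
import android.media.AudioManager
import android.media.AudioTrack

val sampleRate = orca.sampleRate
val minBufferSize = AudioTrack.getMinBufferSize(
    sampleRate,
    AudioFormat.CHANNEL_OUT_MONO,
    AudioFormat.ENCODING_PCM_16BIT
)

val audioTrack = AudioTrack(
    AudioAttributes.Builder()
        .setUsage(AudioAttributes.USAGE_MEDIA)
        .setContentType(AudioAttributes.CONTENT_TYPE_SPEECH)
        .build(),
    AudioFormat.Builder()
        .setSampleRate(sampleRate)
        .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
        .setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
        .build(),
    minBufferSize * 2, // extra headroom to avoid underruns
    AudioTrack.MODE_STREAM,
    AudioManager.AUDIO_SESSION_ID_GENERATE
)
audioTrack.play()
```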
Explanation of key settings:
- ENCODING_PCM_16BIT: Matches Orca's 16-bit PCM output.
- CHANNEL_OUT_MONO: Single-channel audio for voice playback; matches Orca's mono PCM output.
- MODE_STREAM: Enables incremental writing of audio data as it's synthesized, instead of buffering everything first.
4b. Play Audio from PCM Queue
Once AudioTrack is configured, you can continuously write PCM chunks from a queue filled by Orca. Using a queue allows synthesis and playback to run simultaneously on separate threads:
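A playback-thread sketch. The isQueueing flag and 10 ms back-off are assumptions: the synthesis thread clears the flag after enqueuing the flush() output, and the short sleep avoids spinning when the queue is momentarily empty:

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicBoolean

val pcmQueue = ConcurrentLinkedQueue<ShortArray>()
val isQueueing = AtomicBoolean(true) // cleared by the synthesis thread after flush()

val playbackThread = Thread {
    while (isQueueing.get() || pcmQueue.isNotEmpty()) {
        val pcm = pcmQueue.poll()
        if (pcm != null) {
            // Blocking write streams the chunk straight to the audio hardware
            audioTrack.write(pcm, 0, pcm.size)
        } else {
            Thread.sleep(10) // nothing queued yet; back off briefly
        }
    }
}
playbackThread.start()
```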
Key points:
- isQueueing.get(): Ensures playback continues while new audio chunks are being synthesized.
- pcmQueue.poll(): Fetches the next available PCM chunk for immediate playback.
- audioTrack.write(): Streams PCM data directly to the audio hardware.
5. Stop & Clean Up Resources
When done, always clean up resources to free memory:
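A teardown sketch matching the objects created above; the ordering (drain playback, then release the track, then close the stream before deleting the engine) is the assumption here:

```kotlin
isQueueing.set(false) // let the playback thread drain and exit
playbackThread.join()

audioTrack.stop()
audioTrack.release()

orcaStream.close() // release the stream before the engine
orca.delete()
```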
Complete Example: Android Streaming TTS
Below is a simplified but complete example demonstrating:
- State handling (Initial, Loading, Ready, Streaming)
- Buttons to initialize Orca, stream text, and stop/cleanup
- Multithreaded PCM synthesis and playback
Replace {ORCA_MODEL_FILE} with your model file (.pv) and {ACCESS_KEY} with your Picovoice AccessKey.
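Below is a condensed sketch of such an activity, not the full demo: error handling and audio-focus handling are omitted, the sample sentence is illustrative, and the placeholders are left for you to fill in:

```kotlin
import android.media.*
import android.os.Bundle
import androidx.activity.ComponentActivity
import androidx.activity.compose.setContent
import androidx.compose.foundation.layout.Column
import androidx.compose.material3.Button
import androidx.compose.material3.Text
import androidx.compose.runtime.*
import ai.picovoice.orca.Orca
import ai.picovoice.orca.OrcaSynthesizeParams
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicBoolean

enum class UiState { Initial, Loading, Ready, Streaming }

class MainActivity : ComponentActivity() {
    private var orca: Orca? = null
    private val pcmQueue = ConcurrentLinkedQueue<ShortArray>()
    private val isQueueing = AtomicBoolean(false)

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContent {
            var state by remember { mutableStateOf(UiState.Initial) }
            Column {
                Button(enabled = state == UiState.Initial, onClick = {
                    state = UiState.Loading
                    Thread {
                        orca = Orca.Builder()
                            .setAccessKey("{ACCESS_KEY}")
                            .setModelPath("{ORCA_MODEL_FILE}")
                            .build(applicationContext)
                        runOnUiThread { state = UiState.Ready }
                    }.start()
                }) { Text("Initialize") }

                Button(enabled = state == UiState.Ready, onClick = {
                    state = UiState.Streaming
                    streamAndPlay { state = UiState.Ready }
                }) { Text("Stream Text") }

                Button(onClick = {
                    isQueueing.set(false)
                    orca?.delete()
                    orca = null
                    state = UiState.Initial
                }) { Text("Stop & Clean Up") }
            }
        }
    }

    private fun streamAndPlay(onDone: () -> Unit) {
        val engine = orca ?: return
        isQueueing.set(true)

        // Synthesis thread: feed words to Orca one at a time
        Thread {
            val stream = engine.streamOpen(OrcaSynthesizeParams.Builder().build())
            val words = "Hello! This audio is synthesized on-device, word by word.".split(" ")
            for (word in words) {
                stream.synthesize("$word ")?.let { if (it.isNotEmpty()) pcmQueue.add(it) }
            }
            stream.flush()?.let { if (it.isNotEmpty()) pcmQueue.add(it) }
            stream.close()
            isQueueing.set(false)
        }.start()

        // Playback thread: drain the PCM queue into AudioTrack
        Thread {
            val minBuf = AudioTrack.getMinBufferSize(
                engine.sampleRate, AudioFormat.CHANNEL_OUT_MONO, AudioFormat.ENCODING_PCM_16BIT
            )
            val track = AudioTrack(
                AudioAttributes.Builder().setUsage(AudioAttributes.USAGE_MEDIA)
                    .setContentType(AudioAttributes.CONTENT_TYPE_SPEECH).build(),
                AudioFormat.Builder().setSampleRate(engine.sampleRate)
                    .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
                    .setChannelMask(AudioFormat.CHANNEL_OUT_MONO).build(),
                minBuf * 2, AudioTrack.MODE_STREAM, AudioManager.AUDIO_SESSION_ID_GENERATE
            )
            track.play()
            while (isQueueing.get() || pcmQueue.isNotEmpty()) {
                pcmQueue.poll()?.let { track.write(it, 0, it.size) } ?: Thread.sleep(10)
            }
            track.stop(); track.release()
            runOnUiThread(onDone)
        }.start()
    }
}
```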
For a complete Android application, see the Orca Streaming Text-to-Speech Android demo on GitHub.
This tutorial uses the following package:
Explore our documentation for more details:
Troubleshooting
- Initialization fails: Ensure the model file exists in assets and is copied to internal storage.
- No audio output: Verify your device's volume and audio routing, and confirm that the AudioTrack sample rate and channel configuration match Orca's output (mono, 16-bit PCM at a sample rate of 22050 Hz).
- Latency or gaps in streaming: Use proper queue management, pass text chunks incrementally as they become available, and call flush() when the stream completes.
Next Steps
Optimize Streaming TTS for Production Android Applications
- Permissions: If your app targets Android 12 (API 31) or later, review runtime permission requests carefully. While the current example only requires INTERNET for authentication, additional network or audio features may require dynamic permission handling.
- Audio focus: To avoid conflicts with other audio apps, request audio focus when playing TTS. Consider handling focus loss gracefully (pause/resume) for a better user experience.
- Threading and lifecycle management: When streaming is done, cancel background threads and clean up Orca and AudioTrack to prevent memory leaks or audio glitches.
- Error handling: For production, display user-friendly messages when initialization fails or streaming errors occur.
Further Improvements
Once you have streaming Text-to-Speech implemented, consider building a complete voice AI assistant for Android by integrating:
- Cheetah Streaming Speech-to-Text: for real-time, on-device speech-to-text
- picoLLM On-Device LLM Inference: for on-device LLM inference, enabling live text generation for conversational experiences
With Orca, Cheetah, and picoLLM, you can implement fully on-device voice AI that streams LLM output to TTS with minimal latency, offering a secure, private, and responsive solution suitable for enterprise Android apps.