Building real-time voice interfaces on iOS presents a significant challenge for enterprise developers. Apple's native AVSpeechSynthesizer cannot process streaming text input: it requires the full text before generating any speech. This makes it incapable of handling token-by-token or partial large language model (LLM) outputs, which are essential for dual-streaming scenarios where text and audio are generated concurrently.
Cloud-based Text-to-Speech (TTS) services like Amazon Polly, Azure TTS, ElevenLabs TTS, and OpenAI TTS partially address this, but they often introduce delays of up to 2000 ms that break conversational flow and make reading live LLM outputs aloud feel sluggish. Cloud dependency also rules out applications with strict data-privacy requirements and ties performance to a reliable internet connection.
The solution is on-device, streaming speech synthesis with Orca Streaming Text-to-Speech. Orca is capable of processing streaming input—producing incremental audio as text arrives, enabling near-instant responses and natural conversational latency. On-device, streaming voice generation is ideal for real-time assistants, accessibility tools, live translation, and any application that needs to narrate LLM output as it streams—such as from picoLLM On-device LLM Inference.
What you'll build:
This tutorial demonstrates how to implement an iOS Streaming Text-to-Speech system using the Orca Streaming Text-to-Speech iOS SDK for voice generation combined with AVAudioEngine for real-time audio playback.
Key benefits for enterprise developers:
- Ultra-low latency: Audio plays immediately as text is processed, rather than waiting for the full input
- On-device processing: Sensitive data stays on-device, and applications maintain performance even in environments with spotty internet
- LLM-ready: Stream real-time voice directly from language model outputs
How to Implement Streaming TTS on iOS
Prerequisites
Before starting, ensure you have:
- Xcode
- iOS device or simulator (iOS 16.0 or higher)
- Swift Package Manager or CocoaPods
- Picovoice Account and AccessKey
1. Add Orca Library and Model File
Create a SwiftUI project in Xcode. This tutorial uses ContentView.swift as the main interface for the application.
1a. Add Orca to Your Project
Use Swift Package Manager:
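In Xcode, go to File → Add Packages… (menu wording varies by Xcode version) and enter the repository URL. If your project uses a Package.swift manifest instead, a minimal sketch looks like this; the URL, version, and product name are assumptions, so verify them against Picovoice's documentation:

```swift
// swift-tools-version:5.7
import PackageDescription

let package = Package(
    name: "StreamingTTSDemo",
    platforms: [.iOS(.v16)],
    dependencies: [
        // Repository URL and version are assumptions; confirm against
        // Picovoice's current documentation before pinning.
        .package(url: "https://github.com/Picovoice/orca.git", from: "1.0.0")
    ],
    targets: [
        .target(
            name: "StreamingTTSDemo",
            dependencies: [.product(name: "Orca", package: "orca")]
        )
    ]
)
```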
Or use CocoaPods:
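With CocoaPods, the pod is expected to be named Orca-iOS (add pod 'Orca-iOS' to your Podfile and run pod install); verify the exact pod name against Picovoice's current documentation.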
1b. Add Your Orca Model File
Orca uses model files (.pv) for different languages and voices.
- Download the desired model from the Orca GitHub repository. Filenames indicate language and speaker gender.
- Add the file as a bundled resource via Build Phases → Copy Bundle Resources.
2. Implement Voice Generation with Orca
2a. Initialize Orca
Initialize an instance of Orca with your AccessKey and model file:
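A minimal sketch, assuming the Swift binding exposes an Orca(accessKey:modelPath:) initializer; replace the placeholders as described at the end of this tutorial:

```swift
import Orca

// Resolve the bundled model; the resource name is your .pv file without the extension.
guard let modelPath = Bundle.main.path(forResource: "${ORCA_MODEL_FILE}", ofType: "pv") else {
    fatalError("Orca model file not found in the app bundle")
}

// ${ACCESS_KEY} is the AccessKey from your Picovoice Console account.
let orca = try Orca(accessKey: "${ACCESS_KEY}", modelPath: modelPath)
```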
2b. Open a Streaming Instance
Create an OrcaStream object to prepare for streaming synthesis:
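Assuming the SDK exposes streamOpen() on the Orca instance:

```swift
// One stream can be reused across all text chunks of a single utterance.
let orcaStream = try orca.streamOpen()
```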
2c. Set up Thread-safe PCM Queue
In later steps, we'll set up our audio pipeline so that speech synthesis runs on one thread while audio playback runs on another. This will allow playback to start as soon as PCM data generated by Orca becomes available.
To prepare for this, create a thread-safe queue to safely pass PCM data between these threads:
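A minimal sketch of such a queue using NSLock; the class name PCMQueue is our own, and any thread-safe FIFO works:

```swift
import Foundation

/// Thread-safe FIFO for 16-bit PCM chunks passed from the synthesis
/// thread to the playback thread.
final class PCMQueue {
    private var chunks: [[Int16]] = []
    private let lock = NSLock()

    func enqueue(_ pcm: [Int16]) {
        lock.lock()
        defer { lock.unlock() }
        chunks.append(pcm)
    }

    /// Returns nil when the queue is empty.
    func dequeue() -> [Int16]? {
        lock.lock()
        defer { lock.unlock() }
        return chunks.isEmpty ? nil : chunks.removeFirst()
    }
}
```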
2d. Synthesize Text in Chunks
Pass text incrementally to stream.synthesize() as chunks become available:
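A sketch of the synthesis loop; the synthesize(text:) and flush() names follow the pattern of other Picovoice Swift SDKs, so check the Orca API docs for the exact signatures:

```swift
let pcmQueue = PCMQueue()

// Stand-in for an incremental text source such as streamed LLM tokens.
let textChunks = ["Streaming ", "text-to-speech ", "on iOS."]

for textChunk in textChunks {
    // synthesize() returns nil until Orca has buffered enough text.
    if let pcm = try orcaStream.synthesize(text: textChunk) {
        pcmQueue.enqueue(pcm)
    }
}

// Flush any text still buffered inside the stream.
if let pcm = try orcaStream.flush() {
    pcmQueue.enqueue(pcm)
}
```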
OrcaStream automatically buffers small chunks of text until it has enough context to synthesize speech audio.
- synthesize() returns nil if Orca needs more text to generate audio.
- Call flush() after passing all text to ensure that any remaining buffered text is synthesized.
- PCM audio chunks are added to a queue for playback, allowing audio to be played while more text is being synthesized.
3. Audio Playback with AVAudioEngine
Orca outputs mono, 16-bit PCM, with a sample rate of 22050 Hz. On iOS, the following components enable real-time playback:
- AVAudioEngine: Audio playback pipeline
- AVAudioPlayerNode: Streams PCM buffers incrementally
- PCMQueue: Thread-safe queue for synthesized audio chunks
3a. Configure Audio Session
Set up AVAudioSession for audio playback.
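For example:

```swift
import AVFoundation

// Configure the shared audio session for playback before starting the engine.
let session = AVAudioSession.sharedInstance()
try session.setCategory(.playback, mode: .default)
try session.setActive(true)
```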
3b. Schedule PCM Buffers
Incrementally feed PCM buffers into AVAudioPlayerNode for real-time audio playback.
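A sketch of the playback side; PCMQueue is the queue from step 2c, and converting Int16 samples to Float32 is our choice here, since AVAudioEngine's mixer processes deinterleaved float audio:

```swift
import AVFoundation

let engine = AVAudioEngine()
let playerNode = AVAudioPlayerNode()

// Match Orca's output: mono at 22050 Hz, scheduled as Float32 buffers.
let playbackFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                   sampleRate: 22050,
                                   channels: 1,
                                   interleaved: false)!

engine.attach(playerNode)
engine.connect(playerNode, to: engine.mainMixerNode, format: playbackFormat)
try engine.start()
playerNode.play()

/// Converts one Int16 PCM chunk to a Float32 buffer and schedules it on the player.
func schedule(_ pcm: [Int16]) {
    let frameCount = AVAudioFrameCount(pcm.count)
    guard let buffer = AVAudioPCMBuffer(pcmFormat: playbackFormat,
                                        frameCapacity: frameCount) else { return }
    buffer.frameLength = frameCount
    let samples = buffer.floatChannelData![0]
    for i in 0..<pcm.count {
        samples[i] = Float(pcm[i]) / Float(Int16.max)  // scale to [-1.0, 1.0]
    }
    playerNode.scheduleBuffer(buffer, completionHandler: nil)
}

// Drain the queue on a background thread. Simplified: a production loop
// would wait for new chunks instead of exiting when the queue is empty.
DispatchQueue.global(qos: .userInitiated).async {
    while let pcm = pcmQueue.dequeue() {
        schedule(pcm)
    }
}
```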
4. Stop & Clean Up Resources
When done with audio streaming, clean up resources to prevent memory leaks:
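A sketch, assuming the stream's close() and the engine instance's delete() follow the usual Picovoice Swift pattern for releasing native resources:

```swift
// Stop playback first, then release Orca's native resources.
playerNode.stop()
engine.stop()
orcaStream.close()
orca.delete()
try? AVAudioSession.sharedInstance().setActive(false)
```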
Complete SwiftUI Example: On-device TTS
The following SwiftUI view demonstrates:
- Initializing Orca
- Streaming TTS from a text field
- Incremental PCM playback
- Thread-safe PCM queue management
Replace ${ORCA_MODEL_FILE} with your model file (.pv) and ${ACCESS_KEY} with your Picovoice AccessKey.
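Below is a condensed sketch assembling the snippets above into one view. Orca method names (streamOpen, synthesize(text:), flush, close, delete) are the assumptions stated in the earlier steps; the GitHub demo linked below is the authoritative reference.

```swift
import SwiftUI
import AVFoundation
import Orca

struct ContentView: View {
    @State private var text = "Hello from Orca on iOS!"
    @State private var errorMessage: String?
    @StateObject private var synthesizer = StreamingSynthesizer()

    var body: some View {
        VStack(spacing: 16) {
            TextField("Enter text to speak", text: $text)
                .textFieldStyle(.roundedBorder)
            Button("Speak") {
                do { try synthesizer.speak(text) } catch { errorMessage = "\(error)" }
            }
            if let errorMessage {
                Text(errorMessage).foregroundColor(.red)
            }
        }
        .padding()
    }
}

/// Glues steps 1 to 4 together: Orca synthesis on a background thread,
/// playback through AVAudioEngine as PCM chunks arrive.
final class StreamingSynthesizer: ObservableObject {
    private let engine = AVAudioEngine()
    private let playerNode = AVAudioPlayerNode()
    private let format = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                       sampleRate: 22050,
                                       channels: 1,
                                       interleaved: false)!
    private var orca: Orca?

    init() {
        try? AVAudioSession.sharedInstance().setCategory(.playback)
        try? AVAudioSession.sharedInstance().setActive(true)
        engine.attach(playerNode)
        engine.connect(playerNode, to: engine.mainMixerNode, format: format)
        try? engine.start()
        if let modelPath = Bundle.main.path(forResource: "${ORCA_MODEL_FILE}", ofType: "pv") {
            orca = try? Orca(accessKey: "${ACCESS_KEY}", modelPath: modelPath)
        }
    }

    func speak(_ input: String) throws {
        guard let orca else { return }
        let stream = try orca.streamOpen()
        playerNode.play()
        DispatchQueue.global(qos: .userInitiated).async { [weak self] in
            // Word-sized chunks stand in for a true incremental source (e.g., LLM tokens).
            for word in input.split(separator: " ") {
                if let pcm = try? stream.synthesize(text: String(word) + " ") {
                    self?.schedule(pcm)
                }
            }
            if let pcm = try? stream.flush() {
                self?.schedule(pcm)
            }
            stream.close()
        }
    }

    private func schedule(_ pcm: [Int16]) {
        let frameCount = AVAudioFrameCount(pcm.count)
        guard frameCount > 0,
              let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: frameCount) else { return }
        buffer.frameLength = frameCount
        let samples = buffer.floatChannelData![0]
        for i in 0..<pcm.count {
            samples[i] = Float(pcm[i]) / Float(Int16.max)
        }
        playerNode.scheduleBuffer(buffer, completionHandler: nil)
    }
}
```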
For a complete iOS application, see the Orca Streaming Text-to-Speech iOS demo on GitHub.
Explore our documentation for more details.
Troubleshooting
- Initialization fails: Ensure the model file has been correctly bundled as a resource via Build Phases → Copy Bundle Resources.
- No audio output: Verify your device's volume, audio routing, and that the AVAudioFormat sample rate and channel configuration match Orca's output (mono, 16-bit PCM at 22050 Hz).
- Latency or gaps in streaming: Use proper queue management. Ensure text chunks are passed as soon as they are available and flush() is called when the stream completes.
Next Steps
Optimize Streaming TTS on iOS in Production
- Audio focus: Ensure your app handles interruptions smoothly with AVAudioSession
- Threading: Cancel synthesis tasks on view dismissal; clear audio queues to prevent playback after exit
- Error handling: Display user-friendly errors and log failures for analytics
- Multi-language support: Use multiple model files for different voices/languages
- Custom pronunciations: Orca Streaming Text-to-Speech supports custom pronunciations
Expand Your Application
- Pair streaming TTS with real-time transcription using Cheetah Streaming Speech-to-Text to enable conversational voice interfaces
- Add picoLLM On-device LLM Inference to build enterprise-grade voice assistants
- Integrate with other iOS speech recognition engines to build a complete, end-to-end voice AI application.
With Orca Streaming Text-to-Speech and AVAudioEngine, iOS developers can implement secure, low-latency streaming TTS, suitable for enterprise apps, accessibility, and live LLM voice output.