Speech to Text Transcription in iOS Tutorial

Mobile apps are an ideal use case for Speech Recognition, whether for hands-free dictation, voice interfaces for mobile games, or generating subtitles for video and audio messages.

The iPhone is powered by iOS, Apple's flagship mobile operating system, and the iPad and Apple Watch run the closely related iPadOS and watchOS. iOS features its own Speech Recognition API, but it can be clumsy and verbose to integrate. Crucially, not all of the languages it supports offer on-device recognition, and even those that do may stream audio to Apple's servers, introducing privacy concerns and latency.

Fortunately, Picovoice's Speech-to-Text technology does not have these downsides, and integrates seamlessly into the iOS ecosystem.

In addition to iOS, Picovoice's Speech-to-Text engines are compatible with a wide array of environments, such as Android, Linux, macOS, Windows, and modern web browsers (via WebAssembly).

With Speech-to-Text transcription, there are two main approaches: Real-Time and Batch.

Real-Time Speech-to-Text

Real-time Speech-to-Text systems produce text as the user speaks, mirroring how humans listen and mentally convert speech into text during a conversation. The downside is that early output can contain errors caused by acoustic or semantic ambiguity, which often only becomes apparent once a sentence is finished. It is therefore important to weigh this drawback when deciding whether an application really needs real-time transcription.

Real-Time Speech-to-Text, Online Automatic Speech Recognition, and Streaming Speech-to-Text all refer to the same core technology.

For iOS devices, Picovoice provides Cheetah Streaming Speech-to-Text, which performs all voice recognition in real time directly on the device. This approach avoids network-related delays and minimizes the latency between the user's speech input and the transcription output.

Cheetah is available through software development kits (SDKs) for several platforms, each with its own quick-start guide. The snippet below shows the core calls in Python.

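import pvcheetah

# access_key is the AccessKey obtained from Picovoice Console
cheetah = pvcheetah.create(access_key)

# get_next_audio_frame() is a placeholder that returns one frame of audio
partial_transcript, is_endpoint = cheetah.process(get_next_audio_frame())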
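In practice, the single process call is repeated for every incoming audio frame, and partial results are stitched together until the engine reports an endpoint. The following is a minimal sketch of that loop, not Picovoice's reference implementation. It assumes an engine object exposing Cheetah-style `process(frame)` and `flush()` methods; `transcribe_stream` and `FakeEngine` are hypothetical names introduced here so the example runs without the SDK, an AccessKey, or a microphone.

```python
from typing import Iterable, List, Sequence


def transcribe_stream(engine, frames: Iterable[Sequence[int]]) -> List[str]:
    """Feed audio frames to a streaming engine and return finalized utterances.

    Assumption: `engine` exposes Cheetah-style methods
      process(frame) -> (partial_transcript, is_endpoint)
      flush()        -> any text the engine was still holding back
    """
    utterances: List[str] = []
    current = ""
    for frame in frames:
        partial, is_endpoint = engine.process(frame)
        current += partial
        if is_endpoint:
            # The speaker paused: collect the remaining text and close the utterance.
            current += engine.flush()
            if current.strip():
                utterances.append(current.strip())
            current = ""
    # The audio ended without an endpoint: flush whatever is still buffered.
    current += engine.flush()
    if current.strip():
        utterances.append(current.strip())
    return utterances


class FakeEngine:
    """Hypothetical stand-in that replays a scripted sequence of results."""

    def __init__(self, script):
        # Each script entry is (partial_transcript, is_endpoint, text_held_for_flush).
        self._script = iter(script)
        self._tail = ""

    def process(self, frame):
        partial, is_endpoint, self._tail = next(self._script)
        return partial, is_endpoint

    def flush(self):
        tail, self._tail = self._tail, ""
        return tail


# Dummy silent frames; a real app would read them from the microphone.
demo_frames = [[0] * 512 for _ in range(3)]
demo_engine = FakeEngine([
    ("turn on ", False, ""),
    ("the ", True, "lights"),
    ("thank ", False, "you"),
])
print(transcribe_stream(demo_engine, demo_frames))
```

Displaying `current` as it grows gives the user live feedback, while storing only the flushed utterances keeps the text that the engine has had the most context to settle on.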

Batch Speech-to-Text

Unlike real-time transcription, Batch Speech-to-Text waits until the spoken phrase is complete before producing a transcription. Compared to real-time approaches, this method offers higher accuracy and runtime efficiency. Because the whole utterance is available, the engine can use the words that follow as context, resolving both linguistic and acoustic ambiguities more precisely. It also avoids alternating between listening and transcribing, which improves overall efficiency.

For iOS-based devices, Picovoice offers Leopard Speech-to-Text, a state-of-the-art technology for batch transcription tasks. Like Cheetah, Leopard processes all voice data on device, providing privacy by design and supporting compliance with regulations such as HIPAA and GDPR. To further improve accuracy, users can add custom vocabulary and boost specific phrases via the Picovoice Console.

Leopard is likewise available through SDKs for several platforms, each with its own quick-start guide. The snippet below shows the core calls in Python.

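import pvleopard

# access_key is the AccessKey obtained from Picovoice Console
leopard = pvleopard.create(access_key)

# path points to the audio file to transcribe in a single call
transcript, words = leopard.process_file(path)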
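The `words` value returned alongside the transcript is what makes the subtitle use case mentioned earlier practical. The sketch below groups timed words into SubRip (SRT) cues. It assumes each word carries `word`, `start_sec`, and `end_sec` fields; the `Word` tuple, the function names, and the grouping thresholds are illustrative choices made here, not part of the SDK.

```python
from typing import List, NamedTuple, Sequence


class Word(NamedTuple):
    # Assumed shape of one entry in the `words` output (illustrative stand-in).
    word: str
    start_sec: float
    end_sec: float


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: Sequence[Word], max_words: int = 7,
                 max_gap_sec: float = 0.8) -> str:
    """Group words into cues, starting a new cue after a long pause
    or once the current cue holds `max_words` words."""
    cues: List[List[Word]] = []
    for w in words:
        start_new = (
            not cues
            or len(cues[-1]) >= max_words
            or w.start_sec - cues[-1][-1].end_sec > max_gap_sec
        )
        if start_new:
            cues.append([])
        cues[-1].append(w)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w.word for w in cue)
        span = f"{format_timestamp(cue[0].start_sec)} --> {format_timestamp(cue[-1].end_sec)}"
        blocks.append(f"{index}\n{span}\n{text}")
    return "\n\n".join(blocks) + ("\n" if blocks else "")


# Hand-written sample data standing in for real engine output.
sample = [
    Word("hello", 0.0, 0.4),
    Word("world", 0.5, 0.9),
    Word("goodbye", 2.5, 3.0),
]
print(words_to_srt(sample))
```

With real output, the list returned by the engine could be passed to `words_to_srt` directly, provided its entries expose the same three fields.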
