The adoption of voice user interfaces (VUIs) on mobile platforms has increased rapidly in recent years. This surge is driven by voice assistants created by mobile platform owners Apple, Google, and Samsung.
Despite the ubiquity of voice assistants, third-party mobile apps struggle to offer compelling voice-enabled experiences tailored to their domains of interest. Developing voice recognition technology is complex and out of reach for most companies. Additionally, platform owners provide limited programmatic access to their voice technology, which in turn limits the app user experience. And although several FOSS (free and open-source software) and commercial voice recognition technologies are available, each faces obstacles when used on mobile platforms.
Below we survey popular options for adding voice interfaces to a mobile app, starting with cross-platform technologies and then exploring platform-specific solutions. We compare cloud APIs (Amazon Lex, Google Dialogflow, IBM Watson), Picovoice, PocketSphinx, Kaldi, Mozilla DeepSpeech, Rasa, the iOS Speech Recognition API, SiriKit, the Android SpeechRecognizer, and Google Assistant Actions across factors including accuracy, flexibility, and cost, so that you can weigh them before building voice UIs on mobile.
Cloud-based solutions such as Amazon Lex, Google Dialogflow, and IBM Watson are easy to integrate, have high accuracy, and support multiple languages. The leading cloud providers offer combined Speech-to-Text (STT) and Natural Language Understanding (NLU) capabilities with (limited) STT tuning for improved performance.
Cloud voice recognition has two main drawbacks: lack of privacy and high operating cost. Voice data leaves the device, which raises privacy concerns, and in certain use cases, such as health care, transferring that data may breach regulatory requirements. Cloud providers also charge per API call, which can become prohibitively expensive as you gain customer traction. For example, Google Dialogflow charges $0.0065 per API call. At that rate, a voice-enabled application with 4 voice interactions per day costs $9.49 per user per year (4 × 365 × $0.0065), while 100 voice interactions a day pushes the cost up to $237.25. The figure below provides a more comprehensive comparison of Picovoice vs Amazon Lex, Microsoft LUIS, Google Dialogflow, and IBM Watson.
The need for connectivity also makes it impossible to guarantee a reliable and responsive experience, since neither the existence nor the quality of an Internet connection is assured. Finally, cloud-based solutions are particularly inefficient in terms of battery consumption. They need to stream data to and from the device, which limits their usefulness in always-listening and long-running applications.
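To make the per-call pricing concrete, the per-user cost figures quoted for Google Dialogflow can be reproduced with a few lines of Python. This is a minimal sketch that assumes one billed API call per voice interaction and a 365-day year; the function and constant names are illustrative and not part of any provider's SDK.

```python
DAYS_PER_YEAR = 365

# USD per API call, the Dialogflow figure quoted in this article.
DIALOGFLOW_PRICE_PER_CALL = 0.0065


def annual_cost_per_user(price_per_call: float, interactions_per_day: int) -> float:
    """Yearly API cost for one user.

    Assumption: every voice interaction is billed as exactly one API call.
    """
    return price_per_call * interactions_per_day * DAYS_PER_YEAR


for interactions in (4, 100):
    cost = annual_cost_per_user(DIALOGFLOW_PRICE_PER_CALL, interactions)
    print(f"{interactions} interactions/day: ${cost:.2f} per user per year")
```

Multiplying the result by the expected number of active users gives a rough sense of how the bill scales as an app gains traction.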
PocketSphinx can run offline on Android and iOS devices. Its main drawback is low recognition accuracy. Additionally, PocketSphinx only offers STT capability. Inferring intent from the transcribed text has to be done separately by NLU software, which often reduces overall accuracy because errors from the two stages compound.
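The compounding effect of a separate STT and NLU stage can be illustrated with a toy pipeline. The sketch below does not call PocketSphinx or any real NLU library: the transcripts are hard-coded stand-ins for STT output, and `INTENT_PATTERNS` and `infer_intent` are hypothetical names for a simple rule-based NLU step.

```python
import re
from typing import Optional

# Hypothetical rule-based NLU: each intent is matched by a text pattern.
INTENT_PATTERNS = {
    "pause_workout": re.compile(r"\bpause (my )?workout\b"),
    "resume_workout": re.compile(r"\bresume (my )?workout\b"),
}


def infer_intent(transcript: str) -> Optional[str]:
    """Return the first intent whose pattern matches the transcript, else None."""
    text = transcript.lower()
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(text):
            return intent
    return None


# Stand-ins for STT output: one correct transcript, one with a single misheard word.
correct_transcript = "pause my workout"
misheard_transcript = "paws my workout"

print(infer_intent(correct_transcript))   # pause_workout
print(infer_intent(misheard_transcript))  # None
```

A single transcription error is enough to make the NLU stage fail, even though the NLU rules themselves are correct. As a purely hypothetical illustration, if each stage were right 90% of the time and their errors were independent, the pipeline as a whole would be right only about 81% of the time.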
Kaldi is a speech recognition toolkit intended for use by speech recognition researchers. It is possible to train highly accurate models using Kaldi and then optimize the implementation for running on ARM-based Android and iOS devices.
The main drawbacks of Kaldi are its steep learning curve and lack of production-ready code. A company that wants to use Kaldi needs to hire speech scientists and C++ programmers. It also needs to invest significant time and money upfront in gathering data, training the model, and optimizing the runtime engine for mobile execution. Additionally, Kaldi only offers STT capability. Inferring intent from the transcribed text requires a separate NLU engine.
Mozilla DeepSpeech & Rasa
Mozilla DeepSpeech is an STT engine supported by Mozilla, and Rasa is an open-source NLU framework. Both are widely used but are too heavyweight to run on mobile devices at the time of this writing.
The Picovoice Rhino Speech-to-Intent engine is a bespoke, jointly optimized STT and NLU engine, tuned for the domain of interest. It is an end-to-end model that outperforms alternatives in accuracy and runtime efficiency and can work offline on any mobile device. Compared to Google’s Dialogflow, it achieves a 40% higher command acceptance rate while requiring less than 2 MB of RAM and less than 1% CPU usage on most modern mobile devices.
Rhino Speech-to-Intent is fully customizable for the domain of interest. Combined with the Picovoice Porcupine wake-word engine, it can provide a unique user experience. For example, using Picovoice technology, it is possible to control an app using the command below:
Hey Strava, pause my workout.
With SiriKit, the same command would look like this:
Hey Siri, pause my workout in Strava.
Speech Recognition API
The iOS Speech Recognition API is a free cloud-based STT service offered by Apple. Although free, it comes with strict usage rate limits. Also, intent inference from the transcribed text needs to be handled separately. This separation results in subpar performance, especially for domain-specific vocabulary.
SiriKit gives third-party apps API access to some of Siri's functionality. It offers an end-to-end speech intent inference solution and is free. SiriKit supports a limited number of domains, such as music and fitness. Hence, if a third-party app doesn't fall into one of the existing domains, it likely cannot make use of SiriKit. Additionally, apps that can use the existing domains cannot customize them to their needs or brand. For example, the wake phrase is always "Hey Siri" and cannot be changed.
The Android SpeechRecognizer transcribes an audio stream into text. It is free to use without any usage limitations. Offline capability is available on the latest Google Pixel phones, and the end user can add it manually on some older Android devices. Intent inference needs to happen separately. Hence, the performance can be subpar, especially for domain-specific vocabulary.
Google Assistant Actions
Google Assistant Actions is a free service that performs end-to-end intent inference accurately. Similar to SiriKit, it has a limited number of predefined domains. If an app does not belong to one of them, it cannot make use of Google Assistant Actions. Customizability is limited: the wake phrase has to be “OK Google”, and the user has to leave the app to interact with the assistant, which can result in a poor user experience. Finally, there is no support for offline operation.
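The trade-offs surveyed above can be collected into a small lookup for shortlisting. The sketch below encodes only two properties as this article describes them: whether an option works offline and whether it goes from speech to intent without a separate NLU step. The `VoiceOption` class and `shortlist` function are illustrative names, and `None` marks a property that the article does not state or that depends on the device. Mozilla DeepSpeech and Rasa are left out because they are described as too heavyweight for mobile.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VoiceOption:
    name: str
    offline: Optional[bool]  # None = not stated in this article, or device-dependent
    infers_intent: bool      # True = speech to intent without a separate NLU step


OPTIONS = [
    VoiceOption("Cloud APIs (Lex, Dialogflow, Watson)", offline=False, infers_intent=True),
    VoiceOption("PocketSphinx", offline=True, infers_intent=False),
    VoiceOption("Kaldi", offline=True, infers_intent=False),  # after custom mobile optimization
    VoiceOption("Picovoice Rhino", offline=True, infers_intent=True),
    VoiceOption("iOS Speech Recognition API", offline=False, infers_intent=False),
    VoiceOption("SiriKit", offline=None, infers_intent=True),
    VoiceOption("Android SpeechRecognizer", offline=None, infers_intent=False),
    VoiceOption("Google Assistant Actions", offline=False, infers_intent=True),
]


def shortlist(options: List[VoiceOption], need_offline: bool, need_intent: bool) -> List[str]:
    """Names of the options that satisfy the requested requirements."""
    return [
        o.name
        for o in options
        if (not need_offline or o.offline is True)
        and (not need_intent or o.infers_intent)
    ]


print(shortlist(OPTIONS, need_offline=True, need_intent=False))
```

Further columns such as cost, supported domains, or wake-phrase customization could be added in the same way, using whatever values apply to your own evaluation.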