Voice UI on Mobile: Challenges and Opportunities
October 10, 2019
Adoption of voice interfaces on mobile platforms has grown rapidly in recent years. This surge is almost entirely driven by voice assistants created by mobile platform owners Apple, Google, and Samsung.
Despite voice assistant ubiquity, third-party mobile apps are struggling to offer compelling voice-enabled experiences tailored for their domains of interest. Developing voice recognition technology is complex and out of reach for the majority of companies. Additionally, platform owners provide only limited programmatic access to their voice technology, which in turn limits the user experience an app can offer. And although there are several open-source and commercial voice recognition technologies available, each has obstacles when used on mobile platforms.
Below we will survey popular options for adding voice interfaces to a mobile app, starting with cross-platform technologies and then exploring platform-specific solutions.
Cloud-based solutions such as Amazon Lex, Google Dialogflow, and IBM Watson are easy to integrate, have high accuracy, and support multiple languages. The leading cloud providers offer combined speech-to-text (STT) and natural language understanding (NLU) capabilities with (limited) STT tuning for improved performance.
There are two major drawbacks with cloud voice recognition: lack of privacy and high operating cost. The voice data needs to be transferred off the device, which raises privacy concerns and in certain use cases, such as health care, may breach regulatory compliance. Most providers charge a minimum of $0.004 per API call. Assuming 10 voice interactions per user per day, this adds up to about $14.60 per user per year.
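The per-user cost estimate above is simple arithmetic. A minimal sketch, assuming the $0.004 per-call price and 10 interactions per day quoted in the text, one API call per interaction, and a 365-day year (the function and parameter names are illustrative and not part of any provider's SDK):

```python
def annual_cost_per_user(price_per_call, calls_per_day, days_per_year=365):
    """Estimate yearly cloud voice API spend for a single user, in dollars."""
    return price_per_call * calls_per_day * days_per_year


# Figures quoted in the text: $0.004 per call, 10 voice interactions per day.
cost = annual_cost_per_user(price_per_call=0.004, calls_per_day=10)
print(f"${cost:.2f} per user per year")
```

Under these assumptions the figure comes out to about $14.60 per user per year, and it scales linearly with the number of users and with how often each of them speaks to the app.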
The need for connectivity makes it impossible to provide a reliable and responsive experience, as the existence and quality of an Internet connection is not guaranteed. Finally, cloud-based solutions are particularly inefficient in terms of battery power consumption as they need to continuously stream data to and from the device, which limits their usefulness in always-listening and long-running applications.
PocketSphinx can run offline on Android and iOS devices. Its main drawback is low recognition accuracy. Additionally, PocketSphinx only offers STT capability and the inference of intent from transcribed text needs to be performed separately by NLU software. This often results in reduced performance from compounded errors.
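One way to see why chaining separate STT and NLU stages compounds errors: if the two stages fail independently, the chance that a command survives both is the product of their individual accuracies. The sketch below makes that simplifying assumption. The accuracy figures are hypothetical and are not measurements of PocketSphinx or of any NLU engine.

```python
def end_to_end_accuracy(stt_accuracy, nlu_accuracy):
    """Probability that both stages succeed, assuming independent errors."""
    return stt_accuracy * nlu_accuracy


# Hypothetical figures for illustration only.
stt = 0.85  # fraction of utterances transcribed correctly
nlu = 0.90  # fraction of correct transcripts mapped to the right intent
print(f"{end_to_end_accuracy(stt, nlu):.3f}")
```

With these illustrative numbers the pipeline as a whole is right less often than either stage on its own, which is the effect a jointly optimized model tries to avoid.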
Kaldi is a speech recognition toolkit intended for use by speech recognition researchers. It is possible to train highly-accurate models using Kaldi and then optimize the implementation for running on ARM-based Android and iOS devices.
The main drawbacks of Kaldi are its steep learning curve and lack of production-ready code. In practice, a company seeking to use Kaldi needs to hire speech scientists and C++ programmers, and to invest significant upfront cost and time in gathering data, training the model, and optimizing the runtime engine for mobile execution. Additionally, Kaldi only offers STT capability and inference of intent from the transcribed text needs to be performed separately.
Mozilla DeepSpeech & Rasa
Mozilla DeepSpeech is an STT engine supported by Mozilla, and Rasa is an upcoming NLU framework. Both frameworks are widely used but are too heavyweight to run on mobile devices at the time of this writing.
Picovoice’s Speech-to-Intent engine is a bespoke, jointly optimized STT and NLU engine, tuned for the domain of interest. The result of this optimization is an end-to-end model that outperforms alternatives in accuracy and runtime efficiency and can work offline on any mobile device. Compared to Google’s Dialogflow, it achieves a 40% higher command acceptance rate while requiring less than 2MB of RAM and less than 1% CPU usage on most modern mobile devices.
Picovoice’s Speech-to-Intent engine is fully customizable for the domain of interest and, combined with the Picovoice wake-word engine, can provide a unique user experience. For example, using Picovoice technology it is possible to control an app using the command below:
Hey Strava, pause my workout.
With SiriKit, the same command would look like this:
Hey Siri, pause my workout in Strava.
The major limitation of Picovoice is that it only supports English at the time of this writing.
Speech Recognition API
The iOS Speech Recognition API is a free cloud-based STT service offered by Apple. Although free, it comes with a set of stringent usage rate limitations. Also, intent inference from transcribed text needs to be performed separately, which results in subpar performance—especially in the presence of domain-specific vocabulary.
SiriKit can be thought of as API access to some of Siri's functionality for third-party apps. It offers an end-to-end speech intent inference solution and is free.
SiriKit offers a limited number of domains such as music and fitness. Hence, if a third-party app doesn't fall into an existing domain, it likely cannot make use of SiriKit. Additionally, apps that can make use of existing domains cannot customize them based on their needs or brand. For example, the wake phrase is always “Hey Siri” and cannot be changed. At the time of this writing, Siri requires Internet connectivity.
The Android SpeechRecognizer transcribes text from a live stream of audio; it is free to use without any usage limitations. Offline capability is available on the latest Google Pixel phones and can be added manually by the end user to some older Android devices. Intent inference needs to happen separately, and hence the performance can be subpar, especially when there is domain-specific vocabulary.
Google Assistant Actions
Google Assistant Actions is a free service that performs end-to-end intent inference with good accuracy. Similar to SiriKit, it has a limited number of predefined domains: if an app does not belong to one of them, it cannot make use of Google Assistant Actions. This can result in a lousy user experience, as the user might be forced to leave the app. There is limited room for customization, as the wake phrase needs to be “OK Google”. Finally, there is no support for offline operation.
The table below summarizes the pros and cons of all of the approaches discussed in this article:
| | Cloud | PocketSphinx | Kaldi | Picovoice | iOS STT | SiriKit | Android STT | Google Actions |
|---|---|---|---|---|---|---|---|---|
| Cost of Integration | Low | Medium | High | Low | Low | Low | Low | Low |
| Custom Wake Word | No | Yes | Yes | Yes | No | No | No | No |
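
The two rows of the table can also be treated as data. A small sketch, with the values copied from the table and a hypothetical helper name, that lists the approaches offering both a low cost of integration and a custom wake word:

```python
# Values transcribed from the summary table in this article.
APPROACHES = {
    "Cloud":          {"integration_cost": "Low",    "custom_wake_word": False},
    "PocketSphinx":   {"integration_cost": "Medium", "custom_wake_word": True},
    "Kaldi":          {"integration_cost": "High",   "custom_wake_word": True},
    "Picovoice":      {"integration_cost": "Low",    "custom_wake_word": True},
    "iOS STT":        {"integration_cost": "Low",    "custom_wake_word": False},
    "SiriKit":        {"integration_cost": "Low",    "custom_wake_word": False},
    "Android STT":    {"integration_cost": "Low",    "custom_wake_word": False},
    "Google Actions": {"integration_cost": "Low",    "custom_wake_word": False},
}


def low_cost_with_custom_wake_word(approaches):
    """Return the names that are cheap to integrate and allow a custom wake word."""
    return [
        name
        for name, props in approaches.items()
        if props["integration_cost"] == "Low" and props["custom_wake_word"]
    ]


print(low_cost_with_custom_wake_word(APPROACHES))
```

The filter only covers the two criteria in the table; accuracy, offline support, and language coverage, discussed in the sections above, would need to be weighed separately.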