Offline Voice AI in a Web Browser

  • Microphone
  • NLU
  • Offline Voice AI
  • Privacy
  • Web Audio API
  • Web Workers
  • WebAssembly
July 09, 2020

Why offline when you're online?

Offline Voice AI in a browser may seem contradictory. Doesn’t using a web browser mean you’re online? But even when you’re connected to the Internet, running Voice AI offline, in-browser, eliminates the network round trip, and with it both the baseline latency and its variability. It also makes the speech data intrinsically private: it never needs to leave the device.

Offline AI also allows the voice functionality to work consistently in progressive web apps. And from a business perspective, it can let you determine cost based on customer seats instead of an unbounded number of calls to a cloud provider.

Demo Application: Smart Lighting

Here is a web demonstration application that uses Porcupine (wake word detection) and Rhino (Speech-to-Intent) engines to control smart lights in a home.

Instructions

Getting Started

Welcome to the smart home! Using your voice, you can control various aspects of the lights in this house.

To start any command, say "Pico House" or "Pico Home" to wake the device.

Turning The Lights On/Off

Here are some example phrases:

"Turn the lights in the Living Room on"

"Shut off the lights in the Bathroom"

"Switch on all lights"

Changing Light Color

Here are some example phrases:

"Turn the lights in the Kitchen Blue"

"Make the closet lights green"

"Switch all lights purple"

Possible Colors:
Red, Orange, Yellow, Green, Blue, Purple, Pink, White

Dimming/Brightening

Here are some example phrases:

"Turn the lights in the hallway up"

"Make the closet lights brighter"

"Dim all the lights"

Rooms:
Living Room, Bedroom, Closet, Bathroom, Pantry, Hallway, Kitchen

All of the speech processing is performed privately, offline, in-browser. You can see more web demos here.

How to embed Voice AI in the browser

To achieve offline voice AI, we can use speech technology from Picovoice. Picovoice engines run in modern web browsers via WebAssembly. A typical use case involves providing the engines realtime audio input from the microphone, and receiving events back (e.g. wake word detected).

The core of running any Picovoice engine in the browser is simply three files: a WebAssembly module, its wrapper, and a binding file. The binding provides a factory method for creating an instance of the engine. The engine fires a callback to signal that the WASM has finished loading, accepts frames of audio, and returns the engine results.
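Because the engine accepts audio in frames, while audio callbacks tend to deliver chunks of arbitrary length, some buffering is needed in between. The following is a minimal sketch of that step, not code from the Picovoice bindings: `FrameBuffer` is a hypothetical helper, and the frame length of 512 samples is an assumption (the actual value is defined by the engine you load).

```javascript
// Hypothetical helper (not part of the Picovoice bindings): collects
// arbitrary-length chunks of 16-bit samples and emits fixed-size frames.
class FrameBuffer {
  constructor(frameLength, onFrame) {
    this.frameLength = frameLength; // samples per frame; engine-specific
    this.onFrame = onFrame;         // called with each complete Int16Array frame
    this.frame = new Int16Array(frameLength);
    this.filled = 0;                // samples currently held in the partial frame
  }

  // `chunk` must be an Int16Array of any length.
  push(chunk) {
    let offset = 0;
    while (offset < chunk.length) {
      const count = Math.min(this.frameLength - this.filled, chunk.length - offset);
      this.frame.set(chunk.subarray(offset, offset + count), this.filled);
      this.filled += count;
      offset += count;
      if (this.filled === this.frameLength) {
        this.onFrame(this.frame.slice()); // hand over a copy; the buffer is reused
        this.filled = 0;
      }
    }
  }
}

// Usage: 512 is an assumed frame length, used here only for illustration.
const frames = [];
const buffer = new FrameBuffer(512, (frame) => frames.push(frame));
buffer.push(new Int16Array(700)); // emits one frame, keeps 188 samples
buffer.push(new Int16Array(400)); // 188 + 400 = 588: emits a second frame, keeps 76
```

In a real page, the `onFrame` callback is where each frame would be handed to the engine.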

We will expand on this foundation with additional steps to set up streaming audio, request microphone permission, and downsample audio to the required format. We’ve provided demo code to assist with those tasks.

Web Audio API and Microphone access

Picovoice uses the Web Audio API to process streaming audio. The MediaDevices API is used to access the microphone. The user must explicitly grant permission to access the microphone before any audio processing can occur; browsers’ security model makes this mandatory. Browsers also require a secure context for microphone access: the page must be served over HTTPS, although plain HTTP on localhost suffices for local testing.
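The permission step can be sketched with the standard `getUserMedia()` call. This is an illustrative wrapper rather than the demo's own code: `requestMicrophone` and `describeMicrophoneError` are hypothetical names, and the `mediaDevices` object is passed in as a parameter (normally `navigator.mediaDevices`) so the sketch does not assume it exists.

```javascript
// Maps the standard getUserMedia() failure names to a short explanation.
function describeMicrophoneError(error) {
  switch (error && error.name) {
    case "NotAllowedError":
      return "Microphone permission was denied.";
    case "NotFoundError":
      return "No microphone was found.";
    case "NotReadableError":
      return "The microphone could not be read; it may be in use elsewhere.";
    default:
      return "Microphone unavailable: " +
        (error && error.message ? error.message : "unknown error");
  }
}

// Hypothetical wrapper: resolves with a MediaStream once the user has
// granted access. `mediaDevices` is normally `navigator.mediaDevices`,
// which browsers leave undefined outside a secure context.
async function requestMicrophone(mediaDevices) {
  if (!mediaDevices || typeof mediaDevices.getUserMedia !== "function") {
    throw new Error("Microphone access needs a secure context (HTTPS or localhost).");
  }
  try {
    // Triggers the browser's permission prompt on first use.
    return await mediaDevices.getUserMedia({ audio: true });
  } catch (error) {
    throw new Error(describeMicrophoneError(error));
  }
}
```

In a page this would be called as `requestMicrophone(navigator.mediaDevices)`, with the resulting stream connected to the Web Audio graph and any rejection shown to the user.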

Web Voice Processor

Picovoice engines accept industry-standard 16 kHz audio for speech processing. Microphone audio in the browser typically arrives at a higher sample rate (commonly 44.1 kHz or 48 kHz), so we need to downsample it to this format. We also need to ask for permission from the user to access their microphone, and connect the input to the downsampler. To handle these tasks we have provided the Web Voice Processor package. It is also available on npm.
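To make the downsampling step concrete, here is a simplified sketch. It is not the Web Voice Processor's implementation: `downsampleToInt16` is a hypothetical function, the 16-bit signed output format is an assumption, and averaging each block of input samples is only a crude low-pass filter compared with a proper resampler.

```javascript
// Hypothetical sketch: converts mono Float32 samples in [-1, 1] at `inputRate`
// into 16-bit signed samples at `outputRate` by averaging each block of input
// samples that maps onto one output sample.
function downsampleToInt16(input, inputRate, outputRate = 16000) {
  if (outputRate > inputRate) {
    throw new RangeError("outputRate must not exceed inputRate");
  }
  const outputLength = Math.floor((input.length * outputRate) / inputRate);
  const output = new Int16Array(outputLength);
  for (let i = 0; i < outputLength; i++) {
    // Range of input samples covered by output sample i.
    const start = Math.floor((i * inputRate) / outputRate);
    const end = Math.min(Math.floor(((i + 1) * inputRate) / outputRate), input.length);
    let sum = 0;
    for (let j = start; j < end; j++) {
      sum += input[j];
    }
    const mean = end > start ? sum / (end - start) : 0;
    // Clamp, then scale to the asymmetric 16-bit range [-32768, 32767].
    const clamped = Math.max(-1, Math.min(1, mean));
    output[i] = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
  }
  return output;
}
```

For a 48 kHz microphone stream, `downsampleToInt16(chunk, 48000)` produces one output sample for every three input samples.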

Web Workers

Picovoice recommends using Web Workers to run the engines in the background. Workers separate the speech processing from the main JavaScript thread.

Callbacks and Workers

The demo code includes Porcupine and Rhino web worker scripts. They communicate with managers in the main thread using postMessage(). In addition, the Web Voice Processor performs downsampling via its own dedicated worker.
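The manager/worker exchange can be pictured as a small message protocol. The sketch below is an assumption-laden illustration, not the demo's actual scripts: the command and event names, the `createWorkerHandler` function, and the stand-in engine are all hypothetical. The handler is written as a plain function so the same logic can be exercised without a real worker.

```javascript
// Hypothetical message protocol (the demo's actual message names may differ):
//   main -> worker: { command: "init" } | { command: "process", frame } | { command: "release" }
//   worker -> main: { event: "ready" } | { event: "result", result }
function createWorkerHandler(createEngine, post) {
  let engine = null;
  return function handleMessage(message) {
    switch (message.command) {
      case "init":
        engine = createEngine();
        post({ event: "ready" });
        break;
      case "process":
        if (engine !== null) {
          const result = engine.process(message.frame);
          if (result !== null) {
            post({ event: "result", result: result });
          }
        }
        break;
      case "release":
        engine = null; // frames arriving after this are ignored
        break;
    }
  };
}

// Inside a real worker script the handler would be wired up as:
//   const handle = createWorkerHandler(createEngine, (m) => self.postMessage(m));
//   self.onmessage = (event) => handle(event.data);

// Stand-in engine for demonstration only: "detects" any frame whose first
// sample is non-zero.
const fakeEngine = () => ({
  process: (frame) => (frame[0] !== 0 ? "detected" : null),
});
const sent = [];
const handle = createWorkerHandler(fakeEngine, (m) => sent.push(m));
handle({ command: "init" });
handle({ command: "process", frame: new Int16Array([0, 0]) }); // no result posted
handle({ command: "process", frame: new Int16Array([1, 0]) }); // posts a result
```

On the main thread, the manager side would listen for these events with the worker's `onmessage` handler and forward results to the application's callbacks.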

Open Source Demo Applications

We have provided demo code for the Porcupine and Rhino engines (including running both together) that runs in a recent web browser with no transpilation required.

The Porcupine demo allows you to control a smart lamp with always-listening commands.

The Smart Lighting demo uses Rhino to handle smart home lighting with natural language commands like “dim the living room lights”. There’s also an always-listening version of this demo that uses Porcupine as a trigger to wake the smart home lighting, which then accepts the natural language commands.

Common Questions

How do I use Picovoice engines with React, Angular, Vue, …?

Fundamentally, the engines do not care which framework you are using. That said, we have an open source sample of how to use Picovoice with React in the browser with a companion tutorial that explains the details.

What about ES6 and modules? How do I use this with e.g. Webpack?

The JavaScript code provided is fairly conservative and avoids ES6 and newer language features, notably modules. (That said, a modern browser is still required for all of the aforementioned Audio, Worker, and WASM APIs.) You may wish to convert, for example, one of the Manager IIFEs to a class and add a module export, which would let you integrate it cleanly with a typical Webpack/Babel setup.

Alternatively, you can simply add the scripts as-is to the HTML document head using something like React Helmet, and provide them in a static/public folder with your web application.
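The first option, converting a Manager IIFE into a class, might look roughly like the sketch below. The shape of the manager shown here (`start`, `stop`, `handleResult`) is hypothetical and much smaller than the demo's actual managers; it only illustrates the mechanical change.

```javascript
// Before: a manager written as an IIFE that exposes a single global
// (hypothetical shape, for illustration only).
var LegacyManager = (function () {
  var listening = false;
  var callback = null;

  function start(onResult) {
    callback = onResult;
    listening = true;
  }
  function stop() {
    listening = false;
    callback = null;
  }
  function handleResult(result) {
    if (listening && callback) {
      callback(result);
    }
  }
  return { start: start, stop: stop, handleResult: handleResult };
})();

// After: the same behaviour as a class. The closure variables become
// instance fields, so several independent managers can coexist.
class Manager {
  constructor() {
    this.listening = false;
    this.callback = null;
  }
  start(onResult) {
    this.callback = onResult;
    this.listening = true;
  }
  stop() {
    this.listening = false;
    this.callback = null;
  }
  handleResult(result) {
    if (this.listening && this.callback) {
      this.callback(result);
    }
  }
}
```

In an ES module file, adding `export default Manager;` after the class lets a Webpack/Babel project use `import Manager from "./manager"`. The export line is left out of the block above so that it also runs as a plain script.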

Which Picovoice Engines run in the browser?

All of them.

Engine      Purpose
Porcupine   Wake word detection
Rhino       Speech-to-intent (NLU)
Cheetah     Speech-to-text (realtime; live feedback)
Leopard     Speech-to-text (file-based; accuracy boost)
Octopus     Speech-to-index (voice search)

How do I create custom wake words or speech-to-intent contexts for the web?

You can use the Picovoice Console to create wake words and design and train speech-to-intent contexts. Personal use accounts are free and enterprise accounts are available with a 30-day trial.