Real-Time Transcription in JavaScript

Real-time transcription is the process of converting spoken words into text immediately as they are spoken. While it is common to use cloud-based services for real-time transcription, there are also options available for running it on-device.

When using cloud-based real-time transcription, the audio is recorded and sent to a vendor server that houses the transcription engine. This server then transcribes the audio, and sends the transcription back to the client. This method can be susceptible to delays or interruptions in transcription due to network latency or connectivity issues. In contrast, on-device real-time transcription performs the transcription directly on a local device, eliminating these inherent latency and reliability challenges.

Picovoice's Cheetah Streaming Speech-to-Text is on-device software that performs speech-to-text locally. Because audio never leaves the device, your voice data remains private, which makes Cheetah GDPR- and HIPAA-compliant by design. Processing locally also removes the unpredictable network delays that affect cloud services, so the experience stays real-time.

Cheetah Streaming Speech-to-Text can run on Linux, macOS, Windows, Raspberry Pi, and NVIDIA Jetson.

In just a few minutes, you can start transcribing speech to text in real time using the Cheetah Streaming Speech-to-Text JavaScript SDK. Let's get started!

1. Project setup

Make sure Node.js is installed. Then create a new folder and initialize an npm project:

npm init -y

Next, install @picovoice/web-voice-processor and @picovoice/cheetah-web:

npm install @picovoice/web-voice-processor @picovoice/cheetah-web

Also install http-server as a development dependency, so we can view our project on localhost:

npm install http-server --save-dev

2. HTML

Create an index.html file with the following scripts:

<!DOCTYPE html>
<html>
  <head>
    <script src="node_modules/@picovoice/cheetah-web/dist/iife/index.js"></script>
    <script src="node_modules/@picovoice/web-voice-processor/dist/iife/index.js"></script>
  </head>
  <body>
  </body>
</html>

Add the following line to the scripts section of the project's package.json:

"start": "http-server -a localhost -p 5000"

You'll now be able to run the local server to load the page:

npm run start

You can see the page at http://localhost:5000. This will just look like a blank page for now.

3. Picovoice Console

Download the default model and put it in the project's root directory. If you're adding Cheetah to an existing project, put the model in the public (or equivalent) directory instead.

Instead of the default model, you can use Picovoice Console to create a custom model that adds custom vocabulary and/or boosts the probability of certain words.

4. Initialize Cheetah

In a <script> tag within the <body> of the HTML file, create an instance of CheetahWorker with your Picovoice AccessKey (available from Picovoice Console) and a transcriptCallback function.

<!--...-->
<body>
  <script type="application/javascript">
    let fullTranscript = "";
    function transcriptCallback(cheetahTranscript) {
      fullTranscript += cheetahTranscript.transcript;
      if (cheetahTranscript.isEndpoint) {
        fullTranscript += "\n";
      }
    }

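    // `await` is not allowed at the top level of a classic script,
    // so create the worker inside an async function.
    let cheetah = null;
    async function initCheetah() {
      cheetah = await CheetahWeb.CheetahWorker.create(
        "${ACCESS_KEY}",
        transcriptCallback,
        { publicPath: "${MODEL_RELATIVE_PATH}" }
      );
    }
    initCheetah();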
  </script>
</body>
<!--...-->

Once audio has been processed, Cheetah passes an object to transcriptCallback that contains a transcript string and an isEndpoint boolean:

transcript is the most recent portion of the transcription.
isEndpoint is a flag that is set to true when Cheetah detects a stretch of audio without speech (1 second by default) following an utterance.
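
To make the callback contract concrete, here is a small sketch that feeds a few hand-written result objects through the same accumulation logic. The sample objects and the createTranscriptAccumulator helper are made up for illustration; only the transcript and isEndpoint fields come from the description above.

```javascript
// Hypothetical helper: collects partial transcripts into finished lines.
function createTranscriptAccumulator() {
  const lines = [];  // utterances that have ended
  let current = "";  // text of the utterance still in progress

  return {
    // Same shape as transcriptCallback: receives { transcript, isEndpoint }.
    onTranscript(result) {
      current += result.transcript;
      if (result.isEndpoint) {
        lines.push(current.trim());
        current = "";
      }
    },
    getLines: () => lines.slice(),
    getCurrent: () => current,
  };
}

// Simulated callback payloads (made-up data, not real Cheetah output).
const accumulator = createTranscriptAccumulator();
accumulator.onTranscript({ transcript: "hello ", isEndpoint: false });
accumulator.onTranscript({ transcript: "world", isEndpoint: true });
accumulator.onTranscript({ transcript: "how are", isEndpoint: false });

console.log(accumulator.getLines());   // finished utterances
console.log(accumulator.getCurrent()); // utterance still being spoken
```

Keeping finished utterances separate from the one in progress makes it easier to render them differently, for example showing the in-progress text in a lighter color.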

5. Start Detecting Voice

In order to begin transcribing speech, we need to be able to access and pass audio to Cheetah. The Web Audio API and the MediaStream API are commonly used by developers to work with audio in web browsers. Although powerful, setup for the Web Audio and MediaStream APIs can be fairly complex. This is why we created Web Voice Processor - an open-source library that handles recording audio and passing it to Cheetah for you.
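
For a sense of the kind of work this involves, the sketch below converts floating-point samples, as the Web Audio API delivers them, into fixed-size frames of 16-bit PCM. It is an illustration only and not Web Voice Processor's actual implementation; the function names and the frame length of 512 samples are assumptions chosen for the example, and resampling to the engine's sample rate is left out.

```javascript
// Convert Web Audio style samples (floats in [-1, 1]) to 16-bit signed PCM.
function floatTo16BitPcm(floatSamples) {
  const pcm = new Int16Array(floatSamples.length);
  for (let i = 0; i < floatSamples.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, floatSamples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}

// Split PCM into complete frames; leftover samples are returned separately
// so they can be prepended to the next chunk of audio.
function splitIntoFrames(pcm, frameLength) {
  const frames = [];
  let offset = 0;
  while (offset + frameLength <= pcm.length) {
    frames.push(pcm.slice(offset, offset + frameLength));
    offset += frameLength;
  }
  return { frames, remainder: pcm.slice(offset) };
}

// Made-up input: 1200 samples of a sine wave.
const input = Float32Array.from({ length: 1200 }, (_, i) => Math.sin(i / 10));
const { frames, remainder } = splitIntoFrames(floatTo16BitPcm(input), 512);
console.log(frames.length, remainder.length);
```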

To start transcribing, subscribe cheetah to WebVoiceProcessor.

async function startCheetah() {
  await WebVoiceProcessor.WebVoiceProcessor.subscribe(cheetah)
}

To stop processing audio, unsubscribe cheetah.

async function stopCheetah() {
  await WebVoiceProcessor.WebVoiceProcessor.unsubscribe(cheetah)
}
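
Because subscribing and unsubscribing are asynchronous, a quick double click can call subscribe twice. One way to guard against that is a small state flag around the two calls. The sketch below uses a stand-in fakeProcessor object so it runs without a microphone; createListeningController is a hypothetical helper, and in the page you would pass WebVoiceProcessor.WebVoiceProcessor and cheetah instead.

```javascript
// Hypothetical helper: makes start/stop idempotent for one engine.
function createListeningController(processor, engine) {
  let listening = false;

  return {
    async start() {
      if (listening) return false; // already subscribed, nothing to do
      listening = true;            // set before awaiting to block re-entry
      try {
        await processor.subscribe(engine);
        return true;
      } catch (err) {
        listening = false;         // roll back so the user can retry
        throw err;
      }
    },
    async stop() {
      if (!listening) return false;
      await processor.unsubscribe(engine);
      listening = false;
      return true;
    },
    isListening: () => listening,
  };
}

// Stand-in for WebVoiceProcessor that only records the calls it receives.
const calls = [];
const fakeProcessor = {
  subscribe: async (engine) => { calls.push(["subscribe", engine]); },
  unsubscribe: async (engine) => { calls.push(["unsubscribe", engine]); },
};

const controller = createListeningController(fakeProcessor, "fake-engine");
const demo = (async () => {
  await Promise.all([controller.start(), controller.start()]); // double click
  await controller.stop();
  return calls.map(([name]) => name);
})();
demo.then((names) => console.log(names));
```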

6. Complete HTML

Add some HTML elements and app logic to see Cheetah in action. It might look something like this:

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Cheetah Real-time Speech-to-Text - Picovoice</title>
    <script src="node_modules/@picovoice/cheetah-web/dist/iife/index.js"></script>
    <script src="node_modules/@picovoice/web-voice-processor/dist/iife/index.js"></script>
  </head>
  <body>
    <div>Real-time Transcription: <span id="transcript"></span></div>
    <button id="start-cheetah">Start Cheetah</button>
    <button id="stop-cheetah">Stop Cheetah</button>
    <script type="application/javascript">
      const transcriptSpan = document.getElementById('transcript')
      const startCheetahButton = document.getElementById('start-cheetah')
      const stopCheetahButton = document.getElementById('stop-cheetah')
      startCheetahButton.addEventListener('click', startCheetah)
      stopCheetahButton.addEventListener('click', stopCheetah)
      
      let cheetah = null

      let fullTranscript = "";
      function transcriptCallback(cheetahTranscript) {
        fullTranscript += cheetahTranscript.transcript;
        if (cheetahTranscript.isEndpoint) {
          fullTranscript += "\n";
        }
        transcriptSpan.innerText = fullTranscript;
      }
      
      async function initCheetah() {
        cheetah = await CheetahWeb.CheetahWorker.create(
          "${ACCESS_KEY}",
          transcriptCallback,
          { publicPath: "${MODEL_RELATIVE_PATH}" }
        );
      }

      async function startCheetah() {
        startCheetahButton.innerText = "Loading…"
        if (!cheetah) {
          await initCheetah()
        }

        await WebVoiceProcessor.WebVoiceProcessor.subscribe(cheetah)
        startCheetahButton.innerText = "Listening…"
      }

      async function stopCheetah() {
        // Nothing to unsubscribe if Cheetah was never started.
        if (!cheetah) {
          return
        }
        await WebVoiceProcessor.WebVoiceProcessor.unsubscribe(cheetah)
        startCheetahButton.innerText = "Start Cheetah"
      }
    </script>
  </body>
</html>

Finally, go back to http://localhost:5000. Click "Start Cheetah" and speak into your microphone to see the live transcription!
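
One refinement worth considering: in the page above, two quick clicks on "Start Cheetah" can both see cheetah as null and start initialization twice. A common remedy is to cache the initialization promise. The sketch below shows the pattern with a stand-in fakeCreate function in place of CheetahWeb.CheetahWorker.create; createOnce is a hypothetical helper, not part of the Cheetah SDK.

```javascript
// Hypothetical helper: runs an async factory at most once and shares the result.
function createOnce(factory) {
  let pending = null;
  return function getInstance() {
    if (!pending) {
      pending = factory().catch((err) => {
        pending = null; // allow a retry after a failed initialization
        throw err;
      });
    }
    return pending;
  };
}

// Stand-in for the real engine factory; counts how often it is invoked.
let createCount = 0;
async function fakeCreate() {
  createCount += 1;
  return { id: createCount };
}

const getEngine = createOnce(fakeCreate);
const result = Promise.all([getEngine(), getEngine()]).then(([a, b]) => {
  console.log(a === b, createCount);
  return { same: a === b, createCount };
});
```

In the page, startCheetah could then obtain the engine with cheetah = await getEngine(), where getEngine wraps the real create call.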

Adding to Existing Project?

If you are working within an existing project that has a module bundler, you can use the import syntax instead:

import { CheetahWorker } from "@picovoice/cheetah-web"
import { WebVoiceProcessor } from "@picovoice/web-voice-processor"

// ...

// Change "CheetahWeb.CheetahWorker." to "CheetahWorker."
const cheetah = await CheetahWorker.create(
  "${ACCESS_KEY}",
  transcriptCallback,
  { publicPath: "${MODEL_RELATIVE_PATH}" }
)

// Change "WebVoiceProcessor.WebVoiceProcessor." to "WebVoiceProcessor."
await WebVoiceProcessor.subscribe(cheetah)
await WebVoiceProcessor.unsubscribe(cheetah)

For more information, check out the Cheetah Streaming Speech-to-Text product page or refer to the Cheetah Streaming Speech-to-Text JavaScript SDK quick start guide.