Real-time transcription is the process of converting audio into text as it is captured. While real-time transcription is commonly implemented as a cloud-based service, there are also options for running it on-device.

Cloud-based real-time transcription records audio and sends it to a vendor's server, where the transcription engine runs. With this method, network latency or connectivity issues can delay or disrupt the transcription. In contrast, on-device real-time transcription runs the engine directly on the device, so it is not subject to these latency and reliability limitations.

Picovoice's Cheetah Streaming Speech-to-Text is on-device software that transcribes speech to text locally. Because audio never leaves the device, your voice data remains private (i.e., it is GDPR- and HIPAA-compliant by design). Processing locally also removes the unpredictable network delays that can interrupt a real-time experience.

Cheetah Streaming Speech-to-Text can run on Linux, macOS, Windows, Raspberry Pi, and NVIDIA Jetson.

In just a few minutes, you can start transcribing speech to text in real time using the Cheetah Streaming Speech-to-Text Node.js SDK. Let's get started!

Project setup

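Create a new folder and initialize an npm project in it (for example, by running npm init -y inside the folder).

Next, install @picovoice/pvrecorder-node and @picovoice/cheetah-node (for example, with npm install @picovoice/pvrecorder-node @picovoice/cheetah-node).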

  • @picovoice/pvrecorder-node will be used to record microphone audio
  • @picovoice/cheetah-node will perform the speech-to-text transcription

Sign up for Picovoice Console

Log in to (or sign up for) Picovoice Console. It is free, and no credit card is required!

Copy your AccessKey from the main dashboard.

Initialize Cheetah

In a new index.js file, create an instance of Cheetah with your AccessKey:
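A minimal sketch of this step, assuming the SDK exports a Cheetah class whose constructor takes the AccessKey as its first argument. The createCheetah helper is a hypothetical name used for illustration and is not part of the SDK.

```javascript
// index.js
// createCheetah is a hypothetical helper, not part of the SDK.
function createCheetah(accessKey) {
  if (typeof accessKey !== "string" || accessKey.length === 0) {
    throw new Error("An AccessKey from Picovoice Console is required.");
  }
  // Required here so that a missing key fails fast, before the SDK is loaded.
  const { Cheetah } = require("@picovoice/cheetah-node");
  return new Cheetah(accessKey);
}
```

Calling createCheetah with the key copied from Picovoice Console should return an engine whose frameLength and sampleRate properties describe the audio it expects. When you are finished with it, call cheetah.release() to free its resources.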

Capture Microphone Audio

Next, we need to pass audio to Cheetah to be transcribed. This audio can be from a microphone or a stream you receive from another source, as long as the audio frames are of a specific frame length (specified by cheetah.frameLength) and the audio itself is recorded at the required sample rate (specified by cheetah.sampleRate).

In digital audio, an audio frame refers to a discrete unit of audio data that represents a brief moment in time. An audio frame consists of a number of samples, each of which is a numeric value that represents the amplitude of the sound waveform at a single point in time. The number of samples in each audio frame is referred to as its frame length.
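To make the relationship between samples, frame length, and sample rate concrete, here is a small sketch. The helper names are hypothetical, and the values 512 and 16000 are example numbers only; in a real app, read them from cheetah.frameLength and cheetah.sampleRate.

```javascript
// Duration of one frame in milliseconds: samples per frame / samples per second.
function frameDurationMs(frameLength, sampleRate) {
  return (frameLength * 1000) / sampleRate;
}

// Split 16-bit PCM samples into fixed-size frames.
// A trailing partial frame is dropped, since the engine expects full frames.
function splitIntoFrames(samples, frameLength) {
  const frames = [];
  for (let start = 0; start + frameLength <= samples.length; start += frameLength) {
    frames.push(samples.subarray(start, start + frameLength));
  }
  return frames;
}

// Example values (assumptions, not read from the SDK).
const exampleFrameLength = 512;
const exampleSampleRate = 16000;
const oneSecondOfAudio = new Int16Array(exampleSampleRate);

const exampleFrames = splitIntoFrames(oneSecondOfAudio, exampleFrameLength);
console.log(frameDurationMs(exampleFrameLength, exampleSampleRate)); // 32
console.log(exampleFrames.length); // 31
```

With these example values, each frame covers 32 ms of audio, and one second of audio yields 31 full frames, with the leftover 128 samples discarded.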

To record audio with the appropriate frame length, we can use PvRecorder - an audio recorder designed for real-time speech audio processing.

Create an instance of PvRecorder with cheetah.frameLength, and call pvRecorder.start().

To stop recording audio, call pvRecorder.stop().
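The start, read, and stop calls can be sketched as follows. This assumes a recent @picovoice/pvrecorder-node release, in which the constructor's first argument is the frame length (older releases used a different argument order, so check the version you installed). createRecorder and captureFrames are hypothetical helper names.

```javascript
// Hypothetical helper; assumes the constructor takes the frame length first.
function createRecorder(frameLength) {
  const { PvRecorder } = require("@picovoice/pvrecorder-node");
  return new PvRecorder(frameLength);
}

// Records `frameCount` frames from any recorder exposing start(), read(), stop().
async function captureFrames(recorder, frameCount) {
  const frames = [];
  recorder.start();
  try {
    for (let i = 0; i < frameCount; i++) {
      // Each read() resolves to one frame of audio samples.
      frames.push(await recorder.read());
    }
  } finally {
    // Always stop the microphone, even if read() throws.
    recorder.stop();
  }
  return frames;
}
```

For example, captureFrames(createRecorder(cheetah.frameLength), 100) would record 100 frames from the default microphone. Call pvRecorder.release() once the recorder is no longer needed.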

Transcribe Audio

Each call to pvRecorder.read() returns a single audioFrame that you can pass to cheetah.process(). For each frame, Cheetah returns a partialTranscript string and an isEndpoint boolean. When an endpoint is detected, call cheetah.flush() to retrieve any remaining transcribed text.

  • partialTranscript represents the most recent portion of the transcription - append this to previous values of partialTranscript to form the "full" transcription
  • isEndpoint is a flag that is set to true when Cheetah detects a stretch of audio without speech (1 second by default) following an utterance
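The process-and-flush logic described above can be sketched as a small function. It assumes cheetah.process() returns a [partialTranscript, isEndpoint] pair and cheetah.flush() returns a string; transcribeFrames is a hypothetical helper name.

```javascript
// Feeds frames to the engine one at a time and builds up the full transcript.
function transcribeFrames(cheetah, frames) {
  let transcript = "";
  for (const audioFrame of frames) {
    const [partialTranscript, isEndpoint] = cheetah.process(audioFrame);
    // Append each new piece of text to what has been transcribed so far.
    transcript += partialTranscript;
    if (isEndpoint) {
      // The speaker paused: collect any text the engine is still holding.
      transcript += cheetah.flush();
    }
  }
  return transcript;
}
```

Because the function only relies on process() and flush(), it works the same way whether the frames come from a microphone or from another audio source.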

Complete App

A complete working example might look something like this:
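The sketch below puts the previous steps together. It is not the SDK's official demo: it assumes the Cheetah and PvRecorder constructors described earlier, and it reads the AccessKey from a PICOVOICE_ACCESS_KEY environment variable, which is a name chosen for this example. The transcribe and main functions are likewise illustrative names.

```javascript
// index.js: stream microphone audio into Cheetah and print text as it arrives.

// Reads frames until isRunning() returns false, writing text via write().
async function transcribe(cheetah, recorder, isRunning, write) {
  recorder.start();
  try {
    while (isRunning()) {
      const audioFrame = await recorder.read();
      const [partialTranscript, isEndpoint] = cheetah.process(audioFrame);
      write(partialTranscript);
      if (isEndpoint) {
        // End of an utterance: print the remaining text and start a new line.
        write(cheetah.flush() + "\n");
      }
    }
  } finally {
    recorder.stop();
  }
}

function main() {
  // PICOVOICE_ACCESS_KEY is an assumed variable name for this example.
  const accessKey = process.env.PICOVOICE_ACCESS_KEY;
  if (!accessKey) {
    console.error("Set PICOVOICE_ACCESS_KEY to your AccessKey from Picovoice Console.");
    return;
  }

  const { Cheetah } = require("@picovoice/cheetah-node");
  const { PvRecorder } = require("@picovoice/pvrecorder-node");

  const cheetah = new Cheetah(accessKey);
  // Assumes the frame length is the constructor's first argument.
  const recorder = new PvRecorder(cheetah.frameLength);

  // Stop the loop on Ctrl+C so resources can be released cleanly.
  let running = true;
  process.on("SIGINT", () => { running = false; });

  transcribe(cheetah, recorder, () => running, (text) => process.stdout.write(text))
    .catch((error) => console.error(error))
    .finally(() => {
      recorder.release();
      cheetah.release();
    });
}

main();
```

Keeping the loop in its own function, separate from the code that creates the engine and recorder, makes it possible to exercise the loop with stand-in objects that have no microphone attached.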

Finally, run the file (for example, with node index.js) and speak into your microphone to see the live transcription in your terminal!


For more information, check out the Cheetah Streaming Speech-to-Text product page or refer to the Cheetah Streaming Speech-to-Text Node.js SDK quick start guide.

Have you seen our other Node.js tutorials? Don’t forget to check out Batch Transcription with Node.js, Speaker Recognition with Node.js, and Voice Activity Detection with Node.js.