Real-time transcription is the process of converting audio into text as it is captured. While real-time transcription is commonly implemented as a cloud-based service, there are also options for running it on-device.
Cloud-based real-time transcription records audio and sends it to a vendor's server, where the transcription engine runs. With this method, network latency or connectivity issues may delay or disrupt the transcription. In contrast, on-device real-time transcription performs the transcription directly on the device, eliminating these latency and reliability limitations.
Picovoice's Cheetah Streaming Speech-to-Text is on-device software that transcribes speech to text locally. Because audio never leaves the device, your voice data remains private (i.e. it is GDPR and HIPAA-compliant by design). Additionally, it guarantees a real-time experience by eliminating unpredictable network delays.
Cheetah Streaming Speech-to-Text can run on Linux, macOS, Windows, Raspberry Pi, and NVIDIA Jetson.
In just a few minutes, you can start transcribing speech to text in real time using the Cheetah Streaming Speech-to-Text Node.js SDK. Let's get started!
Project setup
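Create a new folder and initialize an npm project inside it, for example by running npm init -y.
Next, install @picovoice/pvrecorder-node and @picovoice/cheetah-node, for example by running npm install @picovoice/pvrecorder-node @picovoice/cheetah-node:
@picovoice/pvrecorder-node will be used to record microphone audio.
@picovoice/cheetah-node will perform the speech-to-text transcription.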
Sign up for Picovoice Console
Log in to (or sign up for) Picovoice Console. It is free, and no credit card is required!
Copy your AccessKey from the main dashboard.
Initialize Cheetah
In a new index.js file, create an instance of Cheetah with your AccessKey:
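As a minimal sketch of this step, the snippet below wraps the constructor call in a small helper. The helper name createCheetah and the PICOVOICE_ACCESS_KEY environment variable are assumptions made for illustration, not part of the SDK. The second parameter exists only so the helper can be tried without the native package installed.

```javascript
// Create a Cheetah instance from an AccessKey.
// `createCheetah` and PICOVOICE_ACCESS_KEY are hypothetical names.
// `CheetahClass` defaults to the SDK's Cheetah class; the package is only
// loaded when no replacement class is passed in.
function createCheetah(
  accessKey = process.env.PICOVOICE_ACCESS_KEY,
  CheetahClass = require("@picovoice/cheetah-node").Cheetah
) {
  if (typeof accessKey !== "string" || accessKey.trim() === "") {
    throw new Error("Missing AccessKey: copy it from Picovoice Console.");
  }
  return new CheetahClass(accessKey);
}

// Typical use in index.js (replace the placeholder with your own key):
//   const cheetah = createCheetah("${ACCESS_KEY}");
//   console.log(cheetah.frameLength, cheetah.sampleRate);
```

Reading the key from an environment variable keeps it out of source control, but passing it as a string literal works just as well for a quick experiment.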
Capture Microphone Audio
Next, we need to pass audio to Cheetah to be transcribed. This audio can come from a microphone or from a stream you receive from another source, as long as the audio frames have a specific frame length (specified by cheetah.frameLength) and the audio itself is recorded at the required sample rate (specified by cheetah.sampleRate).
In digital audio, an audio frame refers to a discrete unit of audio data that represents a brief moment in time. An audio frame consists of a number of samples, each of which is a numeric value that represents the amplitude of the sound waveform at a single point in time. The number of samples in each audio frame is referred to as its frame length.
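To make these terms concrete, here is a small sketch that splits a buffer of samples into frames and computes how much time one frame covers. The values 512 and 16000 are assumptions chosen for illustration; in a real app, read them from cheetah.frameLength and cheetah.sampleRate. The helper names are hypothetical.

```javascript
// Assumed values for illustration only.
const FRAME_LENGTH = 512; // samples per frame (assumption)
const SAMPLE_RATE = 16000; // samples per second (assumption)

// Split a buffer of 16-bit samples into complete frames of `frameLength`
// samples. Trailing samples that do not fill a whole frame are left out.
function splitIntoFrames(samples, frameLength) {
  const frames = [];
  for (let start = 0; start + frameLength <= samples.length; start += frameLength) {
    frames.push(samples.subarray(start, start + frameLength));
  }
  return frames;
}

// Duration of one frame in milliseconds.
function frameDurationMs(frameLength, sampleRate) {
  return (frameLength * 1000) / sampleRate;
}

const oneSecond = new Int16Array(SAMPLE_RATE); // one second of silence
const frames = splitIntoFrames(oneSecond, FRAME_LENGTH);
console.log(frames.length); // 31 complete frames (16000 / 512 = 31.25)
console.log(frameDurationMs(FRAME_LENGTH, SAMPLE_RATE)); // 32
```

With these assumed values, each frame represents 32 milliseconds of audio, which is why frame-by-frame processing feels instantaneous to the speaker.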
To record audio with the appropriate frame length, we can use PvRecorder, an audio recorder designed for real-time speech audio processing.
Create an instance of PvRecorder with cheetah.frameLength, and call pvRecorder.start().
To stop recording audio, call pvRecorder.stop().
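A sketch of starting and stopping the recorder could look like the following. It assumes a version of @picovoice/pvrecorder-node whose constructor takes the frame length as its first argument and uses the default audio device when no device index is given; check the signature of the version you installed. The helper names startRecorder and stopRecorder are hypothetical, and the second parameter exists only so the sketch can be tried without a microphone.

```javascript
// Create a recorder that yields frames of `frameLength` samples and start it.
// `RecorderClass` defaults to the SDK's PvRecorder; the package is only
// loaded when no replacement class is passed in.
function startRecorder(
  frameLength,
  RecorderClass = require("@picovoice/pvrecorder-node").PvRecorder
) {
  const recorder = new RecorderClass(frameLength); // default audio device
  recorder.start();
  return recorder;
}

// Stop capturing audio and free the recorder's native resources.
function stopRecorder(recorder) {
  recorder.stop();
  recorder.release();
}

// Typical use in index.js:
//   const pvRecorder = startRecorder(cheetah.frameLength);
//   ...read frames...
//   stopRecorder(pvRecorder);
```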
Transcribe Audio
Each call to pvRecorder.read() returns a single audioFrame that you can then pass to cheetah for processing. Once processed, cheetah returns a partialTranscript string and an isEndpoint boolean. When an endpoint is detected, call cheetah.flush() to return any remaining transcribed text.
partialTranscript represents the most recent portion of the transcription. Append it to the previous values of partialTranscript to form the "full" transcription.
isEndpoint is a flag that is set to true when cheetah detects a chunk of audio (1s by default) after an utterance without any speech in it.
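The per-frame logic described above can be sketched as a small function that takes the transcript so far and returns the updated one. The name handleFrame is hypothetical. The sketch assumes that process() returns a [partialTranscript, isEndpoint] pair and that flush() returns a string.

```javascript
// Process one audio frame and return the updated transcript.
// `engine` is a Cheetah instance (or any object with the same
// process()/flush() methods).
function handleFrame(engine, audioFrame, transcript) {
  const [partialTranscript, isEndpoint] = engine.process(audioFrame);
  let updated = transcript + partialTranscript;
  if (isEndpoint) {
    // The speaker paused: collect any text still buffered by the engine.
    updated += engine.flush();
  }
  return { transcript: updated, isEndpoint };
}

// Typical use inside a recording loop:
//   let transcript = "";
//   const audioFrame = await pvRecorder.read();
//   ({ transcript } = handleFrame(cheetah, audioFrame, transcript));
```

Returning isEndpoint alongside the text lets the caller decide what an endpoint means for the app, such as starting a new line or sending the finished utterance elsewhere.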
Complete App
A complete working example might look something like this:
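The sketch below puts the pieces together under a few assumptions: the AccessKey is read from a hypothetical PICOVOICE_ACCESS_KEY environment variable, the PvRecorder constructor takes the frame length as its first argument, and pressing Ctrl+C ends the session. The function names transcribeLoop and main are illustrative choices, not SDK names.

```javascript
// index.js: live transcription from the default microphone (a sketch).

// Read frames from `recorder`, feed them to `engine`, and pass each piece of
// text to `write` until `isStopped()` returns true.
async function transcribeLoop(engine, recorder, write, isStopped) {
  while (!isStopped()) {
    const audioFrame = await recorder.read();
    const [partialTranscript, isEndpoint] = engine.process(audioFrame);
    write(partialTranscript);
    if (isEndpoint) {
      write(engine.flush() + "\n"); // finish the utterance on its own line
    }
  }
}

async function main(accessKey) {
  // Loaded here so the loop above can be exercised without the native packages.
  const { Cheetah } = require("@picovoice/cheetah-node");
  const { PvRecorder } = require("@picovoice/pvrecorder-node");

  const cheetah = new Cheetah(accessKey);
  const recorder = new PvRecorder(cheetah.frameLength);

  let stopped = false;
  process.on("SIGINT", () => {
    stopped = true; // Ctrl+C ends the loop after the current frame
  });

  recorder.start();
  console.log("Listening... press Ctrl+C to stop.");
  try {
    await transcribeLoop(
      cheetah,
      recorder,
      (text) => process.stdout.write(text),
      () => stopped
    );
    process.stdout.write(cheetah.flush() + "\n"); // text still buffered at exit
  } finally {
    recorder.stop();
    recorder.release();
    cheetah.release();
  }
}

const accessKey = process.env.PICOVOICE_ACCESS_KEY; // hypothetical variable name
if (accessKey) {
  main(accessKey).catch((err) => {
    console.error(err);
    process.exitCode = 1;
  });
} else {
  console.log("Set PICOVOICE_ACCESS_KEY to your AccessKey, then run: node index.js");
}
```

Keeping the loop separate from the SDK setup is a design choice for this sketch: it allows the transcription logic to be tried with stand-in objects before a microphone or AccessKey is available.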
Finally, run the file and speak into your microphone to see the live transcription in your terminal!
For more information, check out the Cheetah Streaming Speech-to-Text product page or refer to the Cheetah Streaming Speech-to-Text Node.js SDK quick start guide.
Have you seen our other Node.js tutorials? Don’t forget to check out Batch Transcription with Node.js, Speaker Recognition with Node.js, and Voice Activity Detection with Node.js.