Real-time transcription is the process of converting audio into text as it is captured. While real-time transcription is commonly implemented as a cloud-based service, there are also options for running it on-device.
Cloud-based real-time transcription records audio and sends it to a vendor's server, where the transcription engine runs. With this method, network latency or connectivity issues can delay or disrupt the transcription. In contrast, on-device real-time transcription performs the transcription directly on the device, eliminating these inherent latency and reliability limitations.
Cheetah Streaming Speech-to-Text is on-device software that transcribes speech to text locally. It ensures your voice data remains private (i.e., it is HIPAA-compliant by design). Additionally, it guarantees a real-time experience by eliminating unpredictable network delays.
Cheetah Streaming Speech-to-Text can run on a range of platforms, including Raspberry Pi.
In just a few minutes, you can start transcribing speech to text in real time using the Cheetah Streaming Speech-to-Text Node.js SDK. Let's get started!
Create a new folder, initialize an npm project in it (for example, by running npm init), and use npm install to add the following packages:
@picovoice/pvrecorder-node will be used to record microphone audio
@picovoice/cheetah-node will perform the speech-to-text transcription
Sign up for Picovoice Console
Log in to (or sign up for) Picovoice Console. It is free, and no credit card is required!
Copy your AccessKey from the main dashboard.
In a new index.js file, create an instance of Cheetah with your AccessKey.
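As a minimal sketch of this step, the snippet below wraps the constructor call in a small helper. It assumes that @picovoice/cheetah-node exports a Cheetah class whose constructor takes the AccessKey string as its first argument. The createCheetah helper and the PICOVOICE_ACCESS_KEY environment variable are hypothetical names used only for illustration.

```javascript
// Sketch: create a Cheetah engine from an AccessKey.
// Assumption: `@picovoice/cheetah-node` exports a `Cheetah` class whose
// constructor accepts the AccessKey string as its first argument.
// `createCheetah` is a hypothetical helper, not part of the SDK.
function createCheetah(accessKey, sdk) {
  if (typeof accessKey !== "string" || accessKey.trim() === "") {
    throw new Error("An AccessKey from Picovoice Console is required");
  }
  // Load the SDK lazily so the check above runs before the package is needed.
  const { Cheetah } = sdk ?? require("@picovoice/cheetah-node");
  return new Cheetah(accessKey);
}

// In index.js, one option is to read the key from an environment variable
// (the variable name is an assumption):
// const cheetah = createCheetah(process.env.PICOVOICE_ACCESS_KEY);
```

Keeping the key out of the source file is a design choice, not a requirement; passing a string literal to the constructor should work just as well for a quick test.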
Capture Microphone Audio
Next, we need to pass audio to Cheetah to be transcribed. This audio can come from a microphone or from a stream you receive from another source, as long as the audio frames have a specific frame length (specified by cheetah.frameLength) and the audio itself is recorded at the required sample rate (specified by cheetah.sampleRate).
In digital audio, an audio frame is a discrete unit of audio data that represents a brief moment in time. An audio frame consists of a number of samples, each of which is a numeric value representing the amplitude of the sound waveform at a single point in time. The number of samples in each audio frame is referred to as its frame length.
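To make the relationship between samples, frames, and frame length concrete, here is a small self-contained sketch that cuts a buffer of 16-bit samples into fixed-size frames and computes how much time one frame covers. The values 512 and 16000 are illustrative assumptions; in a real program, read them from cheetah.frameLength and cheetah.sampleRate. The helper names splitIntoFrames and frameDurationMs are hypothetical.

```javascript
// Split a buffer of 16-bit PCM samples into fixed-size audio frames.
// Any trailing samples that do not fill a whole frame are dropped.
function splitIntoFrames(samples, frameLength) {
  if (!Number.isInteger(frameLength) || frameLength <= 0) {
    throw new RangeError("frameLength must be a positive integer");
  }
  const frames = [];
  for (let start = 0; start + frameLength <= samples.length; start += frameLength) {
    // subarray() creates a view on the same memory, so no samples are copied.
    frames.push(samples.subarray(start, start + frameLength));
  }
  return frames;
}

// Time covered by one frame, in milliseconds.
function frameDurationMs(frameLength, sampleRate) {
  return (frameLength * 1000) / sampleRate;
}

// Illustrative values only (assumptions, not read from the SDK).
const frameLength = 512;
const sampleRate = 16000;
const oneSecondOfSilence = new Int16Array(sampleRate);

const frames = splitIntoFrames(oneSecondOfSilence, frameLength);
console.log(frames.length, frameDurationMs(frameLength, sampleRate)); // prints "31 32"
```

With these example numbers, one second of audio yields 31 complete frames of 32 ms each, which is why a streaming engine can emit text with very little delay.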
To record audio with the appropriate frame length, we can use PvRecorder, an audio recorder designed for real-time speech audio processing.
Create an instance of PvRecorder with a frame length of cheetah.frameLength, and call pvRecorder.start() to begin recording.
To stop recording audio, call pvRecorder.stop().
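The start/read/stop lifecycle can be sketched as follows. The recordFrames helper is hypothetical; it only assumes that the recorder object offers start(), stop(), and a read() that resolves to one audio frame. The PvRecorder constructor call shown in the comments is an assumption as well, so check the signature for the package version you have installed.

```javascript
// Sketch: record a fixed number of frames and always stop the recorder,
// even if reading fails part-way through.
// Assumption: `recorder` behaves like a PvRecorder instance, with start(),
// stop(), and a read() that resolves to one audio frame.
async function recordFrames(recorder, frameCount) {
  const frames = [];
  recorder.start();
  try {
    while (frames.length < frameCount) {
      frames.push(await recorder.read());
    }
  } finally {
    recorder.stop();
  }
  return frames;
}

// Possible wiring with the real package (constructor arguments are an
// assumption; verify them for your installed version):
// const { PvRecorder } = require("@picovoice/pvrecorder-node");
// const pvRecorder = new PvRecorder(cheetah.frameLength);
// recordFrames(pvRecorder, 100).then((frames) => console.log(frames.length));
```

Putting stop() in a finally block is a small design choice that keeps the microphone from staying open when an error interrupts the loop.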
Each call to pvRecorder.read() returns a single audioFrame that you can then pass to cheetah for processing. Once the frame is processed, cheetah returns a partialTranscript string and an isEndpoint boolean. When an endpoint is detected, call cheetah.flush() to return any remaining transcribed text.
partialTranscript represents the most recent portion of the transcription. Append it to previous values of partialTranscript to form the "full" transcription.
isEndpoint is a flag that is set to true when cheetah detects a chunk of audio (1s by default) after an utterance without any speech in it.
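The accumulate-and-flush logic described above can be sketched as a small function. It assumes that cheetah.process(audioFrame) returns the pair [partialTranscript, isEndpoint] as an array and that cheetah.flush() returns a string; transcribeFrames is a hypothetical helper name, and the newline added after each endpoint is just one way to separate utterances.

```javascript
// Sketch: feed audio frames to the engine and build up the full transcript.
// Assumptions: cheetah.process(frame) returns [partialTranscript, isEndpoint]
// and cheetah.flush() returns the remaining text as a string.
function transcribeFrames(cheetah, audioFrames) {
  let transcript = "";
  for (const audioFrame of audioFrames) {
    const [partialTranscript, isEndpoint] = cheetah.process(audioFrame);
    // Each partial result is only the newest portion, so append it.
    transcript += partialTranscript;
    if (isEndpoint) {
      // The speaker paused: collect whatever text is still buffered
      // and start a new line for the next utterance.
      transcript += cheetah.flush() + "\n";
    }
  }
  return transcript;
}
```

Because the function only relies on process() and flush(), it can be exercised with a stub object before a microphone or an AccessKey is involved.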
A complete working example might look something like this:
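The sketch below is one hedged way to put the pieces together; it is not the SDK's official demo. It assumes new Cheetah(accessKey), a process() call that returns [partialTranscript, isEndpoint], flush() and release() on the engine, and new PvRecorder(frameLength) with start(), read(), stop(), and release() on the recorder. The runTranscription function, its shouldContinue and write parameters, and the PICOVOICE_ACCESS_KEY environment variable are hypothetical names.

```javascript
// index.js: hedged end-to-end sketch (verify the SDK calls against the
// package versions you install; they are assumptions here).
async function runTranscription(cheetah, pvRecorder, shouldContinue, write) {
  pvRecorder.start();
  try {
    while (shouldContinue()) {
      const audioFrame = await pvRecorder.read();
      const [partialTranscript, isEndpoint] = cheetah.process(audioFrame);
      write(partialTranscript);
      if (isEndpoint) {
        write(cheetah.flush() + "\n");
      }
    }
    // Recording was interrupted: print any text that is still buffered.
    write(cheetah.flush() + "\n");
  } finally {
    // Free the microphone and native resources in every case.
    pvRecorder.stop();
    pvRecorder.release();
    cheetah.release();
  }
}

function main() {
  const { Cheetah } = require("@picovoice/cheetah-node");
  const { PvRecorder } = require("@picovoice/pvrecorder-node");

  const cheetah = new Cheetah(process.env.PICOVOICE_ACCESS_KEY);
  const pvRecorder = new PvRecorder(cheetah.frameLength);

  // Stop cleanly on Ctrl+C instead of killing the process mid-frame.
  let running = true;
  process.on("SIGINT", () => { running = false; });

  return runTranscription(
    cheetah,
    pvRecorder,
    () => running,
    (text) => process.stdout.write(text)
  );
}

// Only start recording when an AccessKey has been provided.
if (process.env.PICOVOICE_ACCESS_KEY) {
  main().catch((err) => {
    console.error(err);
    process.exitCode = 1;
  });
}
```

Passing the engine and recorder into runTranscription keeps the loop independent of the hardware, so the same function can be tried with stub objects first.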
Finally, run the file and speak into your microphone to see the live transcription in your terminal!
For more information, check out the Cheetah Streaming Speech-to-Text product page or refer to the Cheetah Streaming Speech-to-Text Node.js SDK quick start guide.