Speech-to-Text using Node.js

Learn how to transcribe speech to text using Picovoice Leopard Speech-to-Text Node.js SDK. The SDK runs on Linux, macOS, Windows, Raspberry Pi, and NVIDIA Jetson.

Speech-to-text (STT), automatic speech recognition (ASR), automatic transcription, and large-vocabulary speech recognition all refer to the same task: converting spoken audio into text. If you are looking for any of these in Node.js, this tutorial covers it.

Install Speech-to-Text Node.js SDK

Create a project and install the SDK:

npm install @picovoice/leopard-node

Sign up for or log in to Picovoice Console. It is free, and no credit card is required. Copy your AccessKey to the clipboard.

Implement transcription in JavaScript

Create an instance of Leopard with your AccessKey:

const {Leopard} = require("@picovoice/leopard-node");

// accessKey is the AccessKey string copied from Picovoice Console
const handle = new Leopard(accessKey);

Transcribe an audio file. The Leopard ASR engine supports a wide range of audio formats, including FLAC, MP3, MP4, M4A, Ogg, WAV, and WebM.

const result = handle.processFile(audioPath);
console.log(result.transcript);
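
The engine holds native resources, so a longer-running program may want to free them once transcription finishes. The following is a minimal sketch, not the SDK's prescribed pattern. It assumes the SDK exposes a `release()` method on the instance and that the AccessKey is stored in an environment variable named `PICOVOICE_ACCESS_KEY` (that name is our choice, not something the SDK requires). The `transcribeFile` helper is hypothetical; it takes the `Leopard` class as an argument so it can be exercised with a stand-in class.

```javascript
// Hypothetical helper: transcribe one file and always free the engine afterwards.
// `LeopardClass` is expected to be the `Leopard` export of @picovoice/leopard-node.
function transcribeFile(LeopardClass, accessKey, audioPath, options = {}) {
  if (!accessKey) {
    throw new Error("Missing AccessKey");
  }
  const handle = new LeopardClass(accessKey, options);
  try {
    // Returns an object with `transcript` and `words`.
    return handle.processFile(audioPath);
  } finally {
    // Assumption: the SDK exposes release() for freeing native resources.
    handle.release();
  }
}

// Usage (requires the SDK and a real AccessKey, hence commented out):
// const { Leopard } = require("@picovoice/leopard-node");
// const result = transcribeFile(Leopard, process.env.PICOVOICE_ACCESS_KEY, "audio.wav");
// console.log(result.transcript);
```

The `try`/`finally` block ensures the engine is released even when `processFile` throws, for example on an unreadable file.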

Explore ASR Features

Leopard provides more than just the transcript. It offers:

Custom Vocabulary
Keyword Boosting
Word Timestamps
Word-Level Confidence
Truecasing
Automatic Punctuation

Custom Vocabulary & Keyword Boosting

ASR engines recognize most common words in a language. When you transcribe speech from a specialized domain (e.g. technical, medical, legal, or sales), some words will fall outside the engine's vocabulary. These are called Out-Of-Vocabulary (OOV) words. Leopard addresses this by letting developers add OOV words and create custom models using Picovoice Console.

Additionally, you sometimes know in advance that certain words are likely to occur often. You can improve accuracy by telling the engine about these expected keywords and boosting its sensitivity towards them. Picovoice Console also enables you to do this.

Learn more by checking out Picovoice Console STT documentation.
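
Once a custom model has been trained and downloaded from Picovoice Console, it has to be handed to the engine. The sketch below assumes the Node.js SDK accepts a `modelPath` option alongside `enableAutomaticPunctuation`; check the SDK documentation for the exact option names. `buildLeopardOptions` and the file name in the usage comment are hypothetical.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: assemble the options object for `new Leopard(accessKey, options)`.
function buildLeopardOptions(customModelPath, enableAutomaticPunctuation = false) {
  const options = { enableAutomaticPunctuation };
  if (customModelPath) {
    const resolved = path.resolve(customModelPath);
    // Fail early with a readable message instead of an engine-level error.
    if (!fs.existsSync(resolved)) {
      throw new Error(`Custom model not found: ${resolved}`);
    }
    // Assumption: `modelPath` is the option the SDK reads for a custom model file.
    options.modelPath = resolved;
  }
  return options;
}

// Usage (requires the SDK, an AccessKey, and a model file, hence commented out):
// const { Leopard } = require("@picovoice/leopard-node");
// const handle = new Leopard(accessKey, buildLeopardOptions("medical-terms.pv", true));
```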

Word Timestamps & Confidence

Word timestamps are essential for creating subtitles and for searching within audio. Word confidence identifies the portions of the transcription that the ASR engine is unsure of. This confidence information is useful for manual correction and as an additional feature for downstream NLU or NLP tasks (e.g. intent inference or sentiment analysis).
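
As an illustration of the manual-correction use case, the sketch below marks uncertain words for review. It assumes each entry of `result.words` carries `word` and `confidence` fields, as in the sample output in this section. The helper names and the 0.7 threshold are arbitrary choices for the example, not SDK features.

```javascript
// Return the words whose confidence falls below `threshold`.
function lowConfidenceWords(words, threshold = 0.7) {
  return words.filter((w) => w.confidence < threshold);
}

// Render the transcript with uncertain words wrapped in [brackets] for review.
function markUncertain(words, threshold = 0.7) {
  return words
    .map((w) => (w.confidence < threshold ? `[${w.word}]` : w.word))
    .join(" ");
}

// Sample words in the shape of `result.words`.
const sampleWords = [
  { word: "noodle", startSec: 2.78, endSec: 3.10, confidence: 0.66 },
  { word: "soup", startSec: 3.23, endSec: 3.51, confidence: 0.98 },
  { word: "with", startSec: 4.12, endSec: 4.28, confidence: 0.97 },
  { word: "bread", startSec: 4.57, endSec: 4.89, confidence: 0.92 },
];

console.log(markUncertain(sampleWords)); // prints "[noodle] soup with bread"
```

A suitable threshold depends on the audio and the application, so it is worth tuning on real recordings.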

Inspect the word timestamps and confidence:

const result = handle.processFile(audioPath);
console.log(result.words);

The output depends on the input audio. For our sample input, below is a snippet:

[
  ...
  {
    word: 'noodle',
    startSec: 2.78,
    endSec: 3.10,
    confidence: 0.66
  },
  {
    word: 'soup',
    startSec: 3.23,
    endSec: 3.51,
    confidence: 0.98
  },
  {
    word: 'with',
    startSec: 4.12,
    endSec: 4.28,
    confidence: 0.97
  },
  {
    word: 'bread',
    startSec: 4.57,
    endSec: 4.89,
    confidence: 0.92
  },
  ...
]
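
To show how these timestamps can feed subtitle generation, here is a minimal sketch that groups words into SubRip (SRT) cues. The grouping rule (start a new cue after a pause longer than 0.5 seconds, or after 7 words) and the helper names `srtTime` and `wordsToSrt` are assumptions made for the example. Leopard itself only supplies the `words` array.

```javascript
// Format seconds as an SRT timestamp, HH:MM:SS,mmm.
function srtTime(sec) {
  const ms = Math.round(sec * 1000);
  const pad = (n, width = 2) => String(n).padStart(width, "0");
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Group words into cues: start a new cue after a pause longer than `maxGapSec`
// or once the current cue already holds `maxWords` words.
function wordsToSrt(words, maxGapSec = 0.5, maxWords = 7) {
  const cues = [];
  let current = [];
  for (const w of words) {
    const prev = current[current.length - 1];
    if (prev && (w.startSec - prev.endSec > maxGapSec || current.length >= maxWords)) {
      cues.push(current);
      current = [];
    }
    current.push(w);
  }
  if (current.length > 0) cues.push(current);

  return cues
    .map((cue, i) => {
      const start = srtTime(cue[0].startSec);
      const end = srtTime(cue[cue.length - 1].endSec);
      return `${i + 1}\n${start} --> ${end}\n${cue.map((w) => w.word).join(" ")}\n`;
    })
    .join("\n");
}

// The four words shown in the snippet above.
const snippetWords = [
  { word: "noodle", startSec: 2.78, endSec: 3.10, confidence: 0.66 },
  { word: "soup", startSec: 3.23, endSec: 3.51, confidence: 0.98 },
  { word: "with", startSec: 4.12, endSec: 4.28, confidence: 0.97 },
  { word: "bread", startSec: 4.57, endSec: 4.89, confidence: 0.92 },
];

console.log(wordsToSrt(snippetWords));
```

With the default settings, the 0.61-second pause between 'soup' and 'with' splits the four words into two cues: "noodle soup" and "with bread".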

Truecasing & Automatic Punctuation

Truecasing and Automatic Punctuation help with the readability of the transcription. Create an instance of Leopard with Truecasing and Automatic Punctuation:

const handle = new Leopard(accessKey, {enableAutomaticPunctuation: true});

Have you seen our other Node.js tutorials? Don’t forget to check out Real-time Transcription with Node.js, Speaker Recognition with Node.js, and Voice Activity Detection with Node.js.