Learn how to transcribe speech to text using the Picovoice Leopard Speech-to-Text Node.js SDK. The SDK runs on Linux, macOS, Windows, Raspberry Pi, and NVIDIA Jetson.

Speech-to-text (STT), automatic speech recognition (ASR), automatic transcription, and large-vocabulary speech recognition all refer to the same technology. If you are looking for any of these in Node.js, this is it!

Install Speech-to-Text Node.js SDK

Create a project and install the SDK:
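A minimal sketch of this step, assuming the SDK is published on npm as @picovoice/leopard-node. The directory name leopard-demo is a hypothetical placeholder:

```shell
# Create an empty Node.js project (directory name is a placeholder)
mkdir leopard-demo && cd leopard-demo
npm init -y

# Assumption: the Leopard Node.js SDK package name on npm
npm install @picovoice/leopard-node
```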

Sign up for Picovoice Console

Log in to (or sign up for) Picovoice Console. It is free, and no credit card is required! Copy your AccessKey to the clipboard.

Implement transcription in JavaScript

Create an instance of Leopard with your AccessKey:
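A minimal sketch of creating the instance, assuming the SDK exports a Leopard class from @picovoice/leopard-node. The environment variable PICOVOICE_ACCESS_KEY and the helper name createLeopard are hypothetical, used here only to keep the key out of source code:

```javascript
// Assumption: the AccessKey is read from a hypothetical environment variable
// named PICOVOICE_ACCESS_KEY so that it stays out of source control.
function createLeopard(accessKey = process.env.PICOVOICE_ACCESS_KEY) {
  if (!accessKey) {
    throw new Error("Missing AccessKey: copy it from Picovoice Console");
  }
  // Loaded here so the missing-key error surfaces before the SDK is touched.
  const { Leopard } = require("@picovoice/leopard-node");
  return new Leopard(accessKey);
}
```

Call createLeopard() once and reuse the returned instance. When you no longer need it, calling its release() method should free the native resources.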

Transcribe an audio file. The Leopard ASR engine supports a wide range of audio formats, including FLAC, MP3, MP4, m4a, Ogg, WAV, and WebM.
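A minimal sketch of file transcription, assuming the instance exposes a processFile method that returns an object with transcript and words fields. The helper name transcribeFile is hypothetical:

```javascript
// `leopard` is a Leopard instance created with your AccessKey;
// `audioPath` is the path to a local audio file.
function transcribeFile(leopard, audioPath) {
  // Assumption: processFile returns the transcript plus per-word metadata.
  const { transcript, words } = leopard.processFile(audioPath);
  return { transcript, words };
}
```

For example, `console.log(transcribeFile(leopard, "path/to/audio.wav").transcript)` would print the transcript, where the path is a placeholder for your own file.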

Explore ASR Features

Leopard provides more than just the transcript. It offers:

  • Custom Vocabulary
  • Keyword Boosting
  • Word Timestamps
  • Word-Level Confidence
  • Truecasing
  • Automatic Punctuation

Custom Vocabulary & Keyword Boosting

ASR engines recognize most common words in a language. If you are transcribing within a specialized domain (e.g. technical, medical, legal, or sales), some words will not be recognized by the engine. These are called Out-Of-Vocabulary (OOV) words. Leopard overcomes this by letting developers add OOV words and create custom models using Picovoice Console.

Additionally, you sometimes know in advance that certain words are likely to occur often. You can improve accuracy by telling Leopard about these expected keywords and boosting the engine's sensitivity towards them. Picovoice Console also enables you to do this.

Learn more by checking out the Picovoice Console STT documentation.
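Once a custom model has been trained in Picovoice Console, it needs to be loaded by the SDK. The sketch below assumes that the Console exports the model as a file and that the Leopard constructor accepts a modelPath option. Both are assumptions to verify against the SDK reference, and the helper name createCustomLeopard is hypothetical:

```javascript
const fs = require("fs");

// Creates a Leopard instance that uses a custom model file.
function createCustomLeopard(accessKey, modelPath) {
  // Fail early with a readable message if the model file is not there.
  if (!fs.existsSync(modelPath)) {
    throw new Error(`Model file not found: ${modelPath}`);
  }
  const { Leopard } = require("@picovoice/leopard-node");
  // Assumption: `modelPath` is the constructor option that selects the model.
  return new Leopard(accessKey, { modelPath });
}
```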

Word Timestamps & Confidence

Word Timestamps are essential for creating subtitles and searching. Word confidence identifies portions of the transcription that the ASR engine is unsure of. This confidence information is useful for manual correction and as an additional input feature for downstream NLU or NLP tasks (e.g. Intent Inference or Sentiment Analysis).
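As a sketch of the manual-correction use case, the helper below picks out words whose confidence falls under a threshold. It assumes each word object carries a numeric confidence field between 0 and 1. The helper name lowConfidenceWords and the default threshold of 0.5 are hypothetical choices, not SDK values:

```javascript
// Returns the words the engine was least sure about, for manual review.
// Assumption: each entry has a numeric `confidence` field in [0, 1].
function lowConfidenceWords(words, threshold = 0.5) {
  return words.filter((w) => w.confidence < threshold);
}
```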

Inspect the word timestamps and confidence:
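A minimal sketch, assuming each entry of the words array returned by processFile has word, startSec, endSec, and confidence fields. The helper name formatWords is hypothetical:

```javascript
// Formats each word as a tab-separated line:
// word, start time (s), end time (s), confidence.
// Assumption: the field names below match the SDK's word objects.
function formatWords(words) {
  return words.map(
    ({ word, startSec, endSec, confidence }) =>
      `${word}\t${startSec.toFixed(2)}\t${endSec.toFixed(2)}\t${confidence.toFixed(2)}`
  );
}
```

To print the result of a transcription, pass its words array through the helper: `formatWords(words).forEach((line) => console.log(line))`.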

The output depends on the input audio. For our sample input, below is a snippet:

Truecasing & Automatic Punctuation

Truecasing and Automatic Punctuation improve the readability of the transcription. Create an instance of Leopard with Truecasing and Automatic Punctuation:
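A minimal sketch, assuming the constructor takes an options object with an enableAutomaticPunctuation flag and that truecasing is enabled together with it. Check the SDK reference for the exact option names. The names leopardOptions and createPunctuatedLeopard are hypothetical:

```javascript
// Assumption: this flag turns on automatic punctuation (and truecasing).
const leopardOptions = { enableAutomaticPunctuation: true };

function createPunctuatedLeopard(accessKey) {
  // Loaded inside the function so the options above can be inspected
  // without the SDK installed.
  const { Leopard } = require("@picovoice/leopard-node");
  return new Leopard(accessKey, leopardOptions);
}
```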