How to Implement Voice Control in Node.js

Most developers reach for automatic speech recognition (ASR), also known as speech-to-text (STT), when they actually need Speech-to-Intent, which is a better fit for custom voice control in domain-specific applications. Speech-to-Intent (also called intent recognition or command recognition) extracts structured meaning from voice commands. Instead of transcribing 'Turn on the bedroom lights' to text, it returns:

{
  intent: 'turnOn',
  slots: {
    room: 'bedroom',
    device: 'lights'
  }
}

Unlike STT engines (like Google Speech API or Whisper) that transcribe everything a user says, or natural language understanding engines (like Google Dialogflow and Amazon Lex) initially built for text-based chatbots, Rhino Speech-to-Intent directly maps spoken voice commands to actions without the overhead of full transcription.

Rhino Speech-to-Intent also processes voice commands locally without sending audio externally. This approach delivers better privacy, reliability, and 6x higher accuracy than cloud alternatives.

Problem: Traditional STT sends audio to the cloud, transcribes full sentences, then parses the text with NLP, which adds latency, privacy concerns, and complexity.

Solution: Rhino Speech-to-Intent processes commands directly on-device, extracting structured intents without cloud dependencies.

When to Use Speech-to-Intent vs. Full Transcription:

Use Rhino when: You have predefined commands, need offline operation, require low latency, or prioritize privacy
Use STT when: You need open-ended transcription, chatbot integration, or dictation features

If you need speech-to-text in your Node.js application instead of Speech-to-Intent, refer to our guide Real-time Transcription in Node.js.

This tutorial shows you how to build a voice command system in Node.js that runs on Windows, macOS, Linux, and Raspberry Pi. It is suited to smart home automation, IoT device control, industrial equipment, accessibility tools, automotive interfaces, and voice-controlled enterprise applications.

Step-by-Step: Voice Control in Node.js

Prerequisites

Download Node.js (v18 or newer)
Sign up for a Picovoice Console account and copy your AccessKey
Train a custom context model on the Picovoice Console and download the model file (.rhn)
Check that you have a working microphone or audio input

For additional guidance on how to train a custom model, check out Creating a Custom Context with Rhino or watch Picovoice Console Tutorial: Rhino Speech-to-Intent on YouTube.

1. Install Packages

Install the Rhino Speech-to-Intent Node.js SDK and the PvRecorder Node.js SDK:

npm install @picovoice/rhino-node @picovoice/pvrecorder-node

2. Initialize the Voice Command Engine

Create an instance of Rhino, passing in your AccessKey and custom context model.

const { Rhino } = require("@picovoice/rhino-node");

const rhino = new Rhino(
  "${ACCESS_KEY}", // AccessKey from Picovoice Console
  "${CONTEXT_FILE_PATH}" // Your custom context model (.rhn)
);
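
Hardcoding the AccessKey and model path is fine for a first test, but it is easy to commit them by accident. The following is a minimal sketch of loading both values from environment variables and checking them before Rhino is constructed. The helper name loadRhinoConfig and the variable names PICOVOICE_ACCESS_KEY and RHINO_CONTEXT_PATH are assumptions made for illustration; they are not part of the SDK.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: reads settings from environment variables (the names
// are assumptions) and validates them before Rhino is constructed.
function loadRhinoConfig(env = process.env) {
  const accessKey = (env.PICOVOICE_ACCESS_KEY || "").trim();
  if (!accessKey) {
    throw new Error("PICOVOICE_ACCESS_KEY is not set");
  }

  const rawPath = (env.RHINO_CONTEXT_PATH || "").trim();
  if (!rawPath) {
    throw new Error("RHINO_CONTEXT_PATH is not set");
  }
  if (path.extname(rawPath).toLowerCase() !== ".rhn") {
    throw new Error(`Expected a .rhn context file, got: ${rawPath}`);
  }

  const contextPath = path.resolve(rawPath);
  if (!fs.existsSync(contextPath)) {
    throw new Error(`Context file not found: ${contextPath}`);
  }
  return { accessKey, contextPath };
}

// Usage in your setup code:
// const { accessKey, contextPath } = loadRhinoConfig();
// const rhino = new Rhino(accessKey, contextPath);
```

Failing early with a specific message tends to be easier to debug than an initialization error raised from inside the engine.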

3. Set Up Audio Capture for Intent Detection

Begin capturing audio with PvRecorder to prepare for intent detection:

const { PvRecorder } = require("@picovoice/pvrecorder-node");

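// 1. Initialize and start the audio capture device.
//    The recorder's frame length must match what Rhino expects.
const recorder = new PvRecorder(rhino.frameLength);
recorder.start();

// 2. Continuously read frames of audio, which will be passed to Rhino.
//    In a CommonJS script, `await` must be used inside an async function.
async function captureAudio() {
  while (true) {
    const audioFrame = await recorder.read();
    // rhino.process(audioFrame);
  }
}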

4. Detect & Map Intents to Actions

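Stream audio frames to Rhino and handle recognized inferences. rhino.process() returns true once Rhino has finalized an inference. At that point, call rhino.getInference() to retrieve the RhinoInference object, whose isUnderstood flag tells you whether the spoken command matched your context.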

// `await` must be used inside an async function in a CommonJS script
async function detectIntent() {
  let isFinalized = false;

  while (!isFinalized) {
    const audioFrame = await recorder.read();
    isFinalized = rhino.process(audioFrame);

    if (isFinalized) {
      const inference = rhino.getInference();
      if (inference.isUnderstood) {
        const intent = inference.intent;
        const slots = inference.slots;
        // take action based on inferred intent and slot values
      }
    }
  }
}
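
One way to "take action" on an inference is a lookup table from intent names to handler functions. The sketch below works on any object shaped like the inference described above (isUnderstood, intent, slots), so it runs without the SDK or a microphone. The handler table, the handleInference helper, and the turnOff intent are hypothetical; your intent and slot names come from the context you trained on Picovoice Console.

```javascript
// Hypothetical handlers keyed by intent name. "turnOn", "room", and "device"
// follow the example at the top of this article; the rest are assumptions.
const handlers = new Map([
  ["turnOn", ({ room, device }) => `Turning on ${device} in ${room ?? "all rooms"}`],
  ["turnOff", ({ room, device }) => `Turning off ${device} in ${room ?? "all rooms"}`],
]);

// Takes an inference-shaped object and returns a description of the action
// taken, or null if the command was not understood or has no handler.
function handleInference(inference) {
  if (!inference.isUnderstood) return null;
  const handler = handlers.get(inference.intent);
  if (!handler) return null; // understood, but no handler registered
  return handler(inference.slots ?? {});
}

console.log(
  handleInference({
    isUnderstood: true,
    intent: "turnOn",
    slots: { room: "bedroom", device: "lights" },
  })
); // prints "Turning on lights in bedroom"
```

Keeping the handlers in one table means a new voice command only needs a new entry, and the audio loop itself stays unchanged.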

5. Clean Up Resources

When done, stop the recorder and release resources to free memory:

recorder.stop();
recorder.release();

rhino.release();

Complete Demo: Voice Commands in Node.js

The following complete example combines all previous steps into a functional Node.js script that continuously listens for commands and logs detections to the console.

const { PvRecorder } = require("@picovoice/pvrecorder-node");
const { Rhino } = require("@picovoice/rhino-node");
const readline = require("readline");

let isRunning = true;

// Listen for spacebar
readline.emitKeypressEvents(process.stdin);
if (process.stdin.isTTY) process.stdin.setRawMode(true);

process.stdin.on("keypress", (str, key) => {
  if (key.name === "space") {
    console.log("Stopping...");
    isRunning = false;
  } else if (key.ctrl && key.name === "c") {
    isRunning = false;
  }
});

async function main() {
  let rhino = null;
  let recorder = null;

  try {
    console.log("Initializing Rhino...");
    rhino = new Rhino(
      "${ACCESS_KEY}", // AccessKey from Picovoice
      "${CONTEXT_FILE_PATH}" // Your custom context model (.rhn)
    );

    console.log("Starting intent detection... Press SPACE to stop.");
    recorder = new PvRecorder(rhino.frameLength);
    recorder.start();

    while (isRunning) {
      const audioFrame = await recorder.read();
      const isFinalized = rhino.process(audioFrame);
      if (isFinalized) {
        const inference = rhino.getInference();
        if (inference.isUnderstood) {
          const intent = inference.intent;
          const slots = inference.slots;
          console.log(intent);
          console.log(slots);
        } else {
          console.log("Command not understood");
        }
      }
    }
  } catch (err) {
    console.error("Error:", err);
  } finally {
    if (recorder) {
      try {
        recorder.stop();
        recorder.release();
      } catch (e) {
        console.warn("Failed to stop/release recorder:", e);
      }
    }
    if (rhino) {
      try {
        rhino.release();
      } catch (e) {
        console.warn("Failed to release Rhino:", e);
      }
    }
    console.log("Recorder and Rhino released. Exiting.");
    process.exit(0);
  }
}

main();

This demo uses the following packages: @picovoice/rhino-node and @picovoice/pvrecorder-node.

For a more detailed guide, refer to the documentation:

For a complete demo application, check out the Rhino Speech-to-Intent Node.js Demo on GitHub.

Troubleshooting: Common Issues

Microphone Not Detected or Audio Input Fails

Check device permissions: Ensure your app has access to the system microphone.
Verify sampling rate: Rhino Speech-to-Intent expects 16 kHz, 16-bit mono PCM input; mismatched formats will cause errors.
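
PvRecorder already delivers audio in the expected format. If your audio comes from another source that produces floating-point samples, it has to be converted to 16-bit PCM and cut into frames of the right length before it reaches Rhino. The sketch below assumes the audio is already 16 kHz mono and does not resample; the function names are hypothetical, and 512 is the frame length used earlier in this tutorial.

```javascript
// Convert float samples in [-1, 1] to 16-bit signed PCM, clamping anything
// outside that range.
function floatTo16BitPcm(samples) {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = Math.round(clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff);
  }
  return pcm;
}

// Split PCM into fixed-size frames. An incomplete tail is dropped so that
// every frame has exactly `frameLength` samples.
function splitIntoFrames(pcm, frameLength) {
  const frames = [];
  for (let start = 0; start + frameLength <= pcm.length; start += frameLength) {
    frames.push(pcm.subarray(start, start + frameLength));
  }
  return frames;
}

// Example: 1300 samples yield two full frames of 512; the last 276 are dropped.
const frames = splitIntoFrames(floatTo16BitPcm(new Float32Array(1300)), 512);
console.log(frames.length); // prints 2
```

In a streaming setup you would typically keep the leftover samples and prepend them to the next chunk instead of dropping them.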

No Intent Detected

Make sure your .rhn context file matches the phrases being spoken.
Test with clear pronunciation and limit background noise.

Enhance Your Enterprise Voice Solution

Add Wake Word Detection: Add Porcupine Wake Word for hands-free voice commands, enabling your device to wake on "Hey [Brand Name]" without buttons.
Integrate with Voice Activity Detection: Integrate with Cobra Voice Activity Detection to only pass audio to Rhino when someone is speaking.
Multi-Language Support: Train custom voice command models in multiple languages to support users in different regions.
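
As a rough illustration of the wake word option, the sketch below chains two engines: frames are ignored until the wake word fires, then passed to the intent engine until it finalizes an inference. It assumes a wake word object whose process(frame) returns a keyword index, or -1 when nothing is detected, and an intent object with the process() and getInference() methods used in this tutorial. Both are replaced with stubs here so the snippet runs without any SDK or microphone; createVoicePipeline is a hypothetical helper.

```javascript
// Two-stage loop: wait for the wake word, then feed frames to the intent
// engine until it finalizes, then go back to waiting.
function createVoicePipeline(wakeWord, intent, onInference) {
  let awake = false;
  return function processFrame(frame) {
    if (!awake) {
      awake = wakeWord.process(frame) !== -1;
      return;
    }
    if (intent.process(frame)) {
      onInference(intent.getInference());
      awake = false;
    }
  };
}

// Stubs standing in for the real engines (assumptions for illustration).
// Strings stand in for audio frames.
const stubWakeWord = { process: (frame) => (frame === "wake" ? 0 : -1) };
let framesSeen = 0;
const stubIntent = {
  process: () => ++framesSeen === 2, // "finalizes" on the second frame
  getInference: () => {
    framesSeen = 0;
    return { isUnderstood: true, intent: "turnOn", slots: {} };
  },
};

const results = [];
const processFrame = createVoicePipeline(stubWakeWord, stubIntent, (inference) =>
  results.push(inference.intent)
);
["noise", "wake", "turn", "on", "noise"].forEach((frame) => processFrame(frame));
console.log(results); // one inference, produced after the wake word
```

With real engines, each would need to be fed frames of the length it expects, and the recorder loop from the demo would call processFrame for every frame it reads.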
