How to Build Real-Time Streaming Speech-to-Text in Java

Add low-latency, real-time transcription to your application with Cheetah Streaming Speech-to-Text.

Real-time transcription has become essential for modern applications such as live captions, conversational interfaces, voice automation, and compliance-driven analytics. Cloud speech-to-text (STT) solutions, such as Azure Speech AI, Amazon Transcribe, and Google Streaming ASR, introduce inherent challenges: network latency, privacy concerns, and inconsistent performance in environments with unreliable internet connectivity.

To overcome these issues, enterprise developers are shifting to fully local speech recognition. Running STT directly on the device ensures deterministic latency, preserves data privacy, and provides consistent performance across diverse deployment environments. It also simplifies requirements around PII handling, auditability, and data residency, since audio never leaves the device.

Cheetah Streaming Speech-to-Text provides fast, accurate, and cross-platform (Windows, Linux, macOS, and Raspberry Pi) voice transcription fully on-device.

This guide shows how to build a real-time streaming speech-to-text Java application using Cheetah Streaming Speech-to-Text.

What You'll Learn

Train personalized speech-to-text models to incorporate custom vocabulary and prioritize the recognition of specific words
Capture and process microphone audio in Java
Manage PCM buffers and audio formats
Stream audio frames into a real-time speech-to-text engine
Handle real-time transcripts and endpoint detection

Train Custom Speech-to-Text Models

Cheetah Streaming Speech-to-Text supports custom vocabulary to deliver precise transcription of specialized terminology. Developers can create customized STT models by adding unique words, fine-tuning pronunciations, and prioritizing recognition of key phrases—crucial for industries such as healthcare, finance, manufacturing, and customer support.

See how using custom medical vocabulary with Cheetah Streaming Speech-to-Text can reduce word error rate (WER) by 57%.

Easily create and manage custom models via the Picovoice Console. Follow our step-by-step guide to building a custom Cheetah model or watch the Cheetah Console Tutorial on YouTube.

For applications that don't require customization, you can use one of the default Cheetah speech-to-text models.

Step-by-Step: Real-time Transcription in Java

Prerequisites

Install JDK 11+
Create a free account on the Picovoice Console and copy your AccessKey
Ensure you have a functioning microphone

1. Project setup

Include cheetah-java as a dependency in your build.gradle file. Replace ${LATEST_VERSION} with the latest version available:

implementation 'ai.picovoice:cheetah-java:${LATEST_VERSION}'

Also include the following block in build.gradle so our demo application can handle keyboard input later:

run {
    standardInput = System.in
}

2. Initialize Cheetah

Create a Cheetah instance using its builder:

Cheetah cheetah = new Cheetah.Builder()
        .setAccessKey("${ACCESS_KEY}")
        .build();

Replace ${ACCESS_KEY} with your AccessKey from Picovoice Console.
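
If you would rather not hard-code the AccessKey in source, one option is to read it from an environment variable and validate it before building Cheetah. The sketch below is a minimal illustration, not part of the Cheetah SDK: the class `AccessKeyResolver` and the variable name `PICOVOICE_ACCESS_KEY` are hypothetical names chosen for this example.

```java
import java.util.Optional;

/** Sketch: resolve the AccessKey from the environment instead of source code. */
final class AccessKeyResolver {

    // Hypothetical variable name for this example; Cheetah does not read it on its own.
    static final String ENV_VAR = "PICOVOICE_ACCESS_KEY";

    /** Returns the trimmed key, or empty if it is missing, blank, or still the placeholder. */
    static Optional<String> resolve(String rawValue) {
        if (rawValue == null) {
            return Optional.empty();
        }
        String trimmed = rawValue.trim();
        if (trimmed.isEmpty() || trimmed.equals("${ACCESS_KEY}")) {
            return Optional.empty();
        }
        return Optional.of(trimmed);
    }

    public static void main(String[] args) {
        Optional<String> key = resolve(System.getenv(ENV_VAR));
        if (key.isPresent()) {
            // Avoid printing the key itself; report only that it was found.
            System.out.println("AccessKey found (" + key.get().length() + " characters).");
        } else {
            System.err.println("Set " + ENV_VAR + " before starting the application.");
        }
    }
}
```

In the application, the resolved value would be passed to `setAccessKey(...)` in place of the string literal.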

3. Configure Microphone Input

Cheetah determines the required audio settings:

16 kHz sample rate
16-bit
Mono
Little-endian PCM

final int sampleRate = cheetah.getSampleRate();
AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);

This format matches Cheetah's internal audio requirements.

Next, open the microphone:

DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
TargetDataLine mic = (TargetDataLine) AudioSystem.getLine(info);
mic.open(format);
mic.start();
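
Before opening the line, it can help to confirm that the system can actually deliver audio in this format. The sketch below uses only the standard `javax.sound.sampled` API. The class name `MicFormatCheck` is hypothetical, and the 16000 Hz rate is hard-coded as an assumption so the example runs without Cheetah; in the real application, use `cheetah.getSampleRate()`.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.TargetDataLine;

/** Sketch: check whether a capture line in the required PCM format is available. */
final class MicFormatCheck {

    /** Builds a signed 16-bit, mono, little-endian PCM format at the given sample rate. */
    static AudioFormat pcmFormat(int sampleRate) {
        return new AudioFormat(sampleRate, 16, 1, true, false);
    }

    /** Returns true if some installed mixer can provide a capture line in this format. */
    static boolean isCaptureSupported(AudioFormat format) {
        DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
        return AudioSystem.isLineSupported(info);
    }

    public static void main(String[] args) {
        // Assumed rate for this standalone sketch; prefer cheetah.getSampleRate().
        AudioFormat format = pcmFormat(16000);
        System.out.println("Format: " + format);
        System.out.println("Bytes per sample frame: " + format.getFrameSize());
        System.out.println("Capture supported: " + isCaptureSupported(format));
    }
}
```

If the check reports that capture is not supported, `AudioSystem.getLine(info)` would be expected to fail as well, so the application can print a clearer message up front.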

4. Handle Audio Buffers

Because Cheetah consumes audio frame by frame, you must feed it exact slices of PCM samples.

Frame size: cheetah.getFrameLength()
Sample format: Signed 16-bit integers (short in Java)
Byte calculation: Each frame requires frameLength * 2 bytes (16 bits = 2 bytes per sample)

Your application must:

Capture raw bytes from the microphone
Interpret them as little-endian PCM
Convert them into a short[]
Pass them to cheetah.process()

// set up PCM buffer
final int frameLength = cheetah.getFrameLength();
final int frameBytes = frameLength * 2; // 16-bit audio
ByteBuffer captureBuffer = ByteBuffer.allocate(frameBytes);
captureBuffer.order(ByteOrder.LITTLE_ENDIAN);
short[] cheetahBuffer = new short[frameLength];

// read and pass audio to Cheetah, one frame per loop iteration
while (System.in.available() == 0) {
    int numBytesRead = mic.read(captureBuffer.array(), 0, captureBuffer.capacity());
    if (numBytesRead != frameBytes) {
        continue; // skip incomplete frames
    }
    captureBuffer.asShortBuffer().get(cheetahBuffer);
    CheetahTranscript transcriptObj = cheetah.process(cheetahBuffer);
}

This ensures that audio is aligned correctly with Cheetah's model.
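
To see the byte-to-sample conversion in isolation, the following sketch decodes a few hand-written little-endian bytes into a `short[]` without a microphone or Cheetah. The class name `PcmConversion` and the length check are illustrative additions, not part of the SDK.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Sketch: decode little-endian 16-bit PCM bytes into a reusable sample buffer. */
final class PcmConversion {

    /** Fills 'samples' from 'pcmBytes'; each sample consumes two bytes, low byte first. */
    static void decode(byte[] pcmBytes, short[] samples) {
        if (pcmBytes.length != samples.length * 2) {
            throw new IllegalArgumentException(
                    "expected " + (samples.length * 2) + " bytes, got " + pcmBytes.length);
        }
        ByteBuffer.wrap(pcmBytes)
                .order(ByteOrder.LITTLE_ENDIAN)
                .asShortBuffer()
                .get(samples);
    }

    public static void main(String[] args) {
        // Three samples, low byte first: 0x0001, 0xFFFF, 0x8000.
        byte[] raw = {0x01, 0x00, (byte) 0xFF, (byte) 0xFF, 0x00, (byte) 0x80};
        short[] samples = new short[3];
        decode(raw, samples);
        System.out.println(Arrays.toString(samples)); // prints [1, -1, -32768]
    }
}
```

Reading the same bytes as big-endian would produce different values, which is why the byte order is set explicitly rather than left at the `ByteBuffer` default.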

5. Process Audio Frames & Stream the Transcript

Every time you pass a frame to Cheetah, it adds the audio to its internal buffer. When enough context is available, the returned CheetahTranscript contains newly transcribed text; otherwise, its transcript is empty.

CheetahTranscript transcript = cheetah.process(cheetahBuffer);
System.out.print(transcript.getTranscript());

Cheetah also detects endpoints (natural pauses in speech). When an endpoint is detected, call flush to process any remaining buffered audio:

if (transcript.getIsEndpoint()) {
    CheetahTranscript finalChunk = cheetah.flush();
    System.out.println(finalChunk.getTranscript());
}
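
If your application needs whole utterances rather than a running stream of text, one approach is to accumulate the partial transcripts and close the utterance when an endpoint arrives. The sketch below shows that bookkeeping on its own. `UtteranceAssembler` is a hypothetical helper, and the strings in `main` are simulated stand-ins for what `process()` and `flush()` would return.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: group streamed partial transcripts into complete utterances. */
final class UtteranceAssembler {

    private final StringBuilder current = new StringBuilder();
    private final List<String> utterances = new ArrayList<>();

    /** Appends the partial text returned for one frame; empty strings are harmless. */
    void onPartial(String partial) {
        current.append(partial);
    }

    /** Call when an endpoint is reported, passing the text returned by flush. */
    void onEndpoint(String flushed) {
        current.append(flushed);
        String utterance = current.toString().trim();
        if (!utterance.isEmpty()) {
            utterances.add(utterance);
        }
        current.setLength(0); // start the next utterance from scratch
    }

    List<String> utterances() {
        return utterances;
    }

    public static void main(String[] args) {
        UtteranceAssembler assembler = new UtteranceAssembler();
        // Simulated engine output for two utterances.
        assembler.onPartial("turn on ");
        assembler.onPartial("");
        assembler.onPartial("the ");
        assembler.onEndpoint("lights");
        assembler.onPartial("thank ");
        assembler.onEndpoint("you");
        System.out.println(assembler.utterances()); // prints [turn on the lights, thank you]
    }
}
```

In the capture loop, `onPartial` would receive `transcript.getTranscript()` on every frame, and `onEndpoint` would receive the transcript from `cheetah.flush()` whenever `getIsEndpoint()` is true.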

6. Full Working Example (Copy & Run)

Below is the complete example Java program, combining initialization, audio capture, streaming STT, endpoint detection, and cleanup.

Replace ${ACCESS_KEY} with your AccessKey from Picovoice Console before running.

package javacheetah;

import ai.picovoice.cheetah.*;
import javax.sound.sampled.*;
import java.io.*;
import java.nio.*;

public class App {

    public static void main(String[] args) throws Exception {

        // Initialize Cheetah
        Cheetah cheetah = null;
        try {
            cheetah = new Cheetah.Builder()
                    .setAccessKey("${ACCESS_KEY}")
                    .build();
        } catch (Exception e) {
            System.err.println("Failed to initialize Cheetah: " + e.getMessage());
            return;
        }

        // Microphone format: 16 kHz, mono, 16-bit, little-endian
        final int sampleRate = cheetah.getSampleRate();
        AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);

        TargetDataLine mic;
        try {
            DataLine.Info dataLineInfo = new DataLine.Info(TargetDataLine.class, format);
            mic = (TargetDataLine) AudioSystem.getLine(dataLineInfo);
            mic.open(format);
        } catch (Exception ex) {
            System.err.println("Could not open microphone: " + ex.getMessage());
            cheetah.delete();
            return;
        }

        mic.start();
        System.out.println("Listening… press Enter to stop.\n");

        // Buffers for processing audio
        final int frameLength = cheetah.getFrameLength();
        final int frameBytes = frameLength * 2; // 16-bit audio
        ByteBuffer captureBuffer = ByteBuffer.allocate(frameBytes);
        captureBuffer.order(ByteOrder.LITTLE_ENDIAN);
        short[] cheetahBuffer = new short[frameLength];

        try {
            int numBytesRead;
            while (System.in.available() == 0) {

                // read a buffer of audio
                numBytesRead = mic.read(captureBuffer.array(), 0, captureBuffer.capacity());

                // don't pass to cheetah if we don't have a full buffer
                if (numBytesRead != frameBytes) {
                    continue;
                }

                // copy into 16-bit buffer
                captureBuffer.asShortBuffer().get(cheetahBuffer);

                // process with cheetah
                CheetahTranscript transcriptObj = cheetah.process(cheetahBuffer);
                System.out.print(transcriptObj.getTranscript());
                if (transcriptObj.getIsEndpoint()) {
                    CheetahTranscript endpointTranscriptObj = cheetah.flush();
                    System.out.println(endpointTranscriptObj.getTranscript());
                }
                System.out.flush();
            }
            System.out.println("\nStopping...");
        } catch (Exception e) {
            System.err.println("Error: " + e.toString());
        } finally {
            // Cleanup
            mic.stop();
            mic.close();
            cheetah.delete();
        }
    }
}

For a complete Java application, see the Cheetah Java demo on GitHub.

This tutorial uses the following package:

cheetah-java

Troubleshooting: Common Issues

1. Microphone Not Detected

Confirm the OS has granted mic permissions
Ensure you're using a supported format: 16 kHz, mono, 16-bit linear PCM

2. Transcription Delays or Gaps

Make sure every process() call receives exactly frameLength samples
Avoid allocating new buffers on every loop—reuse them instead

3. Empty Transcription Output

Verify your microphone gain
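Ensure the transcript is actually printed: partial transcripts are written without a trailing newline, so flush System.out after each print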
Speak clearly into the mic for testing
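
One way to check microphone gain is to measure the level of the captured frames. The sketch below computes the RMS level of a frame in dBFS (decibels relative to full scale, where 0 is the loudest possible signal). `LevelMeter` is a hypothetical helper, and the frame length of 512 in `main` is an arbitrary value for the demonstration rather than Cheetah's actual frame length.

```java
/** Sketch: measure the RMS level of a 16-bit PCM frame in dBFS. */
final class LevelMeter {

    /** Returns the RMS level in dBFS; an all-zero frame yields negative infinity. */
    static double rmsDbfs(short[] frame) {
        if (frame.length == 0) {
            throw new IllegalArgumentException("frame must not be empty");
        }
        double sumSquares = 0.0;
        for (short sample : frame) {
            double normalized = sample / 32768.0; // scale to the range [-1.0, 1.0)
            sumSquares += normalized * normalized;
        }
        double rms = Math.sqrt(sumSquares / frame.length);
        return 20.0 * Math.log10(rms);
    }

    public static void main(String[] args) {
        short[] silence = new short[512];      // arbitrary demo length
        short[] halfScale = new short[512];
        java.util.Arrays.fill(halfScale, (short) 16384);

        System.out.println("Silence:    " + rmsDbfs(silence) + " dBFS");
        System.out.println("Half scale: " + rmsDbfs(halfScale) + " dBFS");
    }
}
```

Calling `rmsDbfs(cheetahBuffer)` inside the capture loop and printing the result occasionally gives a rough picture: readings that stay far below full scale while you speak suggest the gain is too low or the wrong input device is selected.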

Expand Your Real-Time Voice Pipeline

Once you have streaming STT running, consider adding:

Wake Word Activation: Use Porcupine Wake Word to activate transcription hands-free.
Custom Voice Control: Use Rhino Speech-to-Intent to interpret the meaning of spoken commands.
Multilingual Support: Train and deploy additional languages for your application's target regions.