Real-time transcription has become essential for modern applications such as live captions, conversational interfaces, voice automation, and compliance-driven analytics. Cloud speech-to-text (STT) solutions, such as Azure Speech AI, Amazon Transcribe, and Google Streaming ASR, introduce inherent challenges: network latency, privacy concerns, and inconsistent performance in environments with unreliable internet connectivity.

To overcome these issues, enterprise developers are shifting to fully local speech recognition. Running STT directly on the device removes the network round trip from the latency path, keeps audio private, and delivers consistent performance regardless of connectivity. It also simplifies requirements around PII handling, auditability, and data residency, since audio never leaves the device.

Cheetah Streaming Speech-to-Text provides fast, accurate, and cross-platform (Windows, Linux, macOS, and Raspberry Pi) voice transcription fully on-device.

This guide shows how to build a real-time streaming speech-to-text Java application using Cheetah Streaming Speech-to-Text.

What You'll Learn

  • Train personalized speech-to-text models to incorporate custom vocabulary and prioritize the recognition of specific words
  • Capture and process microphone audio in Java
  • Manage PCM buffers and audio formats
  • Stream audio frames into a real-time speech-to-text engine
  • Handle real-time transcripts and endpoint detection

Train Custom Speech-to-Text Models

Cheetah Streaming Speech-to-Text supports custom vocabulary to deliver precise transcription of specialized terminology. Developers can create customized STT models by adding unique words, fine-tuning pronunciations, and prioritizing recognition of key phrases. This is especially useful in industries such as healthcare, finance, manufacturing, and customer support.

Easily create and manage custom models via the Picovoice Console. Follow our step-by-step guide to building a custom Cheetah model or watch the Cheetah Console Tutorial on YouTube.

For applications that don't require customization, you can use one of the default Cheetah speech-to-text models.

Step-by-Step: Real-time Transcription in Java

Prerequisites

  • Install JDK 11+
  • Create a free account on the Picovoice Console and copy your AccessKey
  • Ensure you have a functioning microphone

1. Project setup

Add cheetah-java (published under the Maven coordinates ai.picovoice:cheetah-java) as a dependency in your build.gradle file, using the latest available version in place of ${LATEST_VERSION}.

Also configure the Gradle run task in build.gradle to forward standard input (standardInput = System.in) so the demo application can handle keyboard input later.

2. Initialize Cheetah

Create a Cheetah instance using its builder:

Replace ${ACCESS_KEY} with your AccessKey from Picovoice Console.

3. Configure Microphone Input

Cheetah expects audio in the following format:

  • 16 kHz sample rate
  • 16-bit
  • Mono
  • Little-endian PCM

Open the microphone with exactly this format so that captured audio can be passed to Cheetah without conversion.
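
The settings listed above map directly onto the standard `javax.sound.sampled.AudioFormat` class. The following is a minimal sketch, assuming the sample rate is passed in as a parameter (16000 for the format above). The names `CaptureFormat` and `forSampleRate` are illustrative and not part of the Cheetah SDK.

```java
import javax.sound.sampled.AudioFormat;

final class CaptureFormat {
    /** Builds a 16-bit, mono, signed, little-endian PCM format at the given sample rate. */
    static AudioFormat forSampleRate(int sampleRate) {
        return new AudioFormat(
                (float) sampleRate, // samples per second
                16,                 // bits per sample
                1,                  // channels (mono)
                true,               // signed samples
                false);             // bigEndian = false, i.e. little-endian
    }
}
```

Calling `CaptureFormat.forSampleRate(16000)` yields a format whose frame size is 2 bytes, which is where the "2 bytes per sample" figure used for buffer sizing comes from.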

Next, open the microphone:
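
One way to do this with the standard `javax.sound.sampled` API is sketched below. It assumes the default capture device is the one you want. The class name `Microphone` and its methods are illustrative helpers, not part of the Cheetah SDK.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;

final class Microphone {
    /** Describes a capture line (TargetDataLine) that delivers audio in the given format. */
    static DataLine.Info captureInfo(AudioFormat format) {
        return new DataLine.Info(TargetDataLine.class, format);
    }

    /** Opens and starts the default capture device, or throws if none supports the format. */
    static TargetDataLine open(AudioFormat format) throws LineUnavailableException {
        DataLine.Info info = captureInfo(format);
        if (!AudioSystem.isLineSupported(info)) {
            throw new LineUnavailableException("No capture device supports " + format);
        }
        TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
        line.open(format); // acquires the device
        line.start();      // begins buffering captured audio
        return line;
    }
}
```

When capture is finished, call `stop()` and then `close()` on the returned line to release the device.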

4. Handle Audio Buffers

Because Cheetah consumes audio frame by frame, you must feed it exact slices of PCM samples.

  • Frame size: cheetah.getFrameLength()
  • Sample format: Signed 16-bit integers (short in Java)
  • Byte calculation: Each frame requires frameLength * 2 bytes (16 bits = 2 bytes per sample)

Your application must:

  • Capture raw bytes from the microphone
  • Interpret them as little-endian PCM
  • Convert them into a short[]
  • Pass them to cheetah.process()

This ensures that every call to cheetah.process() receives exactly one frame of correctly decoded samples.
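
The byte-to-sample conversion in the steps above can be sketched with `java.nio.ByteBuffer`. This is a minimal illustration, assuming the input holds little-endian signed 16-bit PCM. `PcmDecoder` is a hypothetical helper name.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PcmDecoder {
    /**
     * Decodes little-endian signed 16-bit PCM bytes into samples.
     * Exactly out.length samples are read from the start of in.
     */
    static void decode(byte[] in, short[] out) {
        if (in.length < out.length * 2) {
            throw new IllegalArgumentException(
                    "Need " + (out.length * 2) + " bytes but got " + in.length);
        }
        ByteBuffer.wrap(in)
                .order(ByteOrder.LITTLE_ENDIAN) // low byte first
                .asShortBuffer()
                .get(out);                      // fills the whole out array
    }
}
```

Sizing `out` to the engine's frame length and `in` to twice that many bytes keeps the two buffers in step with the byte calculation given above.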

5. Process Audio Frames & Stream the Transcript

Every time you pass a frame to Cheetah, it adds the audio to its internal buffer. When enough context is available, it returns a partial transcript; otherwise, it returns null.

Cheetah also detects endpoints (natural pauses in speech). When an endpoint is detected, call flush to process any remaining buffered audio:
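
The control flow described here can be sketched without the SDK on the classpath. In the sketch below, `Engine` and `Partial` are hypothetical stand-ins invented for illustration. They are not Cheetah's actual classes, and the real SDK's types and method signatures may differ. Only the process, check endpoint, flush pattern is taken from the description above.

```java
final class TranscriptLoop {
    /** Hypothetical result type: partial text plus an endpoint flag. Not the SDK's class. */
    static final class Partial {
        final String text;
        final boolean endpoint;

        Partial(String text, boolean endpoint) {
            this.text = text;
            this.endpoint = endpoint;
        }
    }

    /** Hypothetical stand-in for a streaming engine; the real SDK's signatures may differ. */
    interface Engine {
        Partial process(short[] frame); // may return null when no text is ready yet
        String flush();                 // returns text for any remaining buffered audio
    }

    /** Feeds frames one by one, appending partial text and flushing at each endpoint. */
    static String run(Engine engine, Iterable<short[]> frames) {
        StringBuilder transcript = new StringBuilder();
        for (short[] frame : frames) {
            Partial partial = engine.process(frame);
            if (partial == null) {
                continue; // not enough context yet
            }
            if (partial.text != null) {
                transcript.append(partial.text);
            }
            if (partial.endpoint) {
                String remainder = engine.flush();
                if (remainder != null) {
                    transcript.append(remainder);
                }
                transcript.append('\n'); // start a new line after each utterance
            }
        }
        return transcript.toString();
    }
}
```

In a live application the frames would come from the microphone rather than a fixed collection, and partial text would typically be printed as it arrives instead of being accumulated.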

6. Full Working Example (Copy & Run)

Below is the complete example Java program, combining initialization, audio capture, streaming STT, endpoint detection, and cleanup.

Replace ${ACCESS_KEY} with your AccessKey from Picovoice Console before running.

For a complete Java application, see the Cheetah Java demo on GitHub.

This tutorial uses the following package:

Explore our documentation for more details:

Troubleshooting: Common Issues

1. Microphone Not Detected

  • Confirm the OS has granted mic permissions
  • Ensure you're using a supported format: 16 kHz, mono, 16-bit linear PCM
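
To check what the JVM can actually see, it can help to list the mixers that expose a capture line. The following is a small diagnostic sketch using only `javax.sound.sampled`. `CaptureDevices` is an illustrative name, and an empty result suggests a permissions or driver problem rather than an issue in your transcription code.

```java
import java.util.ArrayList;
import java.util.List;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.Line;
import javax.sound.sampled.Mixer;
import javax.sound.sampled.TargetDataLine;

final class CaptureDevices {
    /** Returns the names of all mixers that expose at least one capture (TargetDataLine) line. */
    static List<String> names() {
        List<String> names = new ArrayList<>();
        Line.Info capture = new Line.Info(TargetDataLine.class);
        for (Mixer.Info info : AudioSystem.getMixerInfo()) {
            Mixer mixer = AudioSystem.getMixer(info);
            if (mixer.isLineSupported(capture)) {
                names.add(info.getName());
            }
        }
        return names;
    }
}
```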

2. Transcription Delays or Gaps

  • Make sure every process() call receives exactly frameLength samples
  • Avoid allocating new buffers on every loop iteration; reuse them instead
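
Both points can be handled by a small reader that blocks until a full frame is available and reuses the same two buffers for every frame. This is a sketch under stated assumptions: the audio arrives through a plain `InputStream` (a `TargetDataLine` can be wrapped in a `javax.sound.sampled.AudioInputStream` for this purpose), and the consumer finishes with each frame before requesting the next. `FrameReader` is a hypothetical helper name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class FrameReader {
    private final InputStream in;
    private final byte[] bytes;    // reused for every frame
    private final short[] samples; // reused for every frame

    FrameReader(InputStream in, int frameLength) {
        this.in = in;
        this.bytes = new byte[frameLength * 2]; // 2 bytes per 16-bit sample
        this.samples = new short[frameLength];
    }

    /**
     * Returns the shared sample buffer filled with one full frame,
     * or null if the stream ends before a full frame is available.
     * The returned array is overwritten by the next call.
     */
    short[] next() {
        int filled = 0;
        try {
            while (filled < bytes.length) {
                int n = in.read(bytes, filled, bytes.length - filled);
                if (n < 0) {
                    return null; // end of stream, incomplete frame is dropped
                }
                filled += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer().get(samples);
        return samples;
    }
}
```

Because the returned array is shared, copy it if a frame needs to outlive the next call to `next()`.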

3. Empty Transcription Output

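  • Verify that the microphone gain is high enough for speech to register
  • Confirm that your code prints every non-null transcript returned by process() and flush()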
  • Speak clearly into the mic for testing
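
A quick way to check gain is to log the signal level of captured frames while speaking. The sketch below computes the RMS and peak of a frame in raw 16-bit sample units. `LevelMeter` is an illustrative helper, and what counts as "too quiet" depends on your hardware, so no specific threshold is assumed.

```java
final class LevelMeter {
    /** Root-mean-square amplitude of a frame, in raw 16-bit sample units. */
    static double rms(short[] frame) {
        if (frame.length == 0) {
            return 0.0;
        }
        double sumSquares = 0.0;
        for (short s : frame) {
            sumSquares += (double) s * s;
        }
        return Math.sqrt(sumSquares / frame.length);
    }

    /** Largest absolute sample value in the frame; values near 32767 suggest clipping. */
    static int peak(short[] frame) {
        int peak = 0;
        for (short s : frame) {
            peak = Math.max(peak, Math.abs((int) s)); // widen first so -32768 is handled
        }
        return peak;
    }
}
```

If both values stay at or near zero while you speak, the frames reaching the engine are effectively silent and the problem lies upstream of transcription.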

Expand Your Real-Time Voice Pipeline

Once you have streaming STT running, consider adding:
