Enterprise developers building real-time speech recognition in C face challenges with audio pipeline complexity, memory management, and cross-platform compatibility. Streaming speech-to-text (STT) solutions must also work reliably on Linux, Windows, macOS, and Raspberry Pi while maintaining minimal latency.

Cloud-based services like Azure Real-Time STT, Amazon Transcribe Streaming, and Google Streaming ASR require constant internet connectivity and send audio data to remote servers. For applications that handle sensitive audio or run in environments with unreliable internet connectivity, on-device speech recognition is essential.

This tutorial shows how to implement cross-platform streaming speech-to-text in C using Cheetah Streaming Speech-to-Text, an on-device engine compatible with Linux, Windows, macOS, and Raspberry Pi. You'll learn to capture live microphone input, process audio frames in real time, and generate accurate transcriptions.

By the end, you'll have a working C application that performs real-time transcription across all major platforms from a single codebase.

Important: This guide builds on How to Record Audio in C. If you haven't completed that setup yet, start with that tutorial to get your recording environment in place.

Prerequisites

  • C99-compatible compiler
  • Windows: MinGW

Supported Platforms

  • Linux (x86_64)
  • macOS (x86_64, arm64)
  • Windows (x86_64, arm64)
  • Raspberry Pi (3, 4, 5)

Project Setup

This is the folder structure used in this tutorial. You can organize your files differently if you like, but make sure to update the paths in the examples accordingly:
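
The layout below is one possible arrangement; folder and file names are illustrative, and the Cheetah header, library, and model files come from the downloads in Step 1:

    cheetah_tutorial/
    ├── cheetah/
    │   ├── include/            (Cheetah header files, e.g. pv_cheetah.h, picovoice.h)
    │   ├── lib/                (platform-specific Cheetah library, e.g. libpv_cheetah.so)
    │   └── cheetah_params.pv   (Cheetah model file)
    ├── pvrecorder/             (PvRecorder headers and library from the recording tutorial)
    └── cheetah_tutorial.c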

To set up audio capture (pvrecorder), refer to How to Record Audio in C.

Step 1. Add Cheetah library files

  1. Create a folder named cheetah/.
  2. Download the Cheetah header files from GitHub and place them in:
  3. Download a Cheetah model file and the correct library file for your platform and place them in:

If your application needs to recognize custom vocabulary, boost recognition of specific phrases, or handle custom pronunciations, train a custom STT model instead of using one of the default model files.

Implement Dynamic Loading

Cheetah ships as pre-built platform libraries, which means:

  • the shared library (.so, .dylib, .dll) is not linked at compile time
  • the program loads it at runtime
  • functions must be retrieved by name

So, we need to write small helper functions to:

  1. open the shared library
  2. look up function pointers
  3. close the library

Step 2. Include platform-specific headers

Why these matter

  • On Windows systems, windows.h provides the LoadLibrary function to load a shared library and GetProcAddress to retrieve individual function pointers.
  • On Unix-based systems, dlopen and dlsym from the dlfcn.h header provide the same functionality.
  • Lastly, signal.h allows us to handle Ctrl-C later in this example.
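
A sketch of the top of cheetah_tutorial.c, with the platform switch handled by the preprocessor (the Cheetah and PvRecorder headers are the ones added to the project in Step 1 and the recording tutorial):

    #include <signal.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #if defined(_WIN32) || defined(_WIN64)
    #include <windows.h>   // LoadLibrary, GetProcAddress, FreeLibrary
    #else
    #include <dlfcn.h>     // dlopen, dlsym, dlclose
    #endif

    #include "pv_cheetah.h"    // Cheetah types, status codes, and function declarations
    #include "pv_recorder.h"   // PvRecorder (microphone capture)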

Step 3. Define dynamic loading helper functions

3a. Open the shared library
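
A minimal sketch, assuming the headers from Step 2 are included:

    // open a shared library at runtime; returns an opaque handle or NULL on failure
    static void *open_dl(const char *dl_path) {
    #if defined(_WIN32) || defined(_WIN64)
        return (void *) LoadLibrary(dl_path);
    #else
        return dlopen(dl_path, RTLD_NOW);
    #endif
    }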

3b. Load function symbols
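
The same pattern works for symbol lookup, using GetProcAddress on Windows and dlsym elsewhere:

    // look up a function by name in an already-opened shared library
    static void *load_symbol(void *dl, const char *symbol_name) {
    #if defined(_WIN32) || defined(_WIN64)
        return (void *) GetProcAddress((HMODULE) dl, symbol_name);
    #else
        return dlsym(dl, symbol_name);
    #endif
    }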

3c. Close the library
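
A sketch of the matching cleanup helper:

    // release the shared library handle when the program no longer needs it
    static void close_dl(void *dl) {
    #if defined(_WIN32) || defined(_WIN64)
        FreeLibrary((HMODULE) dl);
    #else
        dlclose(dl);
    #endif
    }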

3d. Print platform-correct errors
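
And a sketch that reports the most recent loader error on each platform:

    // print the most recent dynamic-loading error in a platform-appropriate way
    static void print_dl_error(const char *message) {
    #if defined(_WIN32) || defined(_WIN64)
        fprintf(stderr, "%s with code '%lu'.\n", message, GetLastError());
    #else
        fprintf(stderr, "%s with '%s'.\n", message, dlerror());
    #endif
    }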

Implement Streaming Speech-to-Text

Now that we've set up dynamic loading, we can actually use the Cheetah API.

Step 4. Load the library file

Download the correct library file for your platform and point library_path to the file.
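
For example, using the helpers from Step 3 (the path below is a placeholder for a Linux x86_64 setup; use the .dylib on macOS and the .dll on Windows):

    const char *library_path = "cheetah/lib/libpv_cheetah.so";  // placeholder path

    void *cheetah_dl = open_dl(library_path);
    if (!cheetah_dl) {
        print_dl_error("Failed to open the Cheetah library");
        exit(EXIT_FAILURE);
    }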

Step 5. Initialize Cheetah

  1. Sign up for a free account on Picovoice Console and obtain your AccessKey
  2. Replace ${ACCESS_KEY} with your AccessKey
  3. Download a model file and point model_path to the file. You can choose between default and fast models for each supported language.

Call pv_cheetah_init to create a Cheetah instance:
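
A minimal sketch, assuming cheetah_dl was opened as in Step 4; the function-pointer signature mirrors the declaration in pv_cheetah.h (verify against your copy of the header), and the model path is a placeholder:

    // resolve pv_cheetah_init by name from the loaded library
    typedef pv_status_t (*pv_cheetah_init_fn)(
            const char *access_key,
            const char *model_path,
            float endpoint_duration_sec,
            bool enable_automatic_punctuation,
            pv_cheetah_t **object);

    pv_cheetah_init_fn pv_cheetah_init_func =
            (pv_cheetah_init_fn) load_symbol(cheetah_dl, "pv_cheetah_init");
    if (!pv_cheetah_init_func) {
        print_dl_error("Failed to load 'pv_cheetah_init'");
        exit(EXIT_FAILURE);
    }

    const char *access_key = "${ACCESS_KEY}";              // your Picovoice Console AccessKey
    const char *model_path = "cheetah/cheetah_params.pv";  // placeholder -- point at your .pv file

    pv_cheetah_t *cheetah = NULL;
    pv_status_t status = pv_cheetah_init_func(
            access_key,
            model_path,
            1.f,     // endpoint_duration_sec
            true,    // enable_automatic_punctuation
            &cheetah);
    if (status != PV_STATUS_SUCCESS) {
        fprintf(stderr, "Failed to initialize Cheetah.\n");
        exit(EXIT_FAILURE);
    }

The remaining Cheetah and PvRecorder functions used below are resolved with load_symbol in exactly the same way.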

Explanation of parameters:

  • access_key: Picovoice Console AccessKey
  • model_path: Choose desired language model or train a custom model
  • endpoint_duration_sec: Duration of the endpoint in seconds. A speech endpoint is detected when an utterance is followed by a segment of audio of this duration that contains no speech. Set to 0 to disable endpoint detection.
  • enable_automatic_punctuation: Set to true to enable automatic punctuation insertion.

Step 6. Transcribe audio

Pass recorded audio frames (with PvRecorder) to Cheetah for processing:
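
A sketch of the processing loop. It assumes recorder is a started PvRecorder instance (see the recording tutorial), is_interrupted is a flag set by a SIGINT handler, and the pv_cheetah_*_func and pv_recorder_read_func pointers have been resolved with load_symbol just like pv_cheetah_init; transcript strings are released with the transcript-delete function declared in pv_cheetah.h:

    const int32_t frame_length = pv_cheetah_frame_length_func();  // samples per frame
    int16_t *pcm = malloc(frame_length * sizeof(int16_t));

    while (!is_interrupted) {
        // read one frame of audio from the microphone (PvRecorder)
        pv_recorder_read_func(recorder, pcm);

        // feed the frame to Cheetah; a partial transcript is returned once enough context exists
        char *partial_transcript = NULL;
        bool is_endpoint = false;
        pv_cheetah_process_func(cheetah, pcm, &partial_transcript, &is_endpoint);
        if (partial_transcript) {
            printf("%s", partial_transcript);
            fflush(stdout);
            pv_cheetah_transcript_delete_func(partial_transcript);
        }

        // on an endpoint, flush buffered audio to get the final piece of the utterance
        if (is_endpoint) {
            char *remaining_transcript = NULL;
            pv_cheetah_flush_func(cheetah, &remaining_transcript);
            printf("%s\n", remaining_transcript);
            pv_cheetah_transcript_delete_func(remaining_transcript);
        }
    }

    // after the loop, flush once more so no buffered audio is lost
    char *final_transcript = NULL;
    pv_cheetah_flush_func(cheetah, &final_transcript);
    printf("%s\n", final_transcript);
    pv_cheetah_transcript_delete_func(final_transcript);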

Explanation:

  • pv_cheetah_frame_length: Required number of samples per frame.
  • pv_cheetah_process: Buffers audio until sufficient context is available, then returns a partial transcript; otherwise, it returns NULL.
    • is_endpoint: Indicates a natural pause in speech, marking a possible end of an utterance.
  • pv_cheetah_flush: Transcribes any remaining buffered audio.

Step 7. Cleanup

When done, delete Cheetah to free memory:
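
For example, using the objects and function pointers from the previous steps:

    pv_cheetah_delete_func(cheetah);  // release the Cheetah instance
    free(pcm);                        // release the audio frame buffer
    close_dl(cheetah_dl);             // unload the shared library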

Complete Example: On-device Streaming Transcription in C

Here is the complete cheetah_tutorial.c you can copy, build, and run (complete with PvRecorder). Before building:

  • Replace ${ACCESS_KEY} with your AccessKey from Picovoice Console
  • Update model_path to point to the Cheetah model file (.pv)
  • Update library_path to point to the correct Cheetah library for your platform
  • Update pv_recorder_library_path to point to the correct PvRecorder library for your platform
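
The following is a condensed sketch of how the pieces above fit together, not the verbatim file: the dynamic-loading helpers from Step 3 and the per-function load_symbol resolutions are elided into comments, the paths are placeholders, and the PvRecorder initialization is assumed to follow the recording tutorial:

    #include <signal.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #if defined(_WIN32) || defined(_WIN64)
    #include <windows.h>
    #else
    #include <dlfcn.h>
    #endif

    #include "pv_cheetah.h"
    #include "pv_recorder.h"

    /* open_dl, load_symbol, close_dl, print_dl_error from Step 3 go here */

    static volatile bool is_interrupted = false;

    static void handle_interrupt(int signum) {
        (void) signum;
        is_interrupted = true;
    }

    int main(void) {
        signal(SIGINT, handle_interrupt);

        // load the Cheetah and PvRecorder shared libraries (placeholder paths)
        void *cheetah_dl = open_dl("cheetah/lib/libpv_cheetah.so");
        void *recorder_dl = open_dl("pvrecorder/lib/libpv_recorder.so");
        if (!cheetah_dl || !recorder_dl) {
            print_dl_error("Failed to open a shared library");
            return 1;
        }

        /* resolve the pv_cheetah_*_func and pv_recorder_*_func pointers
           with load_symbol, as shown in Step 5 */

        // initialize Cheetah
        pv_cheetah_t *cheetah = NULL;
        pv_status_t status = pv_cheetah_init_func(
                "${ACCESS_KEY}",              // Picovoice Console AccessKey
                "cheetah/cheetah_params.pv",  // placeholder model path
                1.f,                          // endpoint_duration_sec
                true,                         // enable_automatic_punctuation
                &cheetah);
        if (status != PV_STATUS_SUCCESS) {
            fprintf(stderr, "Failed to initialize Cheetah.\n");
            return 1;
        }

        // initialize and start PvRecorder (see "How to Record Audio in C")
        pv_recorder_t *recorder = NULL;
        /* pv_recorder_init_func(...); pv_recorder_start_func(recorder); */

        // stream microphone audio to Cheetah until Ctrl-C
        const int32_t frame_length = pv_cheetah_frame_length_func();
        int16_t *pcm = malloc(frame_length * sizeof(int16_t));
        while (!is_interrupted) {
            pv_recorder_read_func(recorder, pcm);

            char *transcript = NULL;
            bool is_endpoint = false;
            pv_cheetah_process_func(cheetah, pcm, &transcript, &is_endpoint);
            if (transcript) {
                printf("%s", transcript);
                fflush(stdout);
                pv_cheetah_transcript_delete_func(transcript);
            }
            if (is_endpoint) {
                char *remaining = NULL;
                pv_cheetah_flush_func(cheetah, &remaining);
                printf("%s\n", remaining);
                pv_cheetah_transcript_delete_func(remaining);
            }
        }

        // flush any remaining buffered audio before shutting down
        char *final_transcript = NULL;
        pv_cheetah_flush_func(cheetah, &final_transcript);
        printf("%s\n", final_transcript);
        pv_cheetah_transcript_delete_func(final_transcript);

        // cleanup
        free(pcm);
        /* pv_recorder_stop_func(recorder); pv_recorder_delete_func(recorder); */
        pv_cheetah_delete_func(cheetah);
        close_dl(recorder_dl);
        close_dl(cheetah_dl);
        return 0;
    }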

This is a simplified example but includes all the necessary components to get started. Check out the Cheetah C demo on GitHub for a complete demo application.

Build & Run

Build and run the application:
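
The commands below are examples: the include paths assume the folder structure shown earlier, and the output name cheetah_tutorial is arbitrary. Adjust them to match your project layout.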

Linux (gcc) and Raspberry Pi (gcc)
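
    gcc -std=c99 -O3 -o cheetah_tutorial -I cheetah/include -I pvrecorder/include cheetah_tutorial.c -ldl
    ./cheetah_tutorial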

macOS (clang)
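
    clang -std=c99 -O3 -o cheetah_tutorial -I cheetah/include -I pvrecorder/include cheetah_tutorial.c
    ./cheetah_tutorial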

Windows (MinGW)
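
    gcc -std=c99 -O3 -o cheetah_tutorial.exe -I cheetah/include -I pvrecorder/include cheetah_tutorial.c
    cheetah_tutorial.exe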


Troubleshooting Common Issues

1. Speech-to-Text Returns Silence or No Transcription

Make sure you're capturing audio from the correct microphone. If you're using PvRecorder, check that it's set up properly before proceeding.

2. Partial Words or Truncated Transcriptions

If words appear cut off or transcriptions seem incomplete, you may be terminating the audio stream before all buffered audio has been processed. Cheetah Streaming Speech-to-Text maintains an internal buffer to ensure accurate context-based recognition.

Solution: Always call pv_cheetah_flush after you've finished streaming audio. This function processes any remaining buffered audio and returns the final transcript segment.

3. Increase Transcription Speed

If transcription is not as fast as your application needs with the default model, the model variant is the place to look.

Solution: Switch to a fast model variant designed for lower latency. Fast models process audio more quickly with a minor reduction in accuracy—typically acceptable for real-time applications where responsiveness is critical.

4. Library Initialization Fails on Target Platform

If Cheetah fails to initialize, you may be using an incorrect library binary for your system architecture.

Solution: Download the correct library file for your specific platform and architecture combination (e.g., Linux x86_64, macOS ARM64, Windows x86_64, Raspberry Pi). The library file extension varies by platform: .so (Linux), .dylib (macOS), .dll (Windows).


Frequently Asked Questions

Can I use multiple STT engines simultaneously in C?
Yes. You can run each engine on its own thread with separate audio buffers. Ensure proper synchronization to avoid race conditions.
What is the ideal audio frame size for streaming STT in C?
Frame sizes of 256–1024 samples are common; Picovoice engines typically require a frame size of 512. Smaller frames reduce latency but increase CPU usage; larger frames reduce CPU load but increase latency.
How do I compile for cross-platform deployment from a single codebase?
The code in this tutorial is already cross-platform. Use conditional compilation directives (e.g. "#if defined(_WIN32)") to handle platform-specific library loading. Compile with the appropriate compiler for each target platform: gcc for Linux and Raspberry Pi, clang for macOS, MinGW for Windows.
How do I handle microphone input on different platforms?
Use a cross-platform library like PvRecorder. It abstracts away platform-specific APIs and provides a consistent interface for capturing live audio from microphones on Linux, Windows, macOS, and Raspberry Pi.