Enterprise developers building real-time speech recognition in C face challenges with audio pipeline complexity, memory management, and cross-platform compatibility. Streaming speech-to-text (STT) solutions must also work reliably on Linux, Windows, macOS, and Raspberry Pi while maintaining minimal latency.
Cloud-based services like Azure Real-Time STT, Amazon Transcribe Streaming, and Google Streaming ASR require constant internet connectivity and send audio data to remote servers. For applications that handle sensitive audio or run in environments with unreliable connectivity, on-device speech recognition is essential.
This tutorial shows how to implement cross-platform streaming speech to text in C using Cheetah Streaming Speech-to-Text, an on-device engine compatible with Linux, Windows, macOS, and Raspberry Pi. You'll learn to capture live microphone input, process audio frames in real time, and generate accurate transcriptions—all from a single codebase.
By the end, you'll have a working C application that performs real-time transcription on every supported platform.
Important: This guide builds on How to Record Audio in C. If you haven't completed that setup yet, start with that tutorial to get your recording environment in place.
Prerequisites
- C99-compatible compiler
- Windows: MinGW
Supported Platforms
- Linux (x86_64)
- macOS (x86_64, arm64)
- Windows (x86_64, arm64)
- Raspberry Pi (3, 4, 5)
Project Setup
This is the folder structure used in this tutorial. You can organize your files differently if you like, but make sure to update the paths in the examples accordingly:
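As a concrete example, a layout along these lines works with the paths used later in this tutorial (the exact file and folder names here are assumptions, not requirements):

```
cheetah_tutorial/
├── cheetah_tutorial.c      # the source file built at the end of this tutorial
├── cheetah/
│   ├── pv_cheetah.h        # Cheetah header files
│   ├── picovoice.h
│   ├── cheetah_params.pv   # Cheetah model file
│   └── libpv_cheetah.so    # platform library (.so / .dylib / .dll)
└── pvrecorder/             # PvRecorder files from "How to Record Audio in C"
```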
To set up audio capture (pvrecorder), refer to How to Record Audio in C.
Step 1. Add Cheetah library files
- Create a folder named `cheetah/`.
- Download the Cheetah header files from GitHub and place them inside the `cheetah/` folder.
- Download a Cheetah model file and the correct library file for your platform and place them in the same folder.
If your application needs to recognize custom vocabulary, boost recognition of specific phrases, or handle custom pronunciations, train a custom STT model instead of using one of the default model files.
Implement Dynamic Loading
Cheetah distributes pre-built platform libraries, meaning:
- the shared library (`.so`, `.dylib`, `.dll`) is not linked at compile time
- the program loads it at runtime
- functions must be retrieved by name
So, we need to write small helper functions to:
- open the shared library
- look up function pointers
- close the library
Step 2. Include platform-specific headers
Why these matter
- On Windows systems, `windows.h` provides the `LoadLibrary` function to load a shared library and `GetProcAddress` to retrieve individual function pointers.
- On Unix-based systems, `dlopen` and `dlsym` from the `dlfcn.h` header provide the same functionality.
- Lastly, `signal.h` allows us to handle `Ctrl-C` later in this example.
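A minimal sketch of the includes (the exact set may vary with the rest of your file):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include <signal.h>   // to handle Ctrl-C (SIGINT)

#if defined(_WIN32) || defined(_WIN64)
#include <windows.h>  // LoadLibrary, GetProcAddress, FreeLibrary
#else
#include <dlfcn.h>    // dlopen, dlsym, dlclose, dlerror
#endif
```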
Step 3. Define dynamic loading helper functions
3a. Open the shared library
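A sketch of a small wrapper (the helper names here are our own, not part of the Cheetah API):

```c
// Opens the shared library at dl_path and returns an opaque handle, or NULL on failure.
static void *open_dl(const char *dl_path) {
#if defined(_WIN32) || defined(_WIN64)
    return LoadLibrary(dl_path);
#else
    return dlopen(dl_path, RTLD_NOW);
#endif
}
```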
3b. Load function symbols
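Resolving a function by name follows the same pattern:

```c
// Looks up a symbol by name in an opened library; returns NULL if it cannot be found.
static void *load_symbol(void *handle, const char *symbol_name) {
#if defined(_WIN32) || defined(_WIN64)
    return (void *) GetProcAddress((HMODULE) handle, symbol_name);
#else
    return dlsym(handle, symbol_name);
#endif
}
```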
3c. Close the library
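And releasing the handle when we're done:

```c
// Unloads the shared library.
static void close_dl(void *handle) {
#if defined(_WIN32) || defined(_WIN64)
    FreeLibrary((HMODULE) handle);
#else
    dlclose(handle);
#endif
}
```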
3d. Print platform-correct errors
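Finally, a helper that reports loading failures using the error facility each platform provides:

```c
// Prints a message plus the most recent dynamic-loading error for the current platform.
static void print_dl_error(const char *message) {
#if defined(_WIN32) || defined(_WIN64)
    fprintf(stderr, "%s with code '%lu'.\n", message, GetLastError());
#else
    fprintf(stderr, "%s with '%s'.\n", message, dlerror());
#endif
}
```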
Implement Streaming Speech-to-Text
Now that we've set up dynamic loading, we can actually use the Cheetah API.
Step 4. Load the library file
Download the correct library file for your platform and point `library_path` to the file.
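Putting the helpers from Step 3 to work might look like the sketch below. It assumes `pv_cheetah.h` (which declares `pv_status_t` and `pv_cheetah_t`) is included at the top of the file, and that `pv_cheetah_init` takes the parameters described in Step 5. The path and the function-pointer names (`pv_cheetah_init_func`, etc.) are our own choices; only the symbol names (`pv_cheetah_init`, ...) come from the Cheetah library:

```c
// Example path for Linux; use the .dylib/.dll you downloaded on macOS/Windows.
const char *library_path = "cheetah/libpv_cheetah.so";

void *cheetah_library = open_dl(library_path);
if (!cheetah_library) {
    print_dl_error("Failed to open the Cheetah library");
    exit(EXIT_FAILURE);
}

// Resolve the entry points we need by name; the remaining Cheetah functions
// (pv_cheetah_process, pv_cheetah_flush, pv_cheetah_delete, ...) are loaded the same way.
typedef pv_status_t (*pv_cheetah_init_fn)(const char *, const char *, float, bool, pv_cheetah_t **);
pv_cheetah_init_fn pv_cheetah_init_func =
        (pv_cheetah_init_fn) load_symbol(cheetah_library, "pv_cheetah_init");
if (!pv_cheetah_init_func) {
    print_dl_error("Failed to load 'pv_cheetah_init'");
    exit(EXIT_FAILURE);
}
```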
Step 5. Initialize Cheetah
- Sign up for an account on Picovoice Console for free and obtain your `AccessKey`.
- Replace `${ACCESS_KEY}` with your `AccessKey`.
- Download a model file and point `model_path` to the file. You can choose between default and fast models for each supported language.
Call pv_cheetah_init to create a Cheetah instance:
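A sketch of the call, assuming the parameter order described below and a `pv_cheetah_init_func` pointer resolved as in Step 4 (the paths and values are placeholders):

```c
pv_cheetah_t *cheetah = NULL;
pv_status_t status = pv_cheetah_init_func(
        "${ACCESS_KEY}",               // your Picovoice Console AccessKey
        "cheetah/cheetah_params.pv",   // model_path: example location of the .pv model file
        1.0f,                          // endpoint_duration_sec (0 disables endpoint detection)
        true,                          // enable_automatic_punctuation
        &cheetah);
if (status != PV_STATUS_SUCCESS) {
    fprintf(stderr, "Failed to initialize Cheetah (status %d).\n", (int) status);
    exit(EXIT_FAILURE);
}
```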
Explanation of parameters:
- `access_key`: Picovoice Console `AccessKey`.
- `model_path`: path to the desired language model, or to a custom-trained model.
- `endpoint_duration_sec`: duration of endpoint in seconds. A speech endpoint is detected when there is a segment of audio (with the duration specified here) after an utterance without any speech in it. Set to `0` to disable endpoint detection.
- `enable_automatic_punctuation`: set to `true` to enable automatic punctuation insertion.
Step 6. Transcribe audio
Pass recorded audio frames (with PvRecorder) to Cheetah for processing:
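A sketch of the processing loop. It assumes the Cheetah function pointers were resolved as in Step 4, `recorder` is a started PvRecorder instance set up as in the recording tutorial, and `is_interrupted` is a flag our SIGINT handler sets on Ctrl-C:

```c
const int32_t frame_length = pv_cheetah_frame_length_func();
int16_t *pcm = malloc(frame_length * sizeof(int16_t));
if (!pcm) {
    fprintf(stderr, "Failed to allocate the audio frame buffer.\n");
    exit(EXIT_FAILURE);
}

while (!is_interrupted) {
    // Read one frame of samples from the microphone (error handling omitted for brevity;
    // PvRecorder setup is covered in "How to Record Audio in C").
    pv_recorder_read(recorder, pcm);

    char *partial_transcript = NULL;
    bool is_endpoint = false;
    pv_status_t status = pv_cheetah_process_func(cheetah, pcm, &partial_transcript, &is_endpoint);
    if (status != PV_STATUS_SUCCESS) {
        fprintf(stderr, "Cheetah processing failed (status %d).\n", (int) status);
        break;
    }

    if (partial_transcript != NULL) {
        printf("%s", partial_transcript);  // print the partial transcript as it arrives
        fflush(stdout);
        free(partial_transcript);          // release the returned transcript buffer
    }

    if (is_endpoint) {
        printf("\n");  // natural pause detected; start a new line for the next utterance
    }
}

// The stream has ended: flush any audio still buffered inside Cheetah.
char *final_transcript = NULL;
if (pv_cheetah_flush_func(cheetah, &final_transcript) == PV_STATUS_SUCCESS && final_transcript != NULL) {
    printf("%s\n", final_transcript);
    free(final_transcript);
}
free(pcm);
```

The transcript buffers are released with `free()` here; depending on the SDK version you downloaded, `pv_cheetah.h` may expose a dedicated transcript-deletion function, in which case prefer it.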
Explanation:
- `pv_cheetah_frame_length`: Required number of samples per frame.
- `pv_cheetah_process`: Buffers audio until sufficient context is available, then returns a partial transcript; otherwise, it returns `NULL`.
- `is_endpoint`: Indicates a natural pause in speech, marking a possible end of an utterance.
- `pv_cheetah_flush`: Transcribes any remaining buffered audio.
Step 7. Cleanup
When done, delete Cheetah to free memory:
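Assuming the handles from the previous steps, cleanup is a matter of deleting the engine and unloading the library:

```c
pv_cheetah_delete_func(cheetah);   // free the Cheetah engine
close_dl(cheetah_library);         // unload the shared library

// If you followed the recording tutorial, also stop and delete the recorder here.
```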
Complete Example: On-device Streaming Transcription in C
Here is the complete cheetah_tutorial.c you can copy, build, and run (complete with PvRecorder):
- Replace `${ACCESS_KEY}` with your `AccessKey` from Picovoice Console.
- Update `model_path` to point to the Cheetah model file (`.pv`).
- Update `library_path` to point to the correct Cheetah library for your platform.
- Update `pv_recorder_library_path` to point to the correct PvRecorder library for your platform.
This is a simplified example but includes all the necessary components to get started. Check out the Cheetah C demo on GitHub for a complete demo application.
Build & Run
Build and run the application:
Linux (gcc) and Raspberry Pi (gcc)
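Assuming the source file is cheetah_tutorial.c, a typical invocation looks like this (`-ldl` links the dynamic-loading functions used by `dlopen`):

```
gcc -std=c99 -O2 -o cheetah_tutorial cheetah_tutorial.c -ldl
./cheetah_tutorial
```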
macOS (clang)
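On macOS, `dlopen` lives in the system library, so no extra linker flag is needed:

```
clang -std=c99 -O2 -o cheetah_tutorial cheetah_tutorial.c
./cheetah_tutorial
```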
Windows (MinGW)
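With MinGW, `LoadLibrary` and `GetProcAddress` come from the Win32 API, so again no extra flags are required:

```
gcc -std=c99 -O2 -o cheetah_tutorial.exe cheetah_tutorial.c
cheetah_tutorial.exe
```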
Troubleshooting Common Issues
1. Speech-to-Text Returns Silence or No Transcription
Make sure you're capturing audio from the correct microphone. If you're using PvRecorder, check that it's set up properly before proceeding.
2. Partial Words or Truncated Transcriptions
If words appear cut off or transcriptions seem incomplete, you may be terminating the audio stream before all buffered audio has been processed. Cheetah Streaming Speech-to-Text maintains an internal buffer to ensure accurate context-based recognition.
Solution: Always call pv_cheetah_flush after you've finished streaming audio. This function processes any remaining buffered audio and returns the final transcript segment.
3. Increase Transcription Speed
If you need faster transcriptions than the default model provides, consider using a Cheetah fast model instead.
Solution: Switch to a fast model variant designed for lower latency. Fast models process audio more quickly with a minor reduction in accuracy—typically acceptable for real-time applications where responsiveness is critical.
4. Library Initialization Fails on Target Platform
If Cheetah fails to initialize, you may be using an incorrect library binary for your system architecture.
Solution: Download the correct library file for your specific platform and architecture combination (e.g., Linux x86_64, macOS ARM64, Windows x86_64, Raspberry Pi). The library file extension varies by platform: .so (Linux), .dylib (macOS), .dll (Windows).







