Running large language models (LLMs) locally gives you on-device inference, low-latency responses, and full control over your data. To deploy LLMs on desktop or embedded systems, you need a quantized model file produced using methods such as GPTQ, AWQ, LLM.int8(), SqueezeLLM, etc., along with an inference engine.
If you've explored local LLM deployment, you've likely encountered llama.cpp and Ollama, two popular open-source options. While these work well for experimentation, they present challenges for production C applications. Ollama lacks C bindings entirely, which complicates native application integration and embedded system deployment. Both rely on community support and limited documentation, leaving enterprise developers to handle performance optimization, maintenance, and security assessment on their own for production environments.
For cross-platform LLM applications in C, picoLLM provides a native C API with comprehensive documentation that runs on Windows, macOS, Linux, and Raspberry Pi. It's designed for memory-constrained and compute-limited devices, making it suitable for embedded LLM inference and edge AI deployment.
This tutorial shows how to build a production-ready application using the picoLLM C SDK for on-device AI. You'll learn how to implement local LLM inference that runs consistently across Windows, macOS, Linux, and Raspberry Pi from a single C codebase.
Tutorial Project Prerequisites
- C99-compatible compiler
- Sufficient disk space for quantized model files (varies by model, typically 1-30 GB)
Supported platforms:
- Linux (x86_64)
- macOS (x86_64, arm64)
- Windows (x86_64, arm64)
- Raspberry Pi (4, 5)
Part 1. Set Up Your Project Structure
To keep things simple, we'll use the following directory structure:
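The layout below is one reasonable arrangement; the folder names (include, lib, models) are assumptions of this tutorial rather than requirements:

```
picollm_tutorial/
├── picollm_tutorial.c   # the demo source file built in this tutorial
├── include/             # picoLLM header files (Part 1, Step 2)
├── lib/                 # platform-specific picoLLM shared library (Part 2)
└── models/              # downloaded .pllm model file (Part 1, Step 1)
```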
If you choose to organize your files differently, update the paths in the examples accordingly.
Step 1. Download a Quantized Language Model
- Go to Picovoice Console and create an account.
- Navigate to the picoLLM page and download a quantized model.
- Place the downloaded model file (.pllm) in your project's model directory.
picoLLM supports various language models including Phi, Gemma, Llama, Mistral, and more. Choose a model appropriate for your hardware capabilities and use case.
Step 2. Add picoLLM C Library Header Files
The picoLLM C API requires header files that define the function signatures. Download the picoLLM header files from GitHub and place them in your project's include directory.
Step 3. Include Required Headers
Here are the headers we'll need to build the demo application:
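A minimal set of includes for the demo might look like the following. The picoLLM header filename is an assumption here; use whatever names the files you downloaded from GitHub actually have:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// picoLLM C API declarations; the exact filename depends on the release
// you downloaded from GitHub (assumed here to be pv_picollm.h).
#include "pv_picollm.h"
```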
Part 2. Dynamic Library Loading
picoLLM distributes pre-built platform-specific shared libraries, which means:
- The shared library is not linked at compile time
- Your C program loads the library dynamically at runtime using platform APIs
- Function pointers must be retrieved by symbol name from the loaded library
This approach lets the same C source work across operating systems without linking against a platform-specific library at compile time. We'll implement helper functions to:
- Open the shared library using platform-specific APIs
- Look up function pointers by name
- Close the library when finished
Step 1. Add the Shared Library File
Download the appropriate platform-specific library file for your system and place it in your project's library directory.
The file should have the correct extension for your platform: .so (Linux), .dylib (macOS), or .dll (Windows).
Step 2. Include Platform-Specific Headers for Dynamic Loading
Understanding the cross-platform headers:
- On Windows systems, windows.h provides LoadLibrary() to load DLL files and GetProcAddress() to retrieve function pointers from the loaded library.
- On Unix-based systems (Linux, macOS), dlopen() and dlsym() from dlfcn.h provide equivalent functionality for loading shared libraries (.so, .dylib).
- signal.h enables handling Ctrl-C (SIGINT) interruptions, which we'll use later to interrupt LLM text generation (see the includes sketch below).
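Putting these together, the conditional includes at the top of the file might look like this sketch:

```c
#include <signal.h>

#if defined(_WIN32) || defined(_WIN64)
#include <windows.h>   // LoadLibrary(), GetProcAddress(), FreeLibrary(), GetLastError()
#else
#include <dlfcn.h>     // dlopen(), dlsym(), dlclose(), dlerror()
#endif
```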
Step 3. Add Cross-Platform Dynamic Loading Helper Functions
These wrapper functions abstract away platform differences, making your code portable across operating systems.
Open the Shared Library (LoadLibrary on Windows, dlopen on Unix)
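A sketch of the open helper. The helper names used in this tutorial (open_dl, load_symbol, close_dl, print_dl_error) are illustrative, not part of the picoLLM API:

```c
// Opens the shared library at the given path; returns NULL on failure.
static void *open_dl(const char *dl_path) {
#if defined(_WIN32) || defined(_WIN64)
    return (void *) LoadLibraryA(dl_path);
#else
    return dlopen(dl_path, RTLD_NOW);
#endif
}
```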
Load Function Symbols from the Library
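A matching lookup helper, again a sketch:

```c
// Looks up a function pointer by its exported symbol name.
static void *load_symbol(void *handle, const char *symbol) {
#if defined(_WIN32) || defined(_WIN64)
    return (void *) GetProcAddress((HMODULE) handle, symbol);
#else
    return dlsym(handle, symbol);
#endif
}
```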
Close the Library and Free Resources
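And a helper to release the handle when the program is done with it:

```c
// Unloads the shared library and releases its handle.
static void close_dl(void *handle) {
#if defined(_WIN32) || defined(_WIN64)
    FreeLibrary((HMODULE) handle);
#else
    dlclose(handle);
#endif
}
```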
Print Platform-Correct Error Messages
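Finally, a small helper that reports loading errors using the appropriate platform API:

```c
// Prints a platform-appropriate description of the last loading error.
static void print_dl_error(const char *message) {
#if defined(_WIN32) || defined(_WIN64)
    fprintf(stderr, "%s with code '%lu'.\n", message, GetLastError());
#else
    fprintf(stderr, "%s with '%s'.\n", message, dlerror());
#endif
}
```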
Step 4. Load the picoLLM Shared Library at Runtime
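With the helpers in place, loading the library is a single call. PICOLLM_LIBRARY_PATH is a placeholder macro for wherever you placed the library file:

```c
// PICOLLM_LIBRARY_PATH is a placeholder for the library location you chose;
// the exact filename depends on the release you downloaded.
void *picollm_library = open_dl(PICOLLM_LIBRARY_PATH);
if (!picollm_library) {
    print_dl_error("Failed to open the picoLLM library");
    exit(EXIT_FAILURE);
}
```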
Step 5. Load Required picoLLM API Functions
Before calling any picoLLM functions, you must load them from the shared library:
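A sketch of the lookups. Each void pointer is later cast to a function-pointer type that matches the corresponding prototype in the picoLLM header; pv_picollm_delete follows the usual Picovoice naming convention, so confirm the exact symbol names against the header you downloaded:

```c
// Look up each picoLLM API function by its exported symbol name.
void *pv_picollm_init_ptr      = load_symbol(picollm_library, "pv_picollm_init");
void *pv_picollm_generate_ptr  = load_symbol(picollm_library, "pv_picollm_generate");
void *pv_picollm_interrupt_ptr = load_symbol(picollm_library, "pv_picollm_interrupt");
void *pv_picollm_delete_ptr    = load_symbol(picollm_library, "pv_picollm_delete");

if (!pv_picollm_init_ptr || !pv_picollm_generate_ptr ||
        !pv_picollm_interrupt_ptr || !pv_picollm_delete_ptr) {
    print_dl_error("Failed to load a picoLLM function symbol");
    exit(EXIT_FAILURE);
}
```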
We'll explain each function in detail as we use them in the text generation workflow.
Part 3. Implement Local LLM Inference in C
Now that we've set up dynamic loading, we can use the picoLLM API to run language model inference locally on your machine without any cloud dependencies.
Step 1. Initialize the Local LLM Engine
Initialize the picoLLM engine with your model file and access credentials:
- Copy your AccessKey from Picovoice Console
- Replace ${ACCESS_KEY} with your actual AccessKey
- Update model_path to point to your downloaded picoLLM model file (.pllm)
Call pv_picollm_init to create a picoLLM instance:
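A sketch of the call, assuming the parameter order described below; pv_picollm_init here stands for the function pointer loaded in Part 2, cast to a typedef matching the prototype in the picoLLM header, and access_key/model_path are set up as described above:

```c
pv_picollm_t *picollm = NULL;
pv_status_t status = pv_picollm_init(
        access_key,    // your AccessKey from Picovoice Console
        model_path,    // path to the downloaded .pllm model file
        "best",        // device_string: "best", "cpu", or "gpu"
        &picollm);
if (status != PV_STATUS_SUCCESS) {
    fprintf(stderr, "Failed to initialize picoLLM.\n");
    exit(EXIT_FAILURE);
}
```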
Explanation of initialization parameters:
- access_key: Your Picovoice Console AccessKey for authentication
- model_path: Filesystem path to your quantized picoLLM model file (.pllm format)
- device_string: Inference device selection. Use "best" for automatic selection (GPU if available, otherwise CPU), "cpu" to force CPU-only inference, or "gpu" for GPU acceleration if your hardware supports it.
Step 2. Set Up Callback for Streaming Token Generation
Define a streaming callback function to receive generated tokens in real-time as the LLM produces them, enabling a streaming ChatGPT-style user experience:
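A minimal callback sketch; the exact callback signature is defined by a typedef in the picoLLM header, and the (token, context) shape used here is an assumption to verify against it:

```c
// Receives each newly generated token and prints it immediately so output
// appears as it is produced.
static void stream_callback(const char *token, void *context) {
    (void) context;  // no per-call state needed in this demo
    fprintf(stdout, "%s", token);
    fflush(stdout);
}
```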
Step 3. Get User Prompt Input for LLM Inference
Read the user's prompt from standard input:
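A minimal sketch using a fixed-size buffer:

```c
char prompt[1024];

fprintf(stdout, "Enter your prompt: ");
if (fgets(prompt, sizeof(prompt), stdin) == NULL) {
    fprintf(stderr, "Failed to read prompt.\n");
    exit(EXIT_FAILURE);
}
prompt[strcspn(prompt, "\n")] = '\0';  // strip the trailing newline
```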
This example uses a fixed buffer for simplicity. In production applications, consider dynamic memory allocation to handle prompts of any length.
Step 4. Configure Text Generation Parameters
Configure the following parameters to control the LLM's text generation behavior, creativity, and output characteristics:
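A sketch of the knobs used in this tutorial; the values are illustrative defaults rather than recommendations, and the full parameter set is defined in the picoLLM header:

```c
const int32_t completion_token_limit = 256;   // cap on the number of generated tokens
const float   temperature            = 0.7f;  // randomness of token selection
const float   top_p                  = 0.9f;  // nucleus-sampling cutoff
const float   presence_penalty       = 0.0f;  // discourage reusing tokens that already appeared
const float   frequency_penalty      = 0.0f;  // discourage frequent repetition
```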
Refer to the official picoLLM C API documentation for complete details.
Step 5. Generate Text with Streaming LLM Inference
Call the generation function to run local LLM inference with real-time streaming output:
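A sketch of the call. The argument order shown follows the generation parameters above plus the outputs described below, but it is an assumption of this tutorial, so verify the struct names and parameter order against the pv_picollm_generate prototype in the header you downloaded:

```c
pv_picollm_usage_t usage;
pv_picollm_endpoint_t endpoint;
pv_picollm_completion_token_t *completion_tokens = NULL;
int32_t num_completion_tokens = 0;
char *completion = NULL;

pv_status_t gen_status = pv_picollm_generate(
        picollm,
        prompt,
        completion_token_limit,
        NULL,                    // stop_phrases (none used in this demo)
        0,                       // num_stop_phrases
        0,                       // seed (fixed here for reproducibility)
        presence_penalty,
        frequency_penalty,
        temperature,
        top_p,
        0,                       // num_top_choices
        stream_callback,
        NULL,                    // stream_callback_context
        &usage,
        &endpoint,
        &completion_tokens,
        &num_completion_tokens,
        &completion);
if (gen_status != PV_STATUS_SUCCESS) {
    fprintf(stderr, "Text generation failed.\n");
}
fprintf(stdout, "\n");
```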
Explanation of the generation function:
- pv_picollm_generate: Main API function for running LLM text generation and inference
- Returns detailed usage statistics (token counts), endpoint information, and the complete generated text string
- Streaming happens in real-time through the callback function during generation
- completion_tokens provides detailed metadata about each generated token
Step 6. Handle User Interruptions During Generation
Allow users to gracefully interrupt long-running LLM inference with Ctrl-C:
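A sketch of the handler; it assumes the picoLLM handle and the loaded pv_picollm_interrupt pointer are stored in file-scope variables so the handler can reach them:

```c
// Stops generation when the user presses Ctrl-C. Signal handlers cannot take
// extra arguments, so this sketch assumes picollm is visible at file scope.
static void interrupt_handler(int signum) {
    (void) signum;
    if (picollm != NULL) {
        pv_picollm_interrupt(picollm);
    }
}

// Register the handler (in main) before calling pv_picollm_generate:
signal(SIGINT, interrupt_handler);
```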
Explanation of interrupt handling:
- pv_picollm_interrupt: Gracefully stops text generation without corrupting state
- The signal handler catches Ctrl-C (SIGINT) and calls the interrupt function
Step 7. Cleanup and Free Memory Resources
When finished with LLM inference, free all allocated resources:
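A sketch of the teardown. pv_picollm_delete follows the usual Picovoice naming convention, and the outputs of pv_picollm_generate have their own release functions declared in the header (names not repeated here), so confirm against the header you downloaded:

```c
// Release the outputs of pv_picollm_generate (the completion string and the
// completion_tokens array) using the release functions declared in the header.

pv_picollm_delete(picollm);   // release the picoLLM engine
close_dl(picollm_library);    // unload the shared library
```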
Complete Example: Local LLM Inference in C
Here is the complete picollm_tutorial.c implementation you can copy, compile, and run on Windows, Linux, macOS, and Raspberry Pi:
picollm_tutorial.c
Before compiling and running:
- Replace PV_ACCESS_KEY with your AccessKey from Picovoice Console
- Update PICOLLM_MODEL_PATH to point to your downloaded quantized model file (.pllm)
- Update PICOLLM_LIBRARY_PATH to point to the correct platform-specific library (.so for Linux, .dylib for macOS, .dll for Windows)
This is a simplified example but contains all the necessary pieces to get you started. Check out the picoLLM C demo on GitHub for a more complete demo application.
Build & Run
Build and run the application:
Linux and Raspberry Pi
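For example, with gcc (the flags are illustrative; -ldl is needed on some Linux toolchains for dlopen):

```
gcc -std=c99 -O3 -o picollm_tutorial picollm_tutorial.c -ldl
./picollm_tutorial
```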
macOS
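With the Xcode command-line tools installed, clang works the same way (no extra linker flag is needed for dlopen on macOS):

```
clang -std=c99 -O3 -o picollm_tutorial picollm_tutorial.c
./picollm_tutorial
```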
Windows
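From a Developer Command Prompt, the MSVC compiler can build it directly (flags are illustrative):

```
cl /O2 picollm_tutorial.c
picollm_tutorial.exe
```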
Troubleshooting Common Issues
Library Loading Fails
Problem: The dynamic loader cannot find or open the shared library file.
Solution:
- Verify the library file matches your platform and architecture: .so (Linux), .dylib (macOS), .dll (Windows)
- Check that the library file exists at the specified path
Slow Generation Speed or High Memory Usage
Problem: Text generation is extremely slow or the program uses excessive memory.
Solution:
- Try a smaller quantized model
- Set device_string to "cpu" if GPU acceleration is causing issues
- Reduce completion_token_limit to limit generation length
- Ensure you are freeing memory resources properly
Generation Produces Gibberish or Repeated Text
Problem: The LLM output is incoherent, repetitive, or nonsensical.
Solution:
- Lower the temperature parameter for more focused output
- Increase frequency_penalty to reduce repetition
- Adjust top_p to a lower value for more deterministic output
- Try a different or larger model if output quality remains poor
Next Steps: Building a Complete Voice AI Assistant
With local LLM inference in place, you can build a complete voice-driven assistant by adding speech input, speech output, wake word detection, and optional MCP tool invocation.
Wake word detection (Porcupine): Acts as the entry point to the assistant. It continuously monitors audio and activates the pipeline only when a predefined wake phrase is detected, keeping the system idle otherwise.
Speech-to-text (Cheetah Streaming STT): Handles user input after activation by converting spoken queries into text that can be passed directly to the LLM for interpretation and response generation.
Integration with MCP (optional): Serves as the decision and reasoning layer of the assistant. It generates natural-language responses and, when needed, invokes external tools or system actions through the Model Context Protocol (MCP).
Text-to-speech (Orca Streaming TTS): Converts generated text into audible speech, allowing responses to be spoken aloud back to the user in real time.
Frequently Asked Questions
Once you've downloaded the model file and obtained your AccessKey, picoLLM runs inference completely offline. The AccessKey validation happens during initialization and can work offline after initial verification. All inference runs locally on your device without any external API calls.
Smaller models like Phi-2, Gemma-2B, or Llama-3.2-1B work well on devices with 4-8GB RAM. These quantized models are optimized for resource-constrained environments while maintaining reasonable output quality.
Temperature controls randomness in token selection. Low temperature (0.1-0.3) makes output deterministic and focused, ideal for factual tasks, code generation, or structured data extraction. High temperature (0.7-1.0) increases creativity and diversity, better for creative writing or brainstorming.
To run an LLM in C, you need a quantized model file, an inference engine with C bindings, and hardware acceleration support. Start by downloading a quantized model—these compressed versions reduce memory usage and improve inference speed on local hardware. Next, choose an inference engine that provides native C bindings. Popular options include llama.cpp, which offers a C-compatible API, and picoLLM, which provides a native C SDK designed for cross-platform deployment on Windows, macOS, Linux, and Raspberry Pi. The basic workflow involves initializing the inference engine, loading your model into memory, configuring generation parameters like temperature and max tokens, and calling the inference function with your prompt. Most engines support streaming callbacks for token-by-token output. For production applications, you'll need to handle memory management, error checking, threading, and hardware acceleration. The picoLLM C SDK includes built-in support for these requirements with comprehensive documentation for local LLM inference in C.
Yes, picoLLM supports GPU acceleration on compatible hardware. Set device_string to "gpu" or "best" (automatic selection). GPU acceleration significantly speeds up inference, especially for larger models.