
Running large language models (LLMs) locally gives you on-device inference, low-latency responses, and full control over your data. To deploy LLMs on desktop or embedded systems, you need a quantized model file produced using methods such as GPTQ, AWQ, LLM.int8(), SqueezeLLM, etc., along with an inference engine.

If you've explored local LLM deployment, you've likely encountered llama.cpp and Ollama, two popular open-source options. While these work well for experimentation, they present challenges for production C applications. Ollama lacks C bindings entirely, complicating native application integration and embedded system deployment. Both rely on community support and have limited documentation, leaving enterprise developers to handle performance optimization, maintenance, and security assessment on their own before deploying to production environments.

For cross-platform LLM applications in C, picoLLM provides a native C API with comprehensive documentation that runs on Windows, macOS, Linux, and Raspberry Pi. It's designed for memory-constrained and compute-limited devices, making it suitable for embedded LLM inference and edge AI deployment.

This tutorial shows how to build a production-ready application using the picoLLM C SDK for on-device AI. You'll learn how to implement local LLM inference that runs consistently across Windows, macOS, Linux, and Raspberry Pi from a single C codebase.


Tutorial Project Prerequisites

  • C99-compatible compiler
  • Sufficient disk space for quantized model files (varies by model, typically 1-30 GB)

Supported platforms:

  • Linux (x86_64)
  • macOS (x86_64, arm64)
  • Windows (x86_64, arm64)
  • Raspberry Pi (4, 5)

Part 1. Set Up Your Project Structure

To keep things simple, we'll use the following directory structure:
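
One illustrative layout is shown below; the folder names are placeholders we use in this tutorial's snippets, not something the SDK requires:

    picollm-tutorial/
    ├── picollm_tutorial.c      (the demo source built in this tutorial)
    ├── include/                (picoLLM C header files)
    ├── lib/                    (platform-specific shared library)
    └── models/                 (downloaded .pllm model file)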

If you choose to organize your files differently, update the paths in the examples accordingly.

Step 1. Download a Quantized Language Model

  1. Go to Picovoice Console and create an account.
  2. Navigate to the picoLLM page and download a quantized model.
  3. Place the downloaded model file (.pllm) in:

picoLLM supports various language models including Phi, Gemma, Llama, Mistral, and more. Choose a model appropriate for your hardware capabilities and use case.

Step 2. Add picoLLM C Library Header Files

The picoLLM C API requires header files that define the function signatures. Download the picoLLM header files from GitHub and place them in:

Step 3. Include Required Headers

Here are the headers we'll need to build the demo application:
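
A minimal sketch of the includes, assuming the picoLLM headers from the picoLLM GitHub repository (the main header is pv_picollm.h, which pulls in the shared Picovoice definitions) are on your include path:

    #include <stdint.h>      /* fixed-width integer types used by the API */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #include "pv_picollm.h"  /* picoLLM C API types and status codes */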

Part 2. Dynamic Library Loading

picoLLM distributes pre-built platform-specific shared libraries, which means:

  • The shared library is not linked at compile time
  • Your C program loads the library dynamically at runtime using platform APIs
  • Function pointers must be retrieved by symbol name from the loaded library

This approach enables cross-platform compatibility without recompiling for each operating system. We'll implement helper functions to:

  1. Open the shared library using platform-specific APIs
  2. Look up function pointers by name
  3. Close the library when finished

Step 1. Add the Shared Library File

Download the appropriate platform-specific library file for your system and place it in:

The file should have the correct extension for your platform: .so on Linux, .dylib on macOS, or .dll on Windows.

Step 2. Include Platform-Specific Headers for Dynamic Loading

Understanding the cross-platform headers:

  • On Windows systems, windows.h provides LoadLibrary() to load DLL files and GetProcAddress() to retrieve function pointers from the loaded library.
  • On Unix-based systems (Linux, macOS), dlopen() and dlsym() from dlfcn.h provide equivalent functionality for loading shared libraries (.so, .dylib).
  • signal.h enables handling Ctrl-C (SIGINT), which we'll use later to let users interrupt LLM text generation (see the includes sketch after this list).
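
One way to write the conditional includes described above:

    #include <signal.h>

    #if defined(_WIN32) || defined(_WIN64)
    #include <windows.h>   /* LoadLibrary, GetProcAddress, FreeLibrary */
    #else
    #include <dlfcn.h>     /* dlopen, dlsym, dlclose */
    #endif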

Step 3. Add Cross-Platform Dynamic Loading Helper Functions

These wrapper functions abstract away platform differences, making your code portable across operating systems.

Open the Shared Library (LoadLibrary on Windows, dlopen on Unix)
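
A minimal helper, here called open_dl (the helper name is ours, not part of the picoLLM API):

    static void *open_dl(const char *dl_path) {
    #if defined(_WIN32) || defined(_WIN64)
        /* LoadLibraryA always takes a char * path, regardless of UNICODE settings */
        return (void *) LoadLibraryA(dl_path);
    #else
        return dlopen(dl_path, RTLD_NOW);
    #endif
    }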

Load Function Symbols from the Library
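
A matching helper, load_symbol, that returns the address of a named symbol from the loaded library:

    static void *load_symbol(void *handle, const char *symbol) {
    #if defined(_WIN32) || defined(_WIN64)
        return (void *) GetProcAddress((HMODULE) handle, symbol);
    #else
        return dlsym(handle, symbol);
    #endif
    }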

Close the Library and Free Resources
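
And a final helper, close_dl, to unload the library when the program is done with it:

    static void close_dl(void *handle) {
    #if defined(_WIN32) || defined(_WIN64)
        FreeLibrary((HMODULE) handle);
    #else
        dlclose(handle);
    #endif
    }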

Step 4. Load the picoLLM Shared Library at Runtime
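
With the helpers in place, opening the library is a single call. PICOLLM_LIBRARY_PATH is a placeholder for wherever you put the platform-specific file in Step 1; it matches the macro used in the complete example at the end of this tutorial:

    void *dl = open_dl(PICOLLM_LIBRARY_PATH);
    if (dl == NULL) {
        fprintf(stderr, "failed to load the picoLLM library\n");
        exit(EXIT_FAILURE);
    }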

Step 5. Load Required picoLLM API Functions

Before calling any picoLLM functions, you must load them from the shared library:
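
A sketch of the pattern, using the init, delete, and interrupt symbols as examples. The pointer types below are our transcription of the prototypes in pv_picollm.h; verify them against the header shipped with your SDK, and declare the pointer type for pv_picollm_generate the same way from its (longer) prototype:

    /* function-pointer types mirroring the prototypes in pv_picollm.h */
    typedef pv_status_t (*pv_picollm_init_func)(
            const char *access_key,
            const char *model_path,
            const char *device,
            pv_picollm_t **object);
    typedef void (*pv_picollm_delete_func)(pv_picollm_t *object);
    typedef pv_status_t (*pv_picollm_interrupt_func)(pv_picollm_t *object);

    /* resolve the symbols from the handle returned by open_dl() */
    pv_picollm_init_func pv_picollm_init_fn =
            (pv_picollm_init_func) load_symbol(dl, "pv_picollm_init");
    pv_picollm_delete_func pv_picollm_delete_fn =
            (pv_picollm_delete_func) load_symbol(dl, "pv_picollm_delete");
    pv_picollm_interrupt_func pv_picollm_interrupt_fn =
            (pv_picollm_interrupt_func) load_symbol(dl, "pv_picollm_interrupt");

    if ((pv_picollm_init_fn == NULL) ||
        (pv_picollm_delete_fn == NULL) ||
        (pv_picollm_interrupt_fn == NULL)) {
        fprintf(stderr, "failed to load picoLLM symbols\n");
        exit(EXIT_FAILURE);
    }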

We'll explain each function in detail as we use it in the text generation workflow.

Part 3. Implement Local LLM Inference in C

Now that we've set up dynamic loading, we can use the picoLLM API to run language model inference locally on your machine without any cloud dependencies.

Step 1. Initialize the Local LLM Engine

Initialize the picoLLM engine with your model file and access credentials:

  1. Copy your AccessKey from Picovoice Console
  2. Replace ${ACCESS_KEY} with your actual AccessKey
  3. Update model_path to point to your downloaded picoLLM model file (.pllm)

Call pv_picollm_init to create a picoLLM instance:
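
A minimal sketch of the call, using the function pointer loaded in Part 2. PV_ACCESS_KEY and PICOLLM_MODEL_PATH are placeholders matching the macros in the complete example at the end of this tutorial; PV_STATUS_SUCCESS comes from the shared Picovoice header:

    pv_picollm_t *picollm = NULL;
    const pv_status_t init_status = pv_picollm_init_fn(
            PV_ACCESS_KEY,        /* AccessKey copied from Picovoice Console */
            PICOLLM_MODEL_PATH,   /* path to the downloaded .pllm model file */
            "best",               /* device: "best", "cpu", or "gpu" */
            &picollm);
    if (init_status != PV_STATUS_SUCCESS) {
        fprintf(stderr, "picoLLM initialization failed\n");
        exit(EXIT_FAILURE);
    }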

Explanation of initialization parameters:

  • access_key: Your Picovoice Console AccessKey for authentication
  • model_path: Filesystem path to your quantized picoLLM model file (.pllm format)
  • device_string: Inference device selection. Use "best" for automatic selection (GPU if available, otherwise CPU), "cpu" to force CPU-only inference, or "gpu" for GPU acceleration if your hardware supports it.

Step 2. Set Up Callback for Streaming Token Generation

Define a streaming callback function to receive generated tokens in real-time as the LLM produces them, enabling a streaming ChatGPT-style user experience:
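
A minimal callback that prints each piece of generated text as it arrives. The (const char *, void *) shape follows our reading of the streaming-callback parameter in pv_picollm.h; confirm it against your header version:

    static void stream_callback(const char *completion, void *context) {
        (void) context;        /* unused in this sketch */
        fprintf(stdout, "%s", completion);
        fflush(stdout);        /* flush so tokens appear immediately */
    }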

Step 3. Get User Prompt Input for LLM Inference

Read the user's prompt from standard input:
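
One way to do it with a fixed-size buffer:

    char prompt[1024];
    fprintf(stdout, "Prompt: ");
    fflush(stdout);
    if (fgets(prompt, sizeof(prompt), stdin) == NULL) {
        fprintf(stderr, "failed to read prompt\n");
        exit(EXIT_FAILURE);
    }
    prompt[strcspn(prompt, "\r\n")] = '\0';  /* drop the trailing newline */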

This example uses a fixed buffer for simplicity. In production applications, consider dynamic memory allocation to handle prompts of any length.

Step 4. Configure Text Generation Parameters

Configure the following parameters to control the LLM's text generation behavior, creativity, and output characteristics:
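
Illustrative values, using the parameter names from the picoLLM C API; tune them for your model and use case:

    const int32_t completion_token_limit = 256;  /* cap on generated tokens */
    const float temperature = 0.7f;              /* sampling randomness */
    const float top_p = 0.9f;                    /* nucleus-sampling cutoff */
    const float presence_penalty = 0.0f;         /* penalize already-used tokens */
    const float frequency_penalty = 0.0f;        /* penalize repeated tokens */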

Refer to the official picoLLM C API documentation for complete details.

Step 5. Generate Text with Streaming LLM Inference

Call the generation function to run local LLM inference with real-time streaming output:
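
A sketch of the call, building on the parameters and callback defined above. The argument list below is our transcription of the pv_picollm_generate prototype; copy the exact prototype from pv_picollm.h (and declare pv_picollm_generate_fn from it, as in Part 2, Step 5) rather than relying on the ordering shown here:

    /* outputs produced by the generate call */
    pv_picollm_usage_t usage;
    pv_picollm_endpoint_t endpoint;
    pv_picollm_completion_token_t *completion_tokens = NULL;
    int32_t num_completion_tokens = 0;
    char *completion = NULL;

    const pv_status_t gen_status = pv_picollm_generate_fn(
            picollm,
            prompt,
            completion_token_limit,
            NULL,                     /* stop_phrases: none */
            0,                        /* num_stop_phrases */
            -1,                       /* seed: see pv_picollm.h for the sentinel meaning */
            presence_penalty,
            frequency_penalty,
            temperature,
            top_p,
            0,                        /* num_top_choices: no per-token alternatives */
            stream_callback,          /* streaming callback from Step 2 */
            NULL,                     /* stream_callback_context */
            &usage,
            &endpoint,
            &completion_tokens,
            &num_completion_tokens,
            &completion);
    if (gen_status != PV_STATUS_SUCCESS) {
        fprintf(stderr, "picoLLM generation failed\n");
        exit(EXIT_FAILURE);
    }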

Explanation of the generation function:

  • pv_picollm_generate: Main API function for running LLM text generation and inference
  • Returns detailed usage statistics (token counts), endpoint information, and the complete generated text string
  • Streaming happens in real-time through the callback function during generation
  • completion_tokens provides detailed metadata about each generated token

Step 6. Handle User Interruptions During Generation

Allow users to gracefully interrupt long-running LLM inference with Ctrl-C:
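
A sketch of the handler. It assumes the engine handle and the interrupt function pointer are kept in globals (rather than the locals used in earlier sketches) so the handler can reach them:

    /* globals populated during initialization and symbol loading */
    static pv_picollm_t *picollm = NULL;
    static pv_picollm_interrupt_func pv_picollm_interrupt_fn = NULL;

    static void interrupt_handler(int signum) {
        (void) signum;
        if ((picollm != NULL) && (pv_picollm_interrupt_fn != NULL)) {
            pv_picollm_interrupt_fn(picollm);  /* stop generation gracefully */
        }
    }

    static void setup_interrupt_handling(void) {
        /* call this before starting generation */
        signal(SIGINT, interrupt_handler);
    }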

Explanation of interrupt handling:

  • pv_picollm_interrupt: Gracefully stops text generation without corrupting state
  • The signal handler catches Ctrl-C (SIGINT) and calls the interrupt function

Step 7. Cleanup and Free Memory Resources

When finished with LLM inference, free all allocated resources:
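
A sketch of the teardown, releasing resources in reverse order of acquisition. The generated completion string and token metadata are owned by the library; release them with the dedicated delete function declared for them in pv_picollm.h (omitted here), then destroy the engine and unload the shared library:

    pv_picollm_delete_fn(picollm);   /* destroy the picoLLM engine instance */
    picollm = NULL;
    close_dl(dl);                    /* unload the shared library */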


Complete Example: Local LLM Inference in C

Here is the complete picollm_tutorial.c implementation you can copy, compile, and run on Windows, Linux, macOS, and Raspberry Pi:

picollm_tutorial.c

Before compiling and running:

  • Replace PV_ACCESS_KEY with your AccessKey from Picovoice Console
  • Update PICOLLM_MODEL_PATH to point to your downloaded quantized model file (.pllm)
  • Update PICOLLM_LIBRARY_PATH to point to the correct platform-specific library (.so for Linux, .dylib for macOS, .dll for Windows)

This is a simplified example but contains all the necessary pieces to get you started. Check out the picoLLM C demo on GitHub for a more complete demo application.

Build & Run

Build and run the application:

Linux and Raspberry Pi

macOS

Windows


Troubleshooting Common Issues

Library Loading Fails

Problem: The dynamic loader cannot find or open the shared library file.

Solution:

  • Verify the library file matches your platform and architecture: .so (Linux), .dylib (macOS), .dll (Windows)
  • Check that the library file exists at the specified path

Slow Generation Speed or High Memory Usage

Problem: Text generation is extremely slow or the program uses excessive memory.

Solution:

  • Try a smaller quantized model
  • Set device_string to "cpu" if GPU acceleration is causing issues
  • Reduce completion_token_limit to limit generation length
  • Close other applications to free up RAM
  • Ensure you are freeing memory resources properly

Generation Produces Gibberish or Repeated Text

Problem: The LLM output is incoherent, repetitive, or nonsensical.

Solution:

  • Lower the temperature parameter for more focused output
  • Increase frequency_penalty to reduce repetition
  • Adjust top_p to a lower value for more deterministic output
  • Verify that your prompt is well-formatted and clear
  • Try a different or larger model if output quality remains poor

Next Steps: Building a Complete Voice AI Assistant

With local LLM inference in place, you can build a complete voice-driven assistant by adding speech input, speech output, wake word detection, and optional MCP tool invocation.

  • Wake word detection (Porcupine): Acts as the entry point to the assistant. It continuously monitors audio and activates the pipeline only when a predefined wake phrase is detected, keeping the system idle otherwise.

  • Speech-to-text (Cheetah Streaming STT): Handles user input after activation by converting spoken queries into text that can be passed directly to the LLM for interpretation and response generation.

  • LLM reasoning with MCP integration (optional): The local LLM serves as the decision and reasoning layer of the assistant. It generates natural-language responses and, when needed, invokes external tools or system actions through the Model Context Protocol (MCP).

  • Text-to-speech (Orca Streaming TTS): Converts generated text into audible speech, allowing responses to be spoken aloud back to the user in real time.

Start Building

Frequently Asked Questions

Can I run local LLM inference without an internet connection?

Once you've downloaded the model file and obtained your AccessKey, picoLLM runs inference completely offline. The AccessKey validation happens during initialization and can work offline after initial verification. All inference runs locally on your device without any external API calls.

Which quantized models work best for embedded systems and Raspberry Pi?

Smaller models like Phi-2, Gemma-2B, or Llama-3.2-1B work well on devices with 4-8GB RAM. These quantized models are optimized for resource-constrained environments while maintaining reasonable output quality.

How does temperature affect LLM text generation quality?

Temperature controls randomness in token selection. Low temperature (0.1-0.3) makes output deterministic and focused, ideal for factual tasks, code generation, or structured data extraction. High temperature (0.7-1.0) increases creativity and diversity, better for creative writing or brainstorming.

How do I run an LLM locally in C?

To run an LLM in C, you need a quantized model file and an inference engine with C bindings; hardware acceleration is optional but helps. Start by downloading a quantized model; these compressed versions reduce memory usage and improve inference speed on local hardware. Next, choose an inference engine that provides native C bindings. Popular options include llama.cpp, which offers a C-compatible API, and picoLLM, which provides a native C SDK designed for cross-platform deployment on Windows, macOS, Linux, and Raspberry Pi.

The basic workflow involves initializing the inference engine, loading your model into memory, configuring generation parameters like temperature and max tokens, and calling the inference function with your prompt. Most engines support streaming callbacks for token-by-token output. For production applications, you'll also need to handle memory management, error checking, threading, and hardware acceleration. The picoLLM C SDK includes built-in support for these requirements, with comprehensive documentation for local LLM inference in C.

Can I use picoLLM with GPU acceleration for faster inference?

Yes, picoLLM supports GPU acceleration on compatible hardware. Set device_string to "gpu" or "best" (automatic selection). GPU acceleration significantly speeds up inference, especially for larger models.