Running LLMs locally within web browsers opens up many use cases, such as summarization, proofreading, text generation, and question answering. However, local LLM inference within web browsers is fraught with challenges. LLMs are large, so downloading them to the client is slow and bandwidth-intensive. Moreover, the substantial RAM requirements of these LLMs pose a significant hurdle in the tightly controlled, sandboxed environment of a web browser. Lastly, LLMs require extensive computing resources to deliver a reasonable speed.

The current remedy is to compress LLMs using an LLM quantization algorithm such as GPTQ to alleviate their bandwidth, storage, and RAM requirements, and then use WebGPU to run them locally. The WebLLM project takes this approach. However, it has two shortcomings: (1) current quantization algorithms incur accuracy loss, and (2) WebGPU is not available across all major browsers.

The deep learning team within Picovoice recently developed picoLLM Compression, a new LLM quantization technique that significantly outperforms state-of-the-art (SOTA) algorithms. picoLLM Compression effectively addresses the first issue above, as it can deeply compress models with minimal accuracy loss. What follows describes our work on the web backend of the accompanying inference engine, picoLLM Inference Engine, which makes it the first-ever cross-browser local LLM inference engine.

picoLLM Inference Engine runs on Linux, macOS, Windows, Raspberry Pi, Android, iOS, Chrome, Safari, Edge, Firefox, and Opera.

Are you a deep learning researcher? Learn how picoLLM Compression deeply quantizes LLMs while minimizing loss by optimally allocating bits across and within weights.

Cross-Browser Compatible Local LLM

The leading project that supports in-browser LLM inference is WebLLM, which relies on WebGPU. The table below compares browser support for WebGPU (and hence WebLLM) with that of picoLLM Inference Engine.

Web Browser    WebGPU [WebLLM]    picoLLM Inference
Chrome         Partial            Supported
Safari         Not supported      Supported
Edge           Partial            Supported
Firefox        Not supported      Supported
Opera          Partial            Supported

In the table above, if a browser requires the end user to enable an experimental feature, we mark it as Not supported, since it is not reasonable to ship a public-facing enterprise web application that expects end users to act like web developers. If a browser supports a required feature on only some operating systems but not all of them, we mark it as Partial.
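Availability can also only be confirmed at runtime. As a minimal sketch (in TypeScript, and not how picoLLM selects a backend), the check below probes whether WebGPU is exposed and whether a GPU adapter can actually be obtained; it assumes WebGPU type definitions (e.g., the @webgpu/types package) are installed.

// Runtime capability check: is WebGPU actually usable in this browser?
// Assumes WebGPU type definitions (e.g., the `@webgpu/types` package) are available.
async function hasUsableWebGPU(): Promise<boolean> {
  if (!('gpu' in navigator)) {
    return false; // the browser does not expose the WebGPU API at all
  }
  // Even when the API is exposed, no adapter may be available
  // (e.g., unsupported OS or GPU); `requestAdapter()` then resolves to `null`.
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}

hasUsableWebGPU().then((ok) => {
  console.log(ok ? 'WebGPU is available' : 'No WebGPU; fall back to a WebAssembly path');
});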

Fast LLM Inference across Browsers

picoLLM Inference Engine is not only cross-browser compatible but also speedy. The figure below shows the speed of the Microsoft Phi-2 model running on a MacBook Air M1 using picoLLM Inference Engine.

picoLLM Inference Speed across Browsers

Rocket Science under the Hood

picoLLM Inference Engine is available to front-end developers via a simple JavaScript SDK. But getting there was a journey! Our first version was a straight-up compilation of picoLLM Inference Engine's C code to WebAssembly, and the result left much to be desired: it generated only 0.6 tokens per second on a MacBook Air M1.

To put things into perspective, 4-5 tokens per second is roughly where an LLM becomes useful to an end user, because that is about the speed at which we read: a typical reading speed of 200-250 words per minute works out to roughly 4-5 tokens per second, since a token is about three-quarters of a word.

We first focused our efforts on code optimization, starting with WebAssembly SIMD, which maps to the native (x86 or Arm) SIMD instructions. SIMD stands for Single Instruction, Multiple Data: a single CPU instruction performs one operation (e.g., addition) on multiple data elements (e.g., an array of floating-point numbers) at once. WebAssembly SIMD sped up the application to 4.1 tokens per second, just reaching the lower bound of the usability threshold.
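One practical consequence is that a cross-browser deployment cannot assume WebAssembly SIMD is present; older browser versions ship WebAssembly without it. The TypeScript sketch below (not picoLLM's actual loading code) uses the open-source wasm-feature-detect package to pick between a SIMD build and a scalar fallback at runtime; the file names and the empty import object are placeholders.

import { simd } from 'wasm-feature-detect';

// Choose the right WebAssembly binary for this browser.
// `inference-simd.wasm` and `inference-scalar.wasm` are hypothetical file names:
// the former would be compiled with SIMD enabled (e.g., Emscripten's `-msimd128` flag),
// the latter is a plain scalar build that runs wherever WebAssembly does.
async function loadInferenceModule(): Promise<WebAssembly.Instance> {
  const simdSupported = await simd();
  const wasmUrl = simdSupported ? 'inference-simd.wasm' : 'inference-scalar.wasm';
  const { instance } = await WebAssembly.instantiateStreaming(fetch(wasmUrl), {});
  return instance;
}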

Our final challenge was parallel processing, which is more involved in a web environment because JavaScript is single-threaded. We overcame this by building our own threading mechanism on top of Web Workers, which boosted the inference engine to 11.5 tokens per second, on par with native code on the same machine. The figure below summarizes this progress, and a sketch of the worker fan-out pattern follows it.

picoLLM Uses WebAssembly SIMD and Web Worker Parallelism to Speed Up Inference
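The TypeScript sketch below is not picoLLM's actual threading code; it illustrates the general fan-out pattern. A matrix-vector product, the core operation of LLM inference, is split row-wise across Web Workers, and each worker returns its slice of the output vector.

// Split a matrix-vector product across Web Workers (illustrative sketch only).

// Plain JavaScript executed inside each worker: multiply a slice of rows by the vector.
const workerSource = `
  onmessage = (e) => {
    const { rows, vector, numCols } = e.data;
    const out = new Float32Array(rows.length / numCols);
    for (let r = 0; r < out.length; r++) {
      let sum = 0;
      for (let c = 0; c < numCols; c++) {
        sum += rows[r * numCols + c] * vector[c];
      }
      out[r] = sum;
    }
    postMessage(out, [out.buffer]);
  };
`;
const workerUrl = URL.createObjectURL(
  new Blob([workerSource], { type: 'application/javascript' })
);

function runWorker(
  rows: Float32Array,
  vector: Float32Array,
  numCols: number
): Promise<Float32Array> {
  return new Promise((resolve) => {
    const worker = new Worker(workerUrl);
    worker.onmessage = (e: MessageEvent<Float32Array>) => {
      worker.terminate();
      resolve(e.data);
    };
    // Transfer the row slice (zero-copy); the small vector is structured-cloned.
    worker.postMessage({ rows, vector, numCols }, [rows.buffer]);
  });
}

// Partition the matrix row-wise and compute the slices in parallel.
async function parallelMatVec(
  matrix: Float32Array,
  numRows: number,
  numCols: number,
  vector: Float32Array,
  numWorkers: number = navigator.hardwareConcurrency || 4
): Promise<Float32Array> {
  const rowsPerWorker = Math.ceil(numRows / numWorkers);
  const tasks: Promise<Float32Array>[] = [];
  for (let start = 0; start < numRows; start += rowsPerWorker) {
    const end = Math.min(start + rowsPerWorker, numRows);
    // slice() copies, so each worker receives its own transferable buffer.
    tasks.push(runWorker(matrix.slice(start * numCols, end * numCols), vector, numCols));
  }
  const slices = await Promise.all(tasks);
  const result = new Float32Array(numRows);
  let offset = 0;
  for (const slice of slices) {
    result.set(slice, offset);
    offset += slice.length;
  }
  return result;
}

A production engine would keep a persistent pool of workers and reuse buffers instead of spawning and terminating workers per call, but the partition, fan-out, and merge structure stays the same.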

Are you a software engineer? Learn how picoLLM Inference Engine runs x-bit quantized Transformers on CPU and GPU across Linux, macOS, Windows, iOS, Android, Raspberry Pi, and Web [🧑‍💻].

Isn't WebGPU Faster than WebAssembly?

It depends. WebGPU is a high-level abstraction over whatever GPU is available, so its speed varies with the underlying hardware. For example, the figure below compares the speed of picoLLM and WebLLM (WebGPU) on a MacBook Air M1 (WebGPU on Apple Metal) and on a Windows workstation with a powerful NVIDIA RTX 4070 Ti Super, both running the Phi-2 model.

picoLLM vs WebLLM (WebGPU) LLM Inference Speed Comparison

Start Building

Picovoice is founded and operated by engineers. We love developers who are not afraid to get their hands dirty and are itching to build. picoLLM is 💯 free for open-weight models. We promise never to make you talk to a salesperson or ask for a credit card. We currently support the Gemma 👧, Llama 🦙, Mistral ⛵, Mixtral 🍸, and Phi φ families of LLMs, and support for many more is underway.

import picollm

# `access_key` comes from Picovoice Console; `model_path` points to a downloaded picoLLM model file.
pllm = picollm.create(
    access_key=access_key,
    model_path=model_path)
res = pllm.generate(prompt)