Running LLMs in a serverless environment (such as AWS Lambda) is a great alternative to running them on a dedicated server. Why? There are numerous reasons you'd want to run an application serverless. To name a few:

  • No need to perform ops and server management.
  • Serverless functions can easily scale.
  • Code is typically simpler.
  • Updates are easier and faster to deploy.

Of course, running an LLM on a serverless platform isn't feasible yet... or is it?

Try the demo

Before we start, go ahead and try it yourself!

Head on over to the serverless-picollm GitHub Repo for instructions on how to set up, deploy, and run the demo.

Demo in Action

Why Was It Harder Before?

So why isn't it feasible to run LLMs in AWS Lambda?

Well, first off: size. It's right there in the name: Large Language Models. For example, the full size of the Llama3 model is around 40GB! Regardless of any technical limitations, AWS Lambda limits the total uncompressed code size to just 250MB, far less than any LLM out there right now. As we will discuss later in the article, using containers we can increase the code package size limit to 10GB. However, that is still too small for many LLMs.

The second obstacle is performance. Most LLMs that run in the cloud need powerful dedicated GPUs and large amounts of memory. Although AWS Lambda has been expanding its capabilities lately, the concept behind serverless is still small, simple, efficient functions that don't need a lot of resources to run.

Enter picoLLM

Funnily enough, the solution to both of these problems is simply to make the models smaller! It's not quite that simple, though: there are right ways and wrong ways to compress models. Look at this graph comparing picoLLM Compression against GPTQ on the Llama-3-8b model:

MMLU comparison between picoLLM and GPTQ for Llama-3-8b

You can see that as we continue to compress (quantize) the model, picoLLM Compression maintains similar levels of performance rather than degrading, unlike GPTQ.

If you want more detail on how picoLLM Compression works, you can read the full article on the topic of "Towards Optimal LLM Quantization".

How does this help us? By shrinking the model, we can now fit the model on our lambda function, solving our first issue. Smaller models also mean less computation is needed, which allows us to run them with limited resources, solving our second issue as well!

It's Not All Sunshine☀️ and Roses🌹

Of course it can't always be that easy. There are still a few technical and design hurdles we have to get over in order to create a practical application.

Deploying the Model to AWS Lambda

As mentioned at the start of this article, AWS Lambda limits the total uncompressed code size to just 250MB. Fortunately, we can bypass this limitation by basing our Lambda on a Docker container image. In the past this took a lot of additional work, but modern versions of AWS SAM now handle setting up the ECR container repository, building the Docker image, pushing it, and managing the tags.

You can see that even our Dockerfile is relatively simple:
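A minimal sketch of what such a Dockerfile might look like; the base image tag, model filename, and handler module here are assumptions, not the demo's actual contents:

```dockerfile
# Official AWS Lambda Python base image (tag is an assumption)
FROM public.ecr.aws/lambda/python:3.12

# Install the picoLLM Python SDK
RUN pip install picollm

# Bake the compressed model and the handler into the image
# (filenames are hypothetical placeholders)
COPY model.pllm ${LAMBDA_TASK_ROOT}/
COPY handler.py ${LAMBDA_TASK_ROOT}/

# Lambda entry point: module "handler", function "handler"
CMD ["handler.handler"]
```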

Cold Starts and Initialization

A "cold start" is the initial delay experienced when an AWS Lambda function is invoked for the first time or after being idle. Cold starts for Lambda are usually measured in the hundreds of milliseconds; in our case, if we aren't careful, they can run into the tens of seconds, resulting in a poor user experience.

There are two issues we need to address. The first is the large size of our container.

Since AWS Lambda has to download this container image from ECR, the larger the image, the worse the potential impact on Lambda startup. The good news is that AWS Lambda caches the image layers, so this is only an issue for the first call after a deployment or for very cold Lambdas. Here, size really matters. As you can see in this graph, larger models take significantly longer to sync from ECR:

Model size versus ECR/Lambda cold start time

The size of the model increases the cold start time (blue) exponentially, while warm start times (orange) scale more linearly. The smaller you can get your model, the quicker you can get your Lambda function up and running.

Since each Lambda invocation could run on a different server with varying hardware, these numbers should be read as trends rather than taken verbatim.

The second issue is that LLM initialization is complex and can take a couple of seconds. We can minimize its impact by placing our initialized picoLLM instance at the global level, caching it between subsequent calls to the warm Lambda. Unfortunately, since initialization from a completely cold Lambda typically exceeds the 10 seconds AWS Lambda allots to the initialization phase, we have to do this just-in-time (JIT) when we handle our message:
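A sketch of the pattern, using the picoLLM Python SDK; the model filename, environment variable, and event shape are assumptions:

```python
import os

pllm = None  # cached at module (global) scope between warm invocations


def get_pllm():
    """Initialize picoLLM on first use and cache it for warm invocations."""
    global pllm
    if pllm is None:
        import picollm  # imported here: init happens JIT, inside the handler
        pllm = picollm.create(
            access_key=os.environ.get("PICOVOICE_ACCESS_KEY", ""),
            model_path="model.pllm",  # hypothetical model file baked into the image
        )
    return pllm


def handler(event, context):
    llm = get_pllm()  # expensive on the first call, a fast no-op afterwards
    res = llm.generate(event["prompt"])
    return {"statusCode": 200, "body": res.completion}
```

The key point is that `pllm` lives at module scope, so every invocation routed to the same warm instance reuses the already-initialized engine.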

Another way to improve our cold starts is to minimize the number of cold starts in the first place. To do this, we use provisioned concurrency. With provisioned concurrency set up, once a Lambda is warm it stays warm indefinitely.
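Provisioned concurrency can be enabled directly in the SAM template. A sketch, assuming a function with the hypothetical logical ID `PicoLLMFunction` (note that `ProvisionedConcurrencyConfig` requires an alias, which `AutoPublishAlias` creates):

```yaml
Resources:
  PicoLLMFunction:
    Type: AWS::Serverless::Function
    Properties:
      # Publishes a version and creates the "live" alias on each deploy
      AutoPublishAlias: live
      # Keep one instance initialized and ready to serve
      ProvisionedConcurrencyConfig:
        ProvisionedConcurrentExecutions: 1
```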

Streaming the Responses

Another potential user-experience issue we need to address is streaming the responses. If a user were forced to wait for the model to generate a whole completion, it could be tens of seconds before they receive a response. Without some other mechanism, we can only return from the Lambda once. We can use WebSockets to sort this issue out.

Using WebSockets allows us to build an asynchronous application. We can send back each completion callback as a separate message, allowing the client to see the response from picoLLM in real time:
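A sketch of pushing each token over an API Gateway WebSocket as it is generated. The `ApiGatewayManagementApi` client and `post_to_connection` call are real boto3 APIs; the message shape, endpoint handling, and wiring into picoLLM are assumptions:

```python
import json


def make_streamer(endpoint_url, connection_id):
    """Build a per-token callback that relays tokens to one WebSocket client."""
    import boto3  # imported here so the client is built per invocation

    client = boto3.client("apigatewaymanagementapi", endpoint_url=endpoint_url)

    def on_token(completion):
        # Forward each generated token to the client as its own message
        client.post_to_connection(
            ConnectionId=connection_id,
            Data=json.dumps({"completion": completion}).encode(),
        )

    return on_token
```

Inside the handler, the callback would then be passed to generation, e.g. `pllm.generate(prompt, stream_callback=make_streamer(url, conn_id))`, assuming picoLLM's `stream_callback` parameter.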


picoLLM Compression makes LLMs much smaller without sacrificing performance, opening up all sorts of new deployment opportunities. One such opportunity is AWS Lambda, as demonstrated in this article.

There were some difficulties to overcome, but through clever application design we were able to mitigate or bypass all of them.

Still, there are a few features that AWS could add to Lambda that would improve things a lot:

  • Allow users to increase the initialization timeout: This is currently fixed at 10s, which is a bit too short for our use case. If we could increase it, we could initialize picoLLM as part of the Lambda initialization, eliminating the initial delay of the JIT initialization.
  • Allow users to set CPU cores independently of memory: Currently we need to set our memory to the maximum allowed value of 10GB in order to get the most CPU performance. However, these models are small, and therefore don't need all that excess memory.

What's Next?

Try it on AWS Lambda yourself

If you haven't already, try the companion example demo found in the serverless-picollm GitHub Repo.

Start Building

Picovoice is founded and operated by engineers. We love developers who are not afraid to get their hands dirty and are itching to build. picoLLM is 💯 free for open-weight models. We promise never to make you talk to a salesperson or ask for a credit card. We currently support the Gemma 👧, Llama 🦙, Mistral ⛵, Mixtral 🍸, and Phi φ families of LLMs, and support for many more is underway.

```python
import picollm

# Replace the placeholders with your AccessKey (from Picovoice Console),
# the path to a downloaded picoLLM model file, and your prompt.
o = picollm.create(
    access_key='${ACCESS_KEY}',
    model_path='${MODEL_FILE_PATH}')

res = o.generate('${PROMPT}')
print(res.completion)
```