The Impossible Possible: Large Language Models in AWS Lambda

🚀 Cross-platform Local LLM

Build AI applications that run serverless, without sending user data to third-party remote servers.

Running LLMs in a serverless environment (such as AWS Lambda) is a great alternative to using a dedicated server. Why? Well, there are numerous reasons why you'd want to run an application serverless. To name a few:

No need to perform ops and server management.
Serverless functions can easily scale.
Code is typically simpler.
Updates are easier and faster to deploy.

Of course, running an LLM on a serverless platform isn't feasible yet... or is it?

Try the demo

Before we start, go ahead and try it yourself!

Head on over to the serverless-picollm GitHub Repo for instructions on how to set up, deploy, and run the demo.

Why Was It Harder Before?

So why isn't it feasible to run LLMs in AWS Lambda?

Well, first off, size. It's right there in the name: Large Language Models. For example, the full size of the Llama3 model is around 40GB! Regardless of any technical limitations, AWS Lambda limits the total uncompressed code size to just 250MB, way less than any LLM out there right now. As we will discuss later in the article, using containers raises the code package size limit to 10GB. However, that is still too small for a lot of LLMs.
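
To make these limits concrete, here is a minimal sketch that checks which Lambda packaging option a model of a given size could fit into. The helper name `packaging_option` is hypothetical, the limits are the 250MB and 10GB figures quoted above (treated as decimal units for simplicity), and a real deployment also has to leave room for the runtime and dependencies.

```python
# Lambda package limits quoted in this article (decimal units for simplicity).
ZIP_LIMIT_BYTES = 250 * 10**6        # uncompressed code package
CONTAINER_LIMIT_BYTES = 10 * 10**9   # container image


def packaging_option(model_size_bytes: int) -> str:
    """Return the smallest Lambda packaging option the model fits into."""
    if model_size_bytes <= ZIP_LIMIT_BYTES:
        return "zip"
    if model_size_bytes <= CONTAINER_LIMIT_BYTES:
        return "container"
    return "too large"


if __name__ == "__main__":
    # Illustrative sizes only: the 40GB figure from the text and a smaller model.
    for name, size in [("40GB model", 40 * 10**9), ("2GB model", 2 * 10**9)]:
        print(f"{name}: {packaging_option(size)}")
```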

The second obstacle is performance. Most LLMs that run in the cloud need powerful dedicated GPUs and large amounts of memory. Although AWS Lambda has been expanding its capabilities lately, the concept behind serverless is still small, simple, and efficient functions that don't need a lot of resources to run.

Enter picoLLM

Funnily enough, the solution to both those problems is to just make the models smaller! It's not quite that simple though. There are right ways to compress models, and wrong ways to compress models. Look at this graph comparing picoLLM Compression versus GPTQ on the Llama-3-8b model:

You can see that as we continue to compress (quantize) the model, picoLLM Compression maintains similar levels of performance rather than degrading, unlike GPTQ.

If you want more detail on how picoLLM Compression works, you can read the full article on the topic of "Towards Optimal LLM Quantization".

How does this help us? By shrinking the model, we can now fit it into our Lambda function, solving our first issue. Smaller models also mean less computation is needed, which allows us to run them with limited resources, solving our second issue as well!
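
As a rough back-of-the-envelope check, weight storage scales with the number of parameters times the bits used per parameter. The sketch below is only that arithmetic: the function name is hypothetical, the 8 billion parameter count is assumed from the "8b" in the Llama-3-8b name, and it ignores file metadata and runtime memory, so real picoLLM model files will differ.

```python
def approx_weight_size_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10**9 bytes): params * bits / 8."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params = 8e9  # nominal parameter count, assumed from the "8b" in the name
    for bits in (16, 8, 4, 2):
        print(f"{bits:>2}-bit: ~{approx_weight_size_gb(params, bits):.0f} GB")
```

Under these assumptions, going from 16 bits to 4 bits per parameter cuts the weights to a quarter of their size, which is what moves a model from "too big for any Lambda package" to "fits in a container image".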

It's Not All Sunshine☀️ and Roses🌹

Of course it can't always be that easy. There are still a few technical and design hurdles we have to get over in order to create a practical application.

Deploying the Model to AWS Lambda

As mentioned at the start of this article, AWS Lambda limits the total uncompressed code size to just 250MB. Fortunately, we can bypass this limitation by basing our Lambda function on a Docker container image. In the past this was a lot of additional work, but luckily modern versions of AWS SAM now handle setting up the ECR container repository, building the Docker image, pushing the Docker image, and managing the tags.

You can see that even our Dockerfile is relatively simple:

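# Start from the AWS-provided Lambda base image for Python 3.12
FROM public.ecr.aws/lambda/python:3.12

# Install the GNU OpenMP runtime library
RUN dnf install -y libgomp

# Install the Python dependencies
COPY src/requirements.txt .
RUN python3 -m pip install -r requirements.txt

# Copy everything in resources/ to the image root
COPY resources/* /

# Copy the function code into the Lambda task directory
COPY src/app.py ${LAMBDA_TASK_ROOT}

# Tell the Lambda runtime which handler to invoke
CMD ["app.handler"]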

Cold Starts and Initialization

A "cold start" is the initial delay experienced when an AWS Lambda function is invoked for the first time or after being idle. Usually cold starts for Lambda are measured in the hundreds of milliseconds. In our case, if we aren't careful, it can be in the tens of seconds. This can result in a poor user experience.

There are two issues we need to address. The first is the large size of our container.

Since AWS Lambda has to download this container from ECR, the larger the image, the worse the potential impact on the Lambda start. The good news is that AWS Lambda will cache the image layers, so this is only an issue for the first call after an initial deployment or for very cold lambdas. Here, size really matters. As you can see in this graph, larger models take significantly longer to sync from ECR:

The size of the model increases the cold start time (blue) exponentially. Warm start times (orange) scale more linearly. The smaller you can get your model, the quicker you can get your lambda function up and rolling.

Because each Lambda invocation could run on a different server with varying hardware, these numbers should be read as trends rather than taken verbatim.

The second issue is that LLM initialization is complex, so it can take a couple of seconds. We can minimize the impact of this by placing our initialized instance of picoLLM at the global level, caching it between subsequent calls to the warm lambda. Unfortunately, since the initialization time from a completely cold lambda is typically over the 10 seconds AWS Lambda allots to its initialization phase, we can't load the model during that phase. Instead, we load it just in time (JIT) when we handle our message:

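import picollm

# Cached at module level so warm invocations reuse the loaded model.
pllm = None


def load_picollm(connection_id, apigw_client):
    global pllm

    # Only pay the initialization cost on the first message after a cold start.
    if pllm is None:
        # ...
        pllm = picollm.create(
            access_key=ACCESS_KEY,
            model_path=model_path,
            device='cpu:5')  # run on the CPU with 5 threads


def handle_message(prompt, connection_id, apigw_client):
    try:
        load_picollm(connection_id, apigw_client)
    except Exception as e:
        # ...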
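
The caching pattern itself is independent of picoLLM. The sketch below shows it in isolation, with a hypothetical `_expensive_load` function standing in for the model initialization and a counter that exists only to show how often the slow path runs. It is an illustration of the idea under those assumptions, not the demo's actual code.

```python
import time

_model = None    # cached at module level, survives across warm invocations
_load_count = 0  # only here to show how often the expensive load runs


def _expensive_load():
    """Hypothetical stand-in for the slow model initialization."""
    global _load_count
    _load_count += 1
    time.sleep(0.01)  # pretend this takes several seconds
    return {"ready": True}


def get_model():
    """Load the model on first use, then reuse the cached instance."""
    global _model
    if _model is None:
        _model = _expensive_load()
    return _model


def handler(event, context):
    model = get_model()  # slow on the first call only
    return {"ready": model["ready"], "loads": _load_count}


if __name__ == "__main__":
    print(handler({}, None))  # first call triggers the load
    print(handler({}, None))  # second call reuses the cached model
```

Because the module-level variable lives as long as the execution environment does, every invocation that lands on the same warm lambda skips the load.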

Another way to improve our cold starts is to actually minimize the number of cold starts. To do this, we use provisioned concurrency. With provisioned concurrency configured, Lambda keeps the requested number of execution environments initialized, so once a lambda is warm it stays warm rather than being reclaimed after sitting idle.
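
For reference, provisioned concurrency can also be configured programmatically. The sketch below wraps the boto3 Lambda client's `put_provisioned_concurrency_config` call. The wrapper name, the function name `serverless-picollm`, and the alias `live` are placeholders rather than values from the demo, and the recording client is a stand-in so the example runs without AWS access. Note that provisioned concurrency is set on a published version or alias, not on `$LATEST`.

```python
def enable_provisioned_concurrency(lambda_client, function_name, qualifier, count):
    """Ask Lambda to keep `count` execution environments initialized.

    `lambda_client` is assumed to be a boto3 Lambda client, i.e.
    boto3.client("lambda"). `qualifier` must be a published version or alias.
    """
    return lambda_client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=qualifier,
        ProvisionedConcurrentExecutions=count,
    )


class RecordingLambdaClient:
    """Stand-in for the boto3 client so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def put_provisioned_concurrency_config(self, **kwargs):
        self.calls.append(kwargs)
        return dict(kwargs)


if __name__ == "__main__":
    client = RecordingLambdaClient()
    # "serverless-picollm" and "live" are placeholder function and alias names.
    enable_provisioned_concurrency(client, "serverless-picollm", "live", 1)
    print(client.calls)
```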

Streaming the Responses

Another potential user experience issue we need to address is streaming the responses. If a user were forced to wait for the model to generate a whole completion, it could be tens of seconds before they received a response. Unless we use some other method, we can only return from the lambda once. We can use WebSockets to sort this issue out.

Using WebSockets allows us to build an asynchronous application. We can send back each completion callback as a separate message, allowing the client to see the response from picoLLM in real time:

def stream_callback(token: str):
    # ...
    send_message({"action": "completion", "msg": token}, connection_id, apigw_client)

# ...
try:
    res = pllm.generate(
        prompt,
        completion_token_limit=256,
        presence_penalty=3,
        frequency_penalty=0,
        temperature=0.7,
        top_p=0.6,
        stream_callback=stream_callback)
except Exception as e:
    # ...
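
The `send_message` helper is not shown above. One plausible shape for it, assuming `apigw_client` is a boto3 `apigatewaymanagementapi` client created with the WebSocket API's endpoint URL, is a thin wrapper around `post_to_connection`. The sketch below is an assumption about how such a helper could look, not the demo's actual implementation, and uses a fake client so it runs without AWS access.

```python
import json


def send_message(payload, connection_id, apigw_client):
    """Serialize `payload` as JSON and push it to one WebSocket connection."""
    apigw_client.post_to_connection(
        ConnectionId=connection_id,
        Data=json.dumps(payload).encode("utf-8"),
    )


class FakeApiGatewayClient:
    """Stand-in that records messages instead of calling AWS."""

    def __init__(self):
        self.sent = []

    def post_to_connection(self, ConnectionId, Data):
        self.sent.append((ConnectionId, json.loads(Data)))


if __name__ == "__main__":
    client = FakeApiGatewayClient()
    # Each generated token goes out as its own message, as in stream_callback.
    for token in ["Hello", ",", " world"]:
        send_message({"action": "completion", "msg": token}, "conn-1", client)
    # The client side simply concatenates the tokens as they arrive.
    print("".join(message["msg"] for _, message in client.sent))
```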

Conclusion

picoLLM Compression makes LLMs much smaller without losing performance, opening up all sorts of new deployment opportunities. One such opportunity is AWS Lambda, as demonstrated in this article.

There were some difficulties to overcome, but through careful application design we were able to mitigate or bypass all of them.

Still, there are a few features that AWS could add to Lambda that would improve things a lot:

Allow users to increase the initialization timeout: This is currently set to 10s, which is a bit too short for our use case. If we could increase this, we could initialize picoLLM as part of the Lambda initialization, eliminating the initial delay of the JIT initialization.
Allow users to set CPU cores independent of memory: Currently we need to set our memory to the max allowed value of 10GB in order to get the most CPU performance. However, these models are small, and therefore don't need all that excess memory.

What's Next?

Try it on AWS Lambda yourself

If you haven't already, try the companion example demo found in the serverless-picollm GitHub Repo.

Start Building

Picovoice is founded and operated by engineers. We love developers who are not afraid to get their hands dirty and are itching to build. picoLLM is 💯 free for open-weight models. We promise never to make you talk to a salesperson or ask for a credit card. We currently support the Gemma 👧, Llama 🦙, Mistral ⛵, Mixtral 🍸, and Phi φ families of LLMs, and support for many more is underway.

o = picollm.create(
    access_key,
    model_path)

res = o.generate(prompt)