
I'm deploying a serverless NLP app built with BERT. I'm currently using the Serverless Framework and AWS ECR to get around the AWS Lambda deployment package limit of 250 MB (PyTorch alone occupies more than that).

I'm quite happy with this solution, as it allows me to simply dockerize my app, push it to ECR, and worry about nothing else.

One doubt I have is where I should store the models. My app uses 3 different saved models, each 422 MB in size. I have two options:

  1. Copy my models into the Docker image itself.

    • Pros: If I retrain a model, it is updated automatically when I redeploy the app, and I don't have to use the AWS SDK to load objects from S3
    • Cons: Docker image size is very large
  2. Store my models in S3:

    • Pros: The image size is smaller than with the other solution (1+ GB vs 3+ GB)
    • Cons: If I retrain my models I then need to update them on S3 manually, as they are decoupled from the app deployment pipeline. I also need to load them from S3 using the AWS SDK (probably adding some overhead?); a rough sketch of both approaches follows after this list.
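
For context, here is a minimal sketch of what the handler-side difference between the two options looks like. The bucket name, file names, and paths are made up for illustration, and the loading call depends on how the models were actually saved (torch.save is assumed here; a transformers checkpoint would use from_pretrained instead):

```python
import io
import os

import boto3
import torch  # assumes the models were saved with torch.save; adjust to your setup

# Hypothetical names and paths, for illustration only.
MODEL_DIR = "/opt/models"        # option 1: directory baked into the Docker image
S3_BUCKET = "my-model-bucket"    # option 2: bucket holding the saved models
s3 = boto3.client("s3")
_cache = {}                      # loaded once per container, reused on warm invocations


def load_from_image(filename):
    # Option 1: the model file already sits on disk inside the image.
    if filename not in _cache:
        _cache[filename] = torch.load(os.path.join(MODEL_DIR, filename), map_location="cpu")
    return _cache[filename]


def load_from_s3(filename):
    # Option 2: stream the object from S3 on cold start (the AWS SDK overhead
    # mentioned above); warm containers reuse the cached model.
    if filename not in _cache:
        body = s3.get_object(Bucket=S3_BUCKET, Key=filename)["Body"].read()
        _cache[filename] = torch.load(io.BytesIO(body), map_location="cpu")
    return _cache[filename]
```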

So my question ultimately is: of the two solutions, which is the best practice, and why? Is there even a best practice at all, or does it come down to preference / need?

wtfzambo

1 Answer


There is a third option that might be great for you: store your models on an EFS volume.

EFS volumes are like additional hard drives that you can attach to your Lambda. They can be pretty much as big as you want.

After you have trained your model, just copy it to your EFS volume. You configure your Lambda to mount that EFS volume when it boots and voilà, your model is available without any fuss. No copying from S3 or baking it into a Docker image. And the same EFS volume can be mounted by more than one Lambda at the same time.
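
For illustration, here is a minimal sketch of the Lambda side, assuming the volume is exposed through an EFS access point mounted at /mnt/models (the mount path and file name are placeholders; note that the Lambda has to be attached to a VPC to mount EFS):

```python
import os

import torch  # placeholder; use whatever loading call matches how the models were saved

EFS_MOUNT = "/mnt/models"   # must match the local mount path configured on the Lambda
_cache = {}                 # loaded once per container, reused across warm invocations


def get_model(filename):
    # Read the model straight from the mounted EFS volume on first use.
    if filename not in _cache:
        _cache[filename] = torch.load(os.path.join(EFS_MOUNT, filename), map_location="cpu")
    return _cache[filename]


def handler(event, context):
    model = get_model("classifier_a.pt")  # hypothetical file name on the EFS volume
    # ... run inference with `model` on the event payload here ...
    return {"statusCode": 200}
```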

Update 25.08.2021

User @wtfzambo tried this solution and came across a limitation that might be of interest to others:

I did indeed try the solution you suggested. It works well, but only to a point, and I'm referring to performance. In my situation, I need to be able to spin up ~100 lambdas concurrently when I do batch classification, to speed up the process. The problem is that EFS throughput cap is not PER connection, but in total. So the 300MB/s of burst throughput that I was allowed seemed to be shared by each lambda instance, which at that point timed out even before being able to load the models into memory.
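
A rough back-of-the-envelope check with the numbers above (assuming each of the ~100 instances reads all three ~422 MB models; real behavior also depends on burst credits and baseline throughput, so treat this as an order-of-magnitude estimate):

```python
models_per_lambda_gb = 3 * 0.422   # three saved models of ~422 MB each
concurrent_lambdas = 100
shared_burst_mb_s = 300            # the burst cap is shared across all connections

total_gb = models_per_lambda_gb * concurrent_lambdas    # ~127 GB in total
seconds = total_gb * 1000 / shared_burst_mb_s           # ~420 s even at full burst
print(f"~{seconds:.0f} s just to read the models off EFS")
```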

Keep this in mind when you choose this option.

Jens
  • I was aware of EFS which would have been my alternative had I not discovered I could use ECR. I'm however not sure if the benefits of adding another element to the stack outweigh the cons of using ECR. Is there any other advantage I'm unaware of ? – wtfzambo Jul 04 '21 at 22:47
  • @wtfzambo That's a tough question to answer. How well a solution works for you often depends on your constraints and your (team's) skills. I think both options (Docker and EFS) are totally valid. My gut tells me that a "pure" Lambda with EFS is easier to manage, and it might also be easier to use other services like X-Ray etc. for monitoring. I also would think that cold boot should be faster with "native" Lambdas (but I have not measured yet). But whether those are really good reasons not to use Docker ... I am not sure. I'd stay with your current solution and try EFS only if you feel "pain". – Jens Jul 05 '21 at 07:41
  • Thanks, I completely understand. I will address cold booting with a warmup function, because loading 1.5 GB of models into memory takes some time, and since this model needs to go on our website for users to play with, I certainly cannot expect them to wait more than 5 seconds for it to return some results. – wtfzambo Jul 05 '21 at 09:47
  • @wtfzambo You should write an answer to your own question, laying out your thought process and what you ended up doing. You can even accept that answer. This will help others with their decision. – Jens Jul 05 '21 at 09:49
  • Ok, as soon as this thing is completed I will for sure! – wtfzambo Jul 05 '21 at 09:50
  • A small update regarding EFS. I did indeed try the solution you suggested. It works well, but only to a point, and I'm referring to performance. In my situation, I need to be able to spin up ~100 lambdas concurrently when I do batch classification, to speed up the process. The problem is that EFS throughput cap is not PER connection, but in total. So the 300MB/s of burst throughput that I was allowed seemed to be shared by each lambda instance, which at that point timed out even before being able to load the models into memory. – wtfzambo Aug 25 '21 at 10:04
  • @wtfzambo thank you for sharing this insight. Much appreciated. I will update my answer to incorporate this information. – Jens Aug 25 '21 at 11:25
  • @wtfzambo thanks for providing the details of your experiments. Though this post is a bit old, it's still relevant. Have you tried going at it from the other side and optimising the model itself (quantization, etc.)? Thanks – Nikita Nov 20 '22 at 23:35
  • Hey Nikita, thanks for the feedback. No, I have not tried, mostly because I'm not a data scientist but a DE dabbling in ML-Ops from time to time. – wtfzambo Nov 21 '22 at 14:27