
I am using GitLab to host a Python machine-learning pipeline. The pipeline depends on the trained weights of a model, which I do not want to store in Git. The weights are stored in a remote data store that the pipeline automatically pulls from when running its job.
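Concretely, the pull step looks something like this (a minimal sketch; the URL, paths, and function name are placeholders for my actual setup):

```python
# Sketch of the automatic pull step: download the weights from remote
# storage only if no local copy exists yet. URL and paths are placeholders.
import urllib.request
from pathlib import Path

WEIGHTS_URL = "https://storage.example.com/models/weights-v2.bin"
WEIGHTS_PATH = Path("models/weights.bin")


def ensure_weights() -> Path:
    """Return the local weights path, downloading the file first if missing."""
    if not WEIGHTS_PATH.exists():
        WEIGHTS_PATH.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(WEIGHTS_URL, WEIGHTS_PATH)
    return WEIGHTS_PATH
```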

This works, but I run into a problem when trying to run end-to-end automatic CI tests with this setup. I do not want to download the model weights from the remote store every time my CI is triggered (since that can get expensive). In fact, I want to completely block internet access within all CI tests for security reasons (for example, by patching `socket` in my conftest.py, as sketched below).
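What I have in mind for conftest.py is roughly the following (a minimal sketch using pytest's built-in `monkeypatch`; the pytest-socket plugin provides the same behaviour out of the box):

```python
# conftest.py -- sketch: block every socket creation while tests run,
# so any accidental network access fails loudly.
import socket

import pytest


@pytest.fixture(autouse=True)
def _block_network(monkeypatch):
    def guard(*args, **kwargs):
        raise RuntimeError("Network access is blocked during CI tests")

    # Any attempt to open a socket (directly or via requests/urllib) raises.
    monkeypatch.setattr(socket, "socket", guard)
```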

If I do this, I am obviously not able to access the location where my model weights are stored. I know I can mock the model's output for testing, but I actually want to test whether the weights of the model are sensible. So mocking is out of the question.

I posted a similar question before, and one of the solutions I got was to take advantage of GitLab's caching mechanism to store the model weights.

However, I have not been able to figure out how to do that exactly. From what I understand of caching, if I enable it, GitLab will download the necessary files from the internet once and reuse them in later pipelines. The solution I am looking for would look something like this:

  • Upload a file to GitLab manually.
  • This file is accessible to all my CI jobs; however, it is not tracked by Git.
  • When the file becomes outdated (because I created a new model), I manually upload the updated file.
  • With the cache workflow, from what I understand, updating the file would mean enabling internet access in the testing suite, letting the pipeline download the new set of weights, and then disabling the internet again once the new cache is set up. That feels hacky and unsafe (unsafe because I never want to enable internet access during testing).

Is there a good solution for this problem?

Ananda
  • Is this running on machine(s) which you control? If so, you could potentially look at another solution of just having the files on the machines (not handled by GitLab CI) and copying them into the `${CI_PROJECT_DIR}` when running the tests. If it's spread over multiple machines, use some kind of NFS mount so it's shared across all of them. (Not exactly good practice, but it is a workaround.) – Rekovni Mar 25 '21 at 11:59

2 Answers


One possible solution, though it may not be flexible enough, is to keep the model file in GitLab CI/CD variables and copy it into the correct path in the job step. GitLab CI supports a binary file as a variable as well.
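Here is a sketch of the restore step, assuming the binary weights were base64-encoded before being stored in a File-type variable named `MODEL_WEIGHTS_B64` (the variable name is hypothetical). GitLab writes a File-type variable's contents to a temporary file and puts that file's path into the environment variable:

```python
# Sketch: restore model weights from a GitLab File-type CI/CD variable.
# Assumes the binary weights were base64-encoded before being stored in a
# variable named MODEL_WEIGHTS_B64 (hypothetical name).
import base64
import os
from pathlib import Path


def restore_weights(dest: str = "models/weights.bin") -> Path:
    src = Path(os.environ["MODEL_WEIGHTS_B64"])  # path to GitLab's temp file
    out = Path(dest)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_bytes(base64.b64decode(src.read_bytes()))
    return out
```

Keep in mind that variable values have size limits, so this is only practical for small weight files.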

Fony Lew

GitLab 16.2 (July 2023) can help:

Track your machine learning model experiments

When data scientists create machine learning (ML) models, they often experiment with different parameters, configurations, and feature engineering, so they can improve the performance of the model.

The data scientists need to keep track of all of this metadata and the associated artifacts, so they can later replicate the experiment. This work is not trivial, and existing solutions require complex setup.

With machine learning model experiments, data scientists can log parameters, metrics, and artifacts directly into GitLab, giving easy access to their most performant models. This feature is an experiment.

(Screenshot: tracking machine learning model experiments in the GitLab UI)

See Documentation and Issue.
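Per the GitLab documentation, the model experiment tracker is compatible with the MLflow client, so logging from Python looks roughly like this (a sketch; the host, project ID, and token are placeholders):

```python
# Sketch: log parameters and metrics to GitLab's MLflow-compatible
# endpoint (GitLab 16.2+). Host, project ID, and token are placeholders.
import os

import mlflow

os.environ["MLFLOW_TRACKING_URI"] = (
    "https://gitlab.example.com/api/v4/projects/<project-id>/ml/mlflow"
)
os.environ["MLFLOW_TRACKING_TOKEN"] = "<personal-access-token>"

mlflow.set_experiment("weights-validation")
with mlflow.start_run():
    mlflow.log_param("model_version", "v2")
    mlflow.log_metric("sanity_score", 0.97)
```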

VonC