How do I unit test a function in the CI pipeline that uses model files that are not part of the git remote?

Question

I am developing machine learning repositories that require fairly large trained model files to run. These files are not part of the git remote; they are tracked by DVC and saved in a separate remote storage. I am running into issues when I try to run unit tests in the CI pipeline for functions that need these model files to make their predictions. Since the files are not available in the git remote, I can't test those functions.

What is the best practice in this situation? I can think of a couple of options:

1. Pull the models from the DVC remote inside the CI pipeline. I don't want to do this because downloading models every time I push some code will quickly eat up my CI usage minutes and is an expensive option.
2. Use unittest.mock to simulate the output of the model prediction and test the other parts of my code. This is what I am doing now, but it's sort of a pain with unittest's mock functionality. That module wasn't really developed with ML in mind, from what I can tell. It's missing (or makes it hard to find) some functionality that I would really have liked. Are there any good tools for doing this geared specifically towards ML?
3. Restructure the function definition so that I can essentially do option 2 without a mock module. That is, just test the surrounding logic and don't worry about the model output.
4. Just put the model files in the git remote and be done with it. Only use DVC to track data.

What do people usually do in this situation?

Shcheklein · Accepted Answer · 2020-10-22T00:09:09.590

If we are talking about unit tests, I think it is indeed better to use a mock. It's best to keep unit tests small and focused on the actual logic of the unit. It's good to also have other tests that pull the model and run some logic on top of it; I would call those integration tests.
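To make the mock approach a bit more concrete, here is a minimal sketch using only the standard library. The function `summarize_predictions`, the `predict` method, and `fake_predict` are hypothetical names made up for illustration, not taken from the question. It relies on `side_effect`, which lets a mock compute its return value from the arguments it was called with, so the fake output can depend on the input and vary across calls inside a loop.

```python
import random
import unittest
from unittest import mock


def summarize_predictions(model, batches):
    """Hypothetical function under test: calls the model in a loop and
    post-processes each prediction."""
    results = []
    for batch in batches:
        scores = model.predict(batch)
        results.append(max(scores))
    return results


def fake_predict(batch):
    """Stand-in for the real model (an assumption for this sketch).

    The output is derived from the input, so different inputs give
    different outputs, while the same input stays reproducible.
    """
    rng = random.Random(sum(batch))  # seed from the input
    return [rng.random() for _ in batch]


class SummarizePredictionsTest(unittest.TestCase):
    def test_one_result_per_batch(self):
        model = mock.Mock()
        # side_effect receives the call arguments and returns the fake output
        model.predict.side_effect = fake_predict
        batches = [[1, 2, 3], [4, 5], [6]]

        results = summarize_predictions(model, batches)

        self.assertEqual(len(results), len(batches))
        self.assertEqual(model.predict.call_count, len(batches))
        for result in results:
            self.assertTrue(0.0 <= result < 1.0)
```

The same pattern should carry over to numpy arrays: have the `side_effect` function build an array whose shape is derived from the input, for example with a seeded random generator.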

It's not black and white though. If you for some reason see that it's easier to use an actual model (e.g. it changes a lot and it is easier to use it instead of maintaining and updating stubs/fixtures), you could potentially cache it.

To help you with the mock, I think you would need to share some technical details: what the function looks like, what you have tried, what breaks, etc.

I don't want to do this because downloading models every time you want to push some code will quickly eat up my usage minutes for CI and is an expensive option.

I think you can use the CI system's cache to avoid downloading the model over and over again. Both GitHub Actions and CircleCI support this, and the idea is the same across all common CI providers. Which one are you considering, by the way?
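As a rough illustration of the caching idea, the sketch below fetches a model only when it is not already on disk. It assumes the CI cache has been configured to restore the model file between runs. The helper name `ensure_model` and its `fetch` parameter are hypothetical; the default fetcher shells out to `dvc pull` with the file as its target.

```python
import subprocess
from pathlib import Path


def ensure_model(model_path, fetch=None):
    """Return model_path, fetching the file only if it is missing.

    `fetch` is a callable that takes the path and downloads the file.
    By default it runs `dvc pull <path>` (assumes DVC is installed and
    the remote is configured in the CI environment).
    """
    model_path = Path(model_path)
    if model_path.exists():
        return model_path  # cache hit: nothing to download

    if fetch is None:
        def fetch(path):
            subprocess.run(["dvc", "pull", str(path)], check=True)

    fetch(model_path)
    if not model_path.exists():
        raise FileNotFoundError(f"fetch did not produce {model_path}")
    return model_path
```

Integration tests could call this helper in their setup, so that only the first pipeline run after a model change pays for the download.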

Just put the model files in the git remote and be done with it. Only use DVC to track data.

This can work, but if the models are large enough you will pollute the Git history significantly. On some CI systems it can even become slower, since they will fetch the models with a regular git clone, effectively downloading them anyway.

By the way, whether or not you use DVC, take a look at CML, another open-source project made specifically for CI/CD for ML.

Thank you for the answer. I wasn't aware of caching and that does sound like a good option to try. I am using GitLab CI and it also seems to support caching. The problem with mocking arises when I need to generate large numpy arrays whose output is supposed to be different every time the function is invoked (within a loop of the function being tested). In addition, I would like the mock function to generate its output based on the input it is given. I am not sure how to make these work. - Ananda, Oct 26 '20 at 03:41
