I am developing machine learning repositories that require fairly large trained model files to run. These files are not part of the git remote but is tracked by DVC and is saved in a separate remote storage. I am running into issues when I am trying to run unit tests in the CI pipeline for functions that require these model files to make their prediction. Since I don't have access them in the git remote, I can't test them.
What is the best practice that people usually do in this situation? I can think of couple of options -
- Pull the models from the DVC remote inside the CI pipeline. I don't want to do this becasue downloading models every time you want to run push some code will quickly eat up my usage minutes for CI and is an expensive option.
- Use
unittest.mock
to simulate the output of from the model prediction and test other parts of my code. This is what I am doing now but it's sort of a pain with unittest's mock functionalities. That module wasn't really developed with ML in mind from what I can tell. It's missing (or is hard to find) some functionalities that I would have really liked. Are there any good tools for doing this geared specifically towards ML? - Do weird reformatting of the function definition that allows me to essentially do option 2 but without a mock module. That is, just test the surrounding logic and don't worry about the model output.
- Just put the model files in the git remote and be done with it. Only use DVC to track data.
What do people usually do in this situation?