
We are using Databricks on AWS infrastructure, registering models with MLflow. We write our in-project imports as from src.(module location) import (objects).

Following examples online, I expected that using mlflow.pyfunc.log_model(..., code_path=['PROJECT_ROOT/src'], ...) would add the entire code tree to the model's running environment and thus allow us to keep our imports as they are.

When logging the model, I get a long list of [Errno 95] Operation not supported errors, one for each notebook in our repo. This blocks us from registering the model in MLflow.

We have used several ad-hoc solutions and workarounds: forcing ourselves to work with all code in one file, working only with files in the same directory (code_path=['./filename.py']), adding specific libraries (and changing import paths accordingly), etc.

However, none of these is optimal. As a result we either duplicate code (killing DRY), or we put some imports inside the wrapper (i.e. those that cannot be run in our working environment, since it differs from the one the model will see when deployed), etc.

We have not yet tried putting all the notebooks (which we believe cause the [Errno 95] Operation not supported errors) in a separate folder. That would be highly disruptive to our current processes, and we'd like to avoid it as much as we can.

Please advise


1 Answer


I had a similar struggle with Databricks when using custom model logic from a src directory (similar structure to cookiecutter-data-science). The solution was to log the entire src directory using its relative path.

So if you have the following project structure:

.
├── notebooks
│   └── train.py
└── src
    ├── __init__.py
    └── model.py

Your train.py should look like this. Note that AddN comes from the MLflow docs.

import mlflow

from src.model import AddN

model = AddN(n=5)

mlflow.pyfunc.log_model(
    registered_model_name="add_n_model",
    artifact_path="add_n_model",
    python_model=model,
    code_path=["../src"],
)

This will copy all the code in src/ and log it as part of the MLflow artifact, allowing the model to resolve all its dependencies when loaded.

If you are not using a notebooks/ directory, set code_path=["src"]. If you are using subdirectories like notebooks/train/train.py, set code_path=["../../src"].

David