My setting
I have developed an environment for ML experiments that works as follows: training happens in the AWS cloud with SageMaker Training Jobs. The trained model is written to the /opt/ml/model directory, which SageMaker reserves for this purpose: whatever ends up there is packed into a model.tar.gz and uploaded to SageMaker's own S3 bucket. Several evaluation metrics are computed during training and testing, and recorded to an MLflow infrastructure with an S3-based artifact store (see Scenario 4 of the MLflow tracking documentation). Note that this is a different S3 bucket from SageMaker's.
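Roughly, the training side looks like the minimal sketch below. `train()` and `evaluate()` are placeholders for the real code, and I'm assuming the tracking URI reaches the container through the `MLFLOW_TRACKING_URI` environment variable, which MLflow picks up automatically:

```python
import os
import mlflow

def train():
    # Placeholder for the actual training code.
    return {"weights": [0.0]}

def evaluate(model):
    # Placeholder for the actual evaluation code.
    return {"accuracy": 0.9, "f1": 0.85}

# SageMaker packs everything written here into model.tar.gz and uploads it
# to its own S3 bucket when the job finishes (SM_MODEL_DIR points to it).
model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")

with mlflow.start_run(run_name="sagemaker-training-job"):
    model = train()
    with open(os.path.join(model_dir, "model.bin"), "w") as f:
        f.write(str(model))  # stand-in for the real serialization
    for name, value in evaluate(model).items():
        mlflow.log_metric(name, value)  # recorded by the MLflow tracking server
```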
A very useful feature of MLflow is that arbitrary artifacts can be logged to a training run, so data scientists have access to both the metrics and more complex outputs through the UI. These outputs include (but are not limited to) the trained model itself.
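From inside the training container this is a single call; here is a sketch assuming `model_dir` is the /opt/ml/model path from above:

```python
import mlflow

model_dir = "/opt/ml/model"  # directory the training code already wrote the model to

with mlflow.start_run():
    # Uploads every file under model_dir to the run's artifact store,
    # shown under a "model" folder in the UI.
    mlflow.log_artifacts(model_dir, artifact_path="model")
    # Single files (plots, reports, ...) work the same way, e.g.:
    # mlflow.log_artifact("confusion_matrix.png", artifact_path="reports")
```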
A limitation is that, as I understand it, the MLflow API for logging artifacts only accepts a local path to the artifact as input, and will always upload it to its own artifact store. This is suboptimal when the artifacts already live somewhere outside MLflow, because you end up storing them twice; a transformer model may weigh more than 1 GB.
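Concretely, the only way I see to attach the SageMaker-produced tarball to a run after the fact is to pull it down and re-upload it, which is exactly the duplication I'd like to avoid (bucket and key names below are made up):

```python
import boto3
import mlflow

# Made-up locations, for illustration only.
sagemaker_bucket = "my-sagemaker-bucket"
model_key = "my-training-job/output/model.tar.gz"
local_copy = "/tmp/model.tar.gz"

# 1) Download the >1 GB tarball from SageMaker's bucket...
boto3.client("s3").download_file(sagemaker_bucket, model_key, local_copy)

# 2) ...then re-upload it, so the same bytes now sit in both SageMaker's
#    bucket and MLflow's artifact store bucket.
with mlflow.start_run():  # or resume the training run by passing run_id=...
    mlflow.log_artifact(local_copy, artifact_path="model")
```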
My questions
- Is there a way to pass an S3 path to MLflow and make it count as an artifact, without having to download it locally first?
- Is there a way to avoid pushing a copy of an artifact to the artifact store? If my artifacts already reside in another remote location, it would be ideal to just have a link to that location in MLflow rather than a copy in MLflow's storage (see the sketch below for what I mean).
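For concreteness, the closest thing I can do today (as far as I can tell) is to record the existing S3 URI on the run as a tag, which stores only the reference but does not make the model show up under the run's artifacts:

```python
import mlflow

with mlflow.start_run():
    # Records only the reference, not the bytes; this appears as a tag,
    # not under the run's "Artifacts" section in the UI.
    mlflow.set_tag(
        "model_s3_uri",
        "s3://my-sagemaker-bucket/my-training-job/output/model.tar.gz",  # made-up URI
    )
```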