
My setting

I have developed an environment for ML experiments that looks like the following: training happens in the AWS cloud with SageMaker Training Jobs. The trained model is stored in the /opt/ml/model directory, which SageMaker reserves for packaging the model as a .tar.gz into SageMaker's own S3 bucket. Several evaluation metrics are computed during training and testing, and recorded to an MLflow infrastructure consisting of an S3-based artifact store (Scenario 4 in the MLflow tracking documentation). Note that this is a different S3 bucket from SageMaker's.

A very useful feature of MLflow is that arbitrary artifacts can be logged to a training run, so data scientists have access to both metrics and more complex outputs through the UI. These outputs include (but are not limited to) the trained model itself.

A limitation is that, as I understand it, the MLflow API for logging artifacts only accepts as input a local path to the artifact itself, and will always upload it to its artifact store. This is suboptimal when the artifacts are stored somewhere outside MLflow, as you have to store them twice; for example, a transformer model may weigh more than 1 GB.
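For illustration, this is the call pattern I am referring to; the local path is just an example:

```python
import mlflow

# log_artifact takes a path on the local filesystem and copies the file
# into the run's artifact store, i.e. it creates the second copy
# described above. The path here is only illustrative.
with mlflow.start_run():
    mlflow.log_artifact("/opt/ml/model/model.tar.gz")
```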

My questions

  • Is there a way to pass an S3 path to MLflow and make it count as an artifact, without having to download it locally first?
  • Is there a way to avoid pushing a copy of an artifact to the artifact store? If my artifacts already reside in another remote location, it would be ideal to have just a link to that location in MLflow rather than a copy in MLflow's storage.
Javier Beltrán

2 Answers


You can use a Tracking Server with S3 as the artifact store backend.
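For example, a minimal client-side sketch; the server URI and bucket name are placeholders:

```python
import mlflow

# Assumes a tracking server was started with an S3 artifact root, e.g.:
#   mlflow server --backend-store-uri sqlite:///mlflow.db \
#                 --default-artifact-root s3://my-mlflow-bucket/artifacts
mlflow.set_tracking_uri("http://my-tracking-server:5000")

with mlflow.start_run():
    mlflow.log_metric("accuracy", 0.93)
    # Note: artifacts logged this way are still uploaded to the server's
    # S3 artifact root, so a second copy of the file is created.
    mlflow.log_artifact("/opt/ml/model/model.tar.gz")
```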

Sergey Pryvala
    I don't think this answers the question. Javier has set up the store already, but is asking for handling pointers: "suboptimal when the artifacts are stored somewhere outside MLflow". – Maciej Skorski May 20 '23 at 03:53

Based on this motivation:

will always upload it to its artifact store. This is suboptimal when the artifacts are stored somewhere outside MLflow, as you have to store them twice

I read the question as a request for handling artefacts via references to external objects not managed by MLflow Tracking. I am afraid this may be tricky, as MLflow is designed to manage artefacts (read/write) in its own structured way (schemas).

You can do one of the following:

  1. [Partial integration] Log paths as artefacts (so you keep pointers/references to the external objects under your runs) and manage them with custom code; see the sketch after this list. In addition, in the near future we can hope for more features from the MLflow model API, which is under active development and has variants supporting various libraries, from lightweight sklearn to Transformers.
  2. [Full integration] Your use case can, in principle, be solved by following the workflow for pre-existing models, where you define your own data loaders (e.g. they can deserialize a pickled object from a remote location, convert it to an MLflow-compatible format, and so on). But this is an advanced setup and rarely recommended.
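A minimal sketch of option 1, assuming the model already sits in SageMaker's bucket; the URIs below are hypothetical:

```python
import mlflow

# Record a pointer to the model that already lives in SageMaker's bucket
# instead of re-uploading the heavy file to MLflow's artifact store.
model_s3_uri = "s3://sagemaker-bucket/training-job/output/model.tar.gz"

with mlflow.start_run():
    # A tag (or param) keeps the reference queryable from the UI and API.
    mlflow.set_tag("model_s3_uri", model_s3_uri)
    # Optionally also log a tiny JSON artifact holding the pointer, so it
    # shows up alongside the run's other artifacts.
    mlflow.log_dict({"model_s3_uri": model_s3_uri}, "model_reference.json")

# Downstream code then resolves the tag/JSON itself (e.g. with boto3)
# rather than relying on MLflow's artifact download APIs.
```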
Maciej Skorski