I'm following this documentation to use MLflow Pipelines, which requires cloning this repository. If I run the complete pipeline as is, it works perfectly:

import os
from mlflow.pipelines import Pipeline

# os.chdir() does not expand "~" on its own, so expand it explicitly
os.chdir(os.path.expanduser("~/mlp-regression-template"))
regression_pipeline = Pipeline(profile="local")
# Display a visual overview of the pipeline graph
regression_pipeline.inspect()
# Run the full pipeline
regression_pipeline.run()

But when I try to change the first part to read a different dataset, I get the following error:

mlflow.exceptions.MlflowException: Resolved data file with path '/tmp/tmpv201mpms/precio_leche.csv' does not have the expected format 'parquet'.

That makes sense, since the input file is now a CSV rather than a Parquet file: I added a new file to the data folder and changed the profile file local.yaml:

INGEST_DATA_LOCATION: "./data/precio_leche.csv" 

What I don't understand is that the ingest step of the pipeline executes the code in ingest.py, whose function is to load the CSV files it reads as DataFrames:

def load_file_as_dataframe(file_path: str, file_format: str) -> DataFrame:
    """
    Load content from the specified dataset file as a Pandas DataFrame.

    This method is used to load dataset types that are not natively managed by MLflow Pipelines
    (datasets that are not in Parquet, Delta Table, or Spark SQL Table format). This method is
    called once for each file in the dataset, and MLflow Pipelines automatically combines the
    resulting DataFrames together.

    :param file_path: The path to the dataset file.
    :param file_format: The file format string, such as "csv".
    :return: A Pandas DataFrame representing the content of the specified file.
    """

    if file_format == "csv":
        import pandas

        _logger.warning(
            "Loading dataset CSV using `pandas.read_csv()` with default arguments and assumed index"
            " column 0 which may not produce the desired schema. If the schema is not correct, you"
            " can adjust it by modifying the `load_file_as_dataframe()` function in"
            " `steps/ingest.py`"
        )
        return pandas.read_csv(file_path, index_col=0)
    else:
        raise NotImplementedError

After getting this error, I also tried changing the pipeline.yaml file to make csv the default format:

  format: {{INGEST_DATA_FORMAT|default('csv')}}

But that didn't work either. I also noticed that when I run the ingest step for the default dataset, it shows a summary generated by the pandas-profiling library, but I do not see where that happens in the code.

Am I changing the wrong files? What should I do in order to read CSV files in the ingest step? I'm also looking to read several files, not just one.

Luis Ramon Ramirez Rodriguez
  • You should just need to change the configuration in `./profiles/local.yaml` to get the ingestion part working. Specifically, set `INGEST_DATA_FORMAT: csv`. Once you have this, the `Pipeline` object should use the `local.yaml` profile and feed it into `./pipeline.yaml`, specifically at https://github.com/mlflow/mlp-regression-template/blob/main/pipeline.yaml#L36. I don't know this specific syntax, but it looks like it checks whether you supplied an `INGEST_DATA_FORMAT` in your local profile and, if not, uses their `default`. – jav Sep 21 '22 at 00:15
  • There may be an easier way to check this, but after you do `Pipeline('local')`, check whether the ingestion format is `csv` by doing `regression_pipeline._steps[0].dataset.dataset_format` – jav Sep 21 '22 at 00:16
  • As a rule: don't `chdir` ever from your code. :-) – erip Sep 21 '22 at 20:19
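
The `{{INGEST_DATA_FORMAT|default('csv')}}` syntax discussed in the comments above looks like a Jinja2 expression using the `default` filter. The sketch below is not MLflow's own rendering code; it only assumes that `pipeline.yaml` is rendered with Jinja2-style semantics and that the profile's keys are passed in as variables. The helper name `render_format_line` is hypothetical. It shows why a profile that sets only `INGEST_DATA_LOCATION` would still end up with the template's default format:

```python
from jinja2 import Template

# A single line in the style of pipeline.yaml, with 'parquet' as the default.
FORMAT_LINE = "format: {{INGEST_DATA_FORMAT|default('parquet')}}"


def render_format_line(profile: dict) -> str:
    """Render the line with the profile's keys as template variables (hypothetical helper)."""
    return Template(FORMAT_LINE).render(**profile)


# Profile that only overrides the location: the default filter applies.
without_override = render_format_line(
    {"INGEST_DATA_LOCATION": "./data/precio_leche.csv"}
)

# Profile that also sets the format: the profile value wins over the default.
with_override = render_format_line(
    {
        "INGEST_DATA_LOCATION": "./data/precio_leche.csv",
        "INGEST_DATA_FORMAT": "csv",
    }
)

print(without_override)  # format: parquet
print(with_override)     # format: csv
```

If that assumption holds, adding `INGEST_DATA_FORMAT: csv` to the profile, as suggested in the first comment, should be enough without editing the default in `pipeline.yaml`.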

1 Answer

According to MLflow's documentation, your local.yaml file should look something like this:

data:
  location: {{INGEST_DATA_LOCATION|default('~/your/custom/data/path/single_file.csv')}}
  format: {{INGEST_DATA_FORMAT|default('csv')}}
  custom_loader_method: steps.ingest.load_file_as_dataframe

rather than what you have shared in your question.

As for multiple-file ingestion, you might want to keep all your files at the INGEST_DATA_LOCATION and/or modify the default to include a wildcard such as *.csv.
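
To see what a wildcard such as `*.csv` would pick up, the following sketch expands a pattern and combines the matching files with plain pandas, outside of MLflow. It is not how MLflow Pipelines resolves `INGEST_DATA_LOCATION`; the function name `load_csv_files` and the sample columns `id` and `price` are hypothetical. It reuses `index_col=0` from the template's `load_file_as_dataframe()`:

```python
import glob
import os
import tempfile

import pandas as pd


def load_csv_files(pattern: str) -> pd.DataFrame:
    """Read every CSV matching the glob pattern and stack the rows (hypothetical helper)."""
    paths = sorted(glob.glob(os.path.expanduser(pattern)))
    if not paths:
        raise FileNotFoundError(f"No files match pattern: {pattern}")
    # One DataFrame per file, using the same arguments as the template loader.
    frames = [pd.read_csv(path, index_col=0) for path in paths]
    return pd.concat(frames)


# Demo with two small made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    samples = {
        "part_a.csv": "id,price\n0,1.5\n1,2.0\n",
        "part_b.csv": "id,price\n2,3.0\n",
    }
    for name, content in samples.items():
        with open(os.path.join(tmp_dir, name), "w") as handle:
            handle.write(content)
    combined = load_csv_files(os.path.join(tmp_dir, "*.csv"))

print(len(combined))  # 3
```

Running this against your own data folder is a quick way to confirm that all files share the same columns before handing them to the ingest step.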

Bilal Qandeel