I'm following this documentation to use MLflow Pipelines, which requires cloning this repository. If I run the complete pipeline as it is, it works perfectly:
import os
from mlflow.pipelines import Pipeline
# os.chdir() does not expand "~" on its own, so expand it explicitly
os.chdir(os.path.expanduser("~/mlp-regression-template"))
regression_pipeline = Pipeline(profile="local")
# Display a visual overview of the pipeline graph
regression_pipeline.inspect()
# Run the full pipeline
regression_pipeline.run()
But when I try to change the first part to read a different dataset, I get the following error:
mlflow.exceptions.MlflowException: Resolved data file with path '/tmp/tmpv201mpms/precio_leche.csv' does not have the expected format 'parquet'.
Which is correct: the input file is now in CSV format, not Parquet. I added a new file to the data folder and changed the profile file local.yaml:
INGEST_DATA_LOCATION: "./data/precio_leche.csv"
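For reference, the mismatch reported in the error can be reproduced outside of MLflow by comparing the file extension with the configured format. This is only a sketch of that kind of check, not MLflow's actual implementation; `extension_matches_format` is a hypothetical helper name:

```python
from pathlib import Path


def extension_matches_format(file_path: str, expected_format: str) -> bool:
    """Return True if the file's extension equals the expected format (case-insensitive)."""
    # Path.suffix includes the leading dot (e.g. ".csv"), so strip it before comparing
    actual = Path(file_path).suffix.lstrip(".").lower()
    return actual == expected_format.lower()


# The path from the error message, checked against both formats
path = "/tmp/tmpv201mpms/precio_leche.csv"
print(extension_matches_format(path, "parquet"))  # False
print(extension_matches_format(path, "csv"))      # True
```

If the pipeline performs a similar comparison, the error would mean the configured format was still 'parquet' when the step ran.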
What I don't understand is that the ingest step of the pipeline executes the ingest.py code,
whose function is to convert the CSV files it reads:
def load_file_as_dataframe(file_path: str, file_format: str) -> DataFrame:
    """
    Load content from the specified dataset file as a Pandas DataFrame.
    This method is used to load dataset types that are not natively managed by MLflow Pipelines
    (datasets that are not in Parquet, Delta Table, or Spark SQL Table format). This method is
    called once for each file in the dataset, and MLflow Pipelines automatically combines the
    resulting DataFrames together.
    :param file_path: The path to the dataset file.
    :param file_format: The file format string, such as "csv".
    :return: A Pandas DataFrame representing the content of the specified file.
    """
    if file_format == "csv":
        import pandas
        _logger.warning(
            "Loading dataset CSV using `pandas.read_csv()` with default arguments and assumed index"
            " column 0 which may not produce the desired schema. If the schema is not correct, you"
            " can adjust it by modifying the `load_file_as_dataframe()` function in"
            " `steps/ingest.py`"
        )
        return pandas.read_csv(file_path, index_col=0)
    else:
        raise NotImplementedError
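The warning in this function mentions the "assumed index column 0". A small sketch of what `index_col=0` does to the resulting schema, using a made-up two-column CSV (the column names `fecha` and `precio` are assumptions for illustration, not taken from the real file):

```python
import io

import pandas as pd

# Hypothetical two-column CSV, used only for illustration
csv_text = "fecha,precio\n2020-01,10.5\n2020-02,11.0\n"

# With index_col=0 the first column becomes the index, not a data column
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With default arguments every column stays a regular column
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))     # ['precio']
print(list(without_index.columns))  # ['fecha', 'precio']
```

So if the first column of the CSV holds real data, it will disappear from the column list unless `index_col=0` is removed from the loader.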
After getting this error, I also tried changing the pipeline.yaml file to make csv the default format:
format: {{INGEST_DATA_FORMAT|default('csv')}}
But it didn't work either. I also noticed that when I run the ingest step for the default dataset, it returns this summary from the pandas-profiling library:
But I do not see where this is in the code. Am I changing the wrong files? What should I do in order to read CSV files in the ingest step? Also, I'm looking to read several files, not only one.
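On the last point, reading several files: outside of the pipeline, combining several CSV files in plain pandas could look like the sketch below. It only illustrates the "one DataFrame per file, then combine" idea described in the docstring above; `load_csv_files` is a hypothetical helper and the demo files are throwaway data, not part of the template:

```python
import glob
import os
import tempfile

import pandas as pd


def load_csv_files(pattern: str) -> pd.DataFrame:
    """Read every CSV matching the glob pattern and stack the rows into one DataFrame."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"No files match {pattern!r}")
    # One DataFrame per file, then concatenate them row-wise with a fresh index
    frames = [pd.read_csv(p) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Demo with two throwaway files that share the same columns
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, rows in [("a.csv", "x,y\n1,2\n"), ("b.csv", "x,y\n3,4\n")]:
        with open(os.path.join(tmp_dir, name), "w") as fh:
            fh.write(rows)
    combined = load_csv_files(os.path.join(tmp_dir, "*.csv"))

print(combined.shape)  # (2, 2)
```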