
I am currently writing an MLflow artifact to DBFS, but I am doing it with pandas, using the code below...

import tempfile
import mlflow

temp = tempfile.NamedTemporaryFile(prefix="*****", suffix=".csv")
temp_name = temp.name
try:
  df.to_csv(temp_name, index=False)  # df is a pandas DataFrame
  mlflow.log_artifact(temp_name, "******")
finally:
  temp.close() # Delete the temp file

How would I write this if 'df' were a Spark DataFrame?

2 Answers


You just need to use file path URLs with the proper protocol. "dbfs" is the generic Databricks one. For Azure, "abfss" would be needed. (I cannot recall AWS's S3 extension.)

filepath="dbfs:///filepath"
df # My Spark DataFrame
df.write.csv(filepath)
mlflow.log_artifact(temp_name, filepath)
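
For reference, here is a slightly fuller sketch of this approach, assuming a Databricks cluster where DBFS is also reachable through the /dbfs FUSE mount; df is the Spark DataFrame from the question, and dbfs:/tmp/my_spark_csv plus the spark_csv artifact folder are just placeholder names:

import mlflow

dbfs_uri = "dbfs:/tmp/my_spark_csv"    # Spark-style URI (placeholder path)
fuse_path = "/dbfs/tmp/my_spark_csv"   # same location through the FUSE mount

with mlflow.start_run():
    # Spark writes a directory of part files, not a single CSV file
    df.write.mode("overwrite").csv(dbfs_uri, header=True)
    # MLflow's logging APIs take local paths, so log the directory via the mount
    mlflow.log_artifacts(fuse_path, artifact_path="spark_csv")

If you need a single CSV file rather than a directory of part files, calling df.coalesce(1) before the write is a common workaround, at the cost of funnelling everything through one task.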
Dan Ciborowski - MSFT

It looks like in your case the problem has to do with how the Spark APIs access the filesystem vs. how the Python APIs access it; see here for details. This is probably not the recommended way (I'm fairly new to Databricks myself), but if you're on a single node you can write your parquet to the local filesystem and let MLflow log it from there with something like:

import tempfile
import mlflow

with tempfile.TemporaryDirectory() as tmpdirname:
    df.write.parquet(f'file:{tmpdirname}/my_parquet_table')  # 'file:' targets the local filesystem
    mlflow.log_artifacts(tmpdirname, artifact_path='my_parquet_table_name')

Keep in mind that a parquet "file" is actually a directory with a whole bunch of files in it, so you need to use log_artifacts, not log_artifact. If you don't specify artifact_path, you'll get all the little files that make up the parquet file (directory) dumped directly into the root of your MLflow artifacts. Also, MLflow doesn't have any preview capability for parquet files, so depending on your use case, logging parquet artifacts may not be as convenient as it first seems.
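
To make the artifact_path behaviour concrete, here is a small sketch that reuses the snippet above (df is assumed to be the Spark DataFrame from the question, and an MLflow tracking setup is assumed) and then lists what actually landed in the artifact store; my_parquet_table and my_parquet_table_name are just placeholder names:

import tempfile
import mlflow
from mlflow.tracking import MlflowClient

with mlflow.start_run() as run:
    with tempfile.TemporaryDirectory() as tmpdirname:
        df.write.parquet(f'file:{tmpdirname}/my_parquet_table')
        # artifact_path keeps the part files grouped under one folder
        mlflow.log_artifacts(tmpdirname, artifact_path='my_parquet_table_name')

# everything ends up nested under my_parquet_table_name/my_parquet_table/...
client = MlflowClient()
for info in client.list_artifacts(run.info.run_id, 'my_parquet_table_name'):
    print(info.path)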

HTH