
I am currently writing an MLflow artifact to DBFS, but I am doing it with pandas, using the code below...

import tempfile
import mlflow

temp = tempfile.NamedTemporaryFile(prefix="*****", suffix=".csv")
temp_name = temp.name
try:
  df.to_csv(temp_name, index=False)  # df is a pandas DataFrame
  mlflow.log_artifact(temp_name, "******")
finally:
  temp.close() # Delete the temp file

How would I write this if 'df' were a Spark DataFrame?

2 Answers


You just need to use file path URLs with the proper protocol. "dbfs" is the generic Databricks one. For Azure, "abfss" would be needed. (I cannot recall AWS's S3 extension.)

filepath="dbfs:///filepath"
df # My Spark DataFrame
df.write.csv(filepath)
mlflow.log_artifact(temp_name, filepath)
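
For reference, here is a slightly fuller sketch of this approach, assuming a Databricks cluster where DBFS is also reachable through the /dbfs FUSE mount; df is the Spark DataFrame from the question, and dbfs:/tmp/my_spark_csv plus the spark_csv artifact folder are just placeholder names:

import mlflow

dbfs_uri = "dbfs:/tmp/my_spark_csv"    # Spark-style URI (placeholder path)
fuse_path = "/dbfs/tmp/my_spark_csv"   # same location through the FUSE mount

with mlflow.start_run():
    # Spark writes a directory of part files, not a single CSV file
    df.write.mode("overwrite").csv(dbfs_uri, header=True)
    # MLflow's logging APIs take local paths, so log the directory via the mount
    mlflow.log_artifacts(fuse_path, artifact_path="spark_csv")

If you need a single CSV file rather than a directory of part files, calling df.coalesce(1) before the write is a common workaround, at the cost of funnelling everything through one task.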
Dan Ciborowski - MSFT

It looks like in your case the problem has to do with how the Spark APIs access the filesystem vs. how the Python APIs access it; see here for details. This is probably not the recommended way (I'm fairly new to Databricks myself), but if you're on a single node you can write your parquet to the local filesystem and let MLflow log it from there with something like:

import tempfile
import mlflow

with tempfile.TemporaryDirectory() as tmpdirname:
    df.write.parquet(f'file:{tmpdirname}/my_parquet_table')  # 'file:' targets the local filesystem
    mlflow.log_artifacts(tmpdirname, artifact_path='my_parquet_table_name')

Keep in mind that a parquet "file" is actually a directory with a whole bunch of files in it, so you need to use log_artifacts, not log_artifact. If you don't specify artifact_path, you'll get all the little files that make up the parquet file (directory) dumped directly into the root of your MLflow artifacts. Also, MLflow doesn't have any preview capability for parquet files, so depending on your use case, logging parquet artifacts may not be as convenient as it first seems.
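
To make the artifact_path behaviour concrete, here is a small sketch that reuses the snippet above (df is assumed to be the Spark DataFrame from the question, and an MLflow tracking setup is assumed) and then lists what actually landed in the artifact store; my_parquet_table and my_parquet_table_name are just placeholder names:

import tempfile
import mlflow
from mlflow.tracking import MlflowClient

with mlflow.start_run() as run:
    with tempfile.TemporaryDirectory() as tmpdirname:
        df.write.parquet(f'file:{tmpdirname}/my_parquet_table')
        # artifact_path keeps the part files grouped under one folder
        mlflow.log_artifacts(tmpdirname, artifact_path='my_parquet_table_name')

# everything ends up nested under my_parquet_table_name/my_parquet_table/...
client = MlflowClient()
for info in client.list_artifacts(run.info.run_id, 'my_parquet_table_name'):
    print(info.path)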

HTH