
I started with a DataFrame. Strangely, I need to extract the file paths to convert it to a DeltaTable. Even more strangely, the column names are lost on the DeltaTable. What is the thinking behind this? Do we always need to pair up the DeltaTable with its DataFrame and track the pairing ourselves? If so, why is there a toDF() method on DeltaTable? It loses the column info!

    from typing import Tuple

    from delta.tables import DeltaTable
    from pyspark.sql import DataFrame
    from pyspark.sql.functions import input_file_name

    def convert_df_to_delta_by_path(cls, df: DataFrame,
            out_path: str = None) -> Tuple[DeltaTable, str]:
        spark = df.sparkSession

        # We need to perform gymnastics to extract the filename
        # and construct the parquet 'identifier'.
        df_files_path_raw = (df.withColumn("filename", input_file_name())
                               .select('filename'))
        df_files_path = df_files_path_raw.first()['filename']
        df_files_path = df_files_path[0:df_files_path.rfind('/')]
        if out_path is None:
            out_path = df_files_path
        output_id = f"parquet.`{out_path}`"
        delta_table = DeltaTable.convertToDelta(spark, output_id)
        print(delta_table.toDF())
        return delta_table, out_path

The column info was lost during the conversion from DataFrame to DeltaTable, so the output is:

    DataFrame[_c0: string, _c1: string, _c2: string, _c3: string,
    _c4: string, _c5: string, _c6: string, _c7: string, _c8: string,
    _c9: string, _c10: string, _c11: string, _c12: string, _c13: string]

What, then, is the expected/recommended pattern for managing the DataFrame/DeltaTable interaction, including a more direct route for converting between them (one that does not involve extracting file paths) that also retains column names and other metadata?
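
For reference, the most direct route I have found so far skips the parquet-path extraction entirely: write the DataFrame out in Delta format and load the DeltaTable back by path. A minimal sketch (the `/tmp/delta_demo` path and the toy columns are placeholders, not from my actual job):

    from delta.tables import DeltaTable

    # Toy DataFrame with explicit column names so we can verify they survive.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # Write directly in Delta format -- no parquet-path gymnastics needed.
    path = "/tmp/delta_demo"  # placeholder path
    df.write.format("delta").mode("overwrite").save(path)

    # Load the DeltaTable back by path; toDF() returns a DataFrame
    # with the original column names ('id', 'label') intact.
    delta_table = DeltaTable.forPath(spark, path)
    delta_table.toDF().printSchema()

This does keep the column names, since Delta stores the schema in its transaction log, but it forces an extra write, so I am unsure whether it is the intended pattern.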

