I started with a DataFrame. Strangely, I need to extract the paths of its underlying files in order to convert it to a DeltaTable. Even more strangely, the column names are lost on the resulting DeltaTable. What is the thinking behind this? Do we always need to pair up the DeltaTable with its DataFrame and track the association ourselves? If so, why is there a toDF() method on DeltaTable at all? It loses the column info! Here is the helper I ended up writing:
from typing import Optional, Tuple
from pyspark.sql import DataFrame
from pyspark.sql.functions import input_file_name
from delta.tables import DeltaTable

@classmethod  # defined on a helper class, hence the cls parameter
def convert_df_to_delta_by_path(cls, df: DataFrame,
                                out_path: Optional[str] = None) -> Tuple[DeltaTable, str]:
    spark = df.sparkSession
    # we need to perform gymnastics to extract the filename
    # and construct the parquet 'identifier'..
    df_files_path_raw = df.withColumn("filename", input_file_name()).select("filename")
    df_files_path = df_files_path_raw.first()["filename"]
    df_files_path = df_files_path[0:df_files_path.rfind("/")]
    out_path = out_path or df_files_path  # fall back to the extracted directory
    output_id = f"parquet.`{out_path}`"
    delta_table = DeltaTable.convertToDelta(spark, output_id)
    print(delta_table.toDF())
    return delta_table, out_path
We lose the column info during the conversion from DataFrame to DeltaTable, so the output is:
DataFrame[_c0: string, _c1: string, _c2: string, _c3: string,
_c4: string, _c5: string, _c6: string, _c7: string, _c8: string,
_c9: string, _c10: string, _c11: string, _c12: string, _c13: string]
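For comparison, checking the schema on both sides (a sketch; the actual column names come from my data, so I am not reproducing them here) shows the names only disappear on the DeltaTable side:

df.printSchema()                  # shows the original, meaningful column names
delta_table.toDF().printSchema()  # shows _c0, _c1, ... as above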
What, then, is the expected/recommended pattern for managing the DataFrame/DeltaTable interaction? In particular, is there a more direct route for converting between them (one that does not involve extracting file paths) that also retains column names and other metadata?
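For reference, the kind of round trip I was hoping for looks roughly like the sketch below. I have not confirmed this is the recommended pattern, and the output path is purely illustrative, but writing in Delta format and re-opening the table by path does seem to keep the schema:

out_path = "/tmp/my_table"  # illustrative path, not my real location

# write the DataFrame out in Delta format; the schema (including column names)
# is recorded in the Delta transaction log
df.write.format("delta").mode("overwrite").save(out_path)

# re-open the same location as a DeltaTable and go back to a DataFrame
delta_table = DeltaTable.forPath(spark, out_path)
delta_table.toDF().printSchema()  # column names are preserved here

If that is indeed the intended approach, it would be good to understand why the convertToDelta route behaves differently.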