I am using Auto Loader with the BinaryFile option to decode .proto-based files in Databricks. I am able to decode the proto files and write them out in CSV format using foreach() and the pandas library, but I am struggling to write them in Delta format. Ultimately I want to write Delta directly and avoid an extra hop in storage, i.e., landing the data in CSV first.
There are a few approaches I could think of, but each has challenges:
- Convert the pandas DataFrame to a Spark DataFrame. This requires the SparkContext to call createDataFrame, but the SparkContext cannot be used on worker nodes.
- Avoid pandas entirely. I would still need to create a DataFrame, which is not possible inside foreach() since the load is distributed across the workers.
- Other ways like a UDF, where I would decode and then explode the string returned from the decoding. But that does not apply here, because the input is a file format Spark does not natively understand, i.e., proto.
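For context, this is roughly what the working CSV path looks like inside foreach(). Note that decode_payload here is only a placeholder standing in for my actual proto decoding, and the output path is made up:

```python
import pandas as pd

def decode_payload(raw: bytes) -> dict:
    # Placeholder: stands in for the real proto -> record decoding logic.
    return {"payload": raw.decode("utf-8", errors="replace")}

def decode_proto(row):
    # Runs on a worker: read the binary file, decode it,
    # then dump the record to CSV via a pandas DataFrame.
    with open(row["path"], "rb") as f:
        record = decode_payload(f.read())
    pd.DataFrame([record]).to_csv("some_output_path.csv", index=False)
```

This works, but it is exactly the extra CSV hop I am trying to remove.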
I also came across a few blog posts and references, but none of them help with the foreach() and BinaryFile combination:
https://github.com/delta-io/delta-rs/tree/main/python - This is not yet stable in Python.
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.to_delta.html - This runs into challenge 2 mentioned above.
Any leads on this would be much appreciated.
Below is a skeleton code snippet for reference:
cloudfile_options = {
    "cloudFiles.subscriptionId": subscription_ID,
    "cloudFiles.connectionString": queue_connection_string,
    "cloudFiles.format": "BinaryFile",
    "cloudFiles.tenantId": tenant_ID,
    "cloudFiles.clientId": client_ID,
    "cloudFiles.clientSecret": client_secret,
    "cloudFiles.resourceGroup": storage_resource_group,
    "cloudFiles.useNotifications": "true"
}
reader_df = spark.readStream.format("cloudFiles") \
    .options(**cloudfile_options) \
    .load("some_storage_input_path")
def decode_proto(row):
    # Runs on a worker; foreach() passes one row at a time
    with open(row["path"], "rb") as f:
        # Do decoding
        # Convert decoded string to JSON and write to storage using a pandas DataFrame
        ...
write_stream = reader_df.select("path") \
    .writeStream \
    .foreach(decode_proto) \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(once=True) \
    .start()
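One direction I have been sketching is to switch from foreach() to foreachBatch(): the batch handler receives an ordinary Spark DataFrame and runs on the driver, where createDataFrame and Delta writes are available. Since BinaryFile also exposes the file bytes as a content column, the handler would not even need to reopen the files. In the sketch below, decode_rows is a placeholder for my actual proto decoding and the output path is made up. Would something like this be viable?

```python
def decode_rows(rows):
    # Placeholder: each row carries (path, content); this stands in
    # for the real proto -> record decoding.
    return [
        {"path": r["path"], "decoded": r["content"].decode("utf-8", errors="replace")}
        for r in rows
    ]

def write_decoded_batch(batch_df, batch_id):
    # Runs on the driver once per micro-batch, so spark.createDataFrame
    # and Delta writes are usable here, unlike inside foreach().
    rows = batch_df.select("path", "content").collect()
    decoded = decode_rows(rows)
    if decoded:
        spark.createDataFrame(decoded) \
            .write.format("delta").mode("append").save("some_delta_output_path")

# write_stream = reader_df.writeStream \
#     .foreachBatch(write_decoded_batch) \
#     .option("checkpointLocation", checkpoint_path) \
#     .trigger(once=True) \
#     .start()
```

My main doubt is whether collecting each micro-batch's file contents to the driver scales for larger files.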