
I am using Auto Loader with the BinaryFile format to decode .proto-based files in Databricks. I can decode each proto file and write the result in CSV format using foreach() and the pandas library, but I am having trouble writing it in Delta format. Ultimately, I want to write directly to Delta and avoid one more hop in storage, i.e. storing the data as CSV first.

I can think of a few approaches, but each has a challenge:

  1. Convert the pandas DataFrame to a Spark DataFrame. I would have to use the SparkContext to call createDataFrame, but I can't broadcast the SparkContext to the worker nodes.
  2. Avoid using a pandas DataFrame. I would still need to create a DataFrame, which is not possible within foreach() (since the load is distributed across workers).
  3. Other approaches such as a UDF, where I would decode the file and explode the string returned from the decode. But I don't think that is applicable here, because the input is a file format Spark does not natively support, i.e. proto.

I also came across a few references, but they were not helpful for foreach() with the BinaryFile option:

  1. https://github.com/delta-io/delta-rs/tree/main/python - This is not stable yet in Python.

  2. https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.to_delta.html - This leads back to challenge 2 mentioned above.

Any leads on this are much appreciated.

Below is a skeleton code snippet for reference:

cloudfile_options = {
    "cloudFiles.subscriptionId": subscription_ID,
    "cloudFiles.connectionString": queue_connection_string,
    "cloudFiles.format": "BinaryFile", 
    "cloudFiles.tenantId":tenant_ID,
    "cloudFiles.clientId":client_ID,
    "cloudFiles.clientSecret":client_secret,
    "cloudFiles.resourceGroup": storage_resource_group,
    "cloudFiles.useNotifications" :"true"
}
 
reader_df = spark.readStream.format("cloudFiles") \
                             .options(**cloudfile_options) \
                             .load("some_storage_input_path")
                             
 
def decode_proto(row):
    # foreach() calls this once per row on the workers; its return value is ignored
    with open(row['path'], 'rb') as f:
        data = f.read()
    # Do decoding
    # Convert decoded string to JSON and write to storage using pandas df
     
 
write_stream = reader_df.select("path") \
                        .writeStream \
                        .foreach(decode_proto) \
                        .option("checkpointLocation", checkpoint_path) \
                        .trigger(once=True) \
                        .start()
pavan
  • Have you tried specifying format attribute? As in `write_stream = reader_df.writeStream.format('delta')...`? – Saideep Arikontham Nov 02 '22 at 05:52
  • @SaideepArikontham, since I am decoding within my decode_proto method, I would need to return something from it in order to write as Delta the way you suggested. But with foreach(), the function is expected to return nothing. Thank you for the response. – pavan Nov 04 '22 at 02:27
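
One possible direction, sketched here only as an illustration and not tested against the setup in the question: the binary file source exposes the raw bytes of each file in a `content` column, so the decoding could happen in a UDF that returns rows, and the stream could then be written with `format("delta")` instead of `foreach()`. In the sketch below, `decode_proto_bytes`, `start_delta_stream`, `output_path` and the `record_json` column name are hypothetical names introduced for this example. The decoder body is a placeholder that treats each non-empty line as one record; the real protobuf parsing (for example, a generated message class) would go there.

```python
import json
from typing import List


def decode_proto_bytes(content: bytes) -> List[str]:
    """Placeholder decoder (assumption): returns one JSON string per record.

    Replace the body with the real protobuf parsing. Here, each non-empty
    line of the payload is treated as one record, purely for illustration.
    """
    records = []
    for line in content.decode("utf-8").splitlines():
        if line.strip():
            records.append(json.dumps({"value": line.strip()}))
    return records


def start_delta_stream(reader_df, output_path, checkpoint_path):
    """Decode the `content` column in a UDF and write the result as Delta."""
    # Imported lazily so the decoder above can be exercised without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    # The UDF runs on the workers and returns an array of JSON strings per file.
    decode_udf = F.udf(decode_proto_bytes, ArrayType(StringType()))

    # explode() turns each array element into its own output row.
    decoded_df = reader_df.select(
        "path",
        F.explode(decode_udf(F.col("content"))).alias("record_json"),
    )

    return (
        decoded_df.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)
        .start(output_path)
    )
```

With the `reader_df` from the question, this would be called as `start_delta_stream(reader_df, output_path, checkpoint_path)`. The JSON strings could be parsed into typed columns afterwards (for example with `from_json`) once the schema of the decoded message is known.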
