I am running an AWS PySpark Glue job that reads an S3 raw path where data has been loaded from Redshift, and performs some transformations on top of it. Below is my code:

        from pyspark.sql.functions import col, current_timestamp, sum, max, min

        data = spark.read.parquet(rawPath)  # complete dataset; for an incremental load this comes from a different dataframe instead, e.g. data = incrLoad (which holds the incremental records)
        . 
        . #some transformations
        .
        app8 = app7.withColumn("edl_created_at", current_timestamp())  # final line of transformation
        

        if incrementalLoad == "0":
            app8.write.mode("overwrite").parquet(transformedPath)  # loc1
            print(":::::Transformed data written for full load::::::")

        elif incrementalLoad == "1":
            app8.write.mode("append").parquet(transformedPath)  # loc1
            print(":::Incremental transformed data has been written::::::::")
            transformedData = spark.read.parquet(transformedPath)
            print("::::::Transformed data has been read back:::::")
            finalDF = transformedData.groupBy(col("mobilenumber")).agg(
                sum(col("times_app_uninstalled")).alias("times_app_uninstalled"),
                sum(col("times_uninstall_l30d")).alias("times_uninstall_l30d"),
                sum(col("times_uninstall_l60d")).alias("times_uninstall_l60d"),
                sum(col("times_uninstall_l90d")).alias("times_uninstall_l90d"),
                sum(col("times_uninstall_l180d")).alias("times_uninstall_l180d"),
                sum(col("times_uninstall_l270d")).alias("times_uninstall_l270d"),
                sum(col("times_uninstall_l365d")).alias("times_uninstall_l365d"),
                max(col("latest_uninstall_date")).alias("latest_uninstall_date"),
                min(col("first_uninstall_date")).alias("first_uninstall_date"))
            finalDF.write.mode("overwrite").parquet(transformedPath)  # loc1

Here incrementalLoad == 0 indicates a full load and 1 indicates an incremental transformed-data load. For the full load, I read the complete dataset, and app8 is the final transformed dataframe that gets written to S3. In the incremental case, I run the transformations only on the incremental raw dataset that has been loaded. As can be seen in the elif branch, I append the transformed dataset to the existing transformed path, then read that same path back, do some aggregations, and try to write the result to the same path, which gives me the error below:

No such file or directory  

This is because of Spark's lazy evaluation: the read of transformedPath is never materialized before the write starts, and when Spark executes the write in "overwrite" mode it deletes the directory first, so the job then fails trying to scan the very source files it just deleted.
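In other words, the sequence below (a minimal sketch; somePath and the "key" column are hypothetical stand-ins, not from my job) is exactly the pattern that fails: the read is lazy, so by the time the write job actually runs, overwrite has already deleted the files the plan still needs to scan:

        df = spark.read.parquet(somePath)   # lazy: only the schema is read here
        agg = df.groupBy("key").count()     # still lazy, no data touched yet
        # "overwrite" first clears somePath, then the job tries to scan the
        # (now deleted) source files to compute agg -> FileNotFoundException
        agg.write.mode("overwrite").parquet(somePath)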

To avoid this, I thought of two solutions:

  1. Storing the raw data (complete + incremental) in one place and then doing the transformations. That would work as expected, but the data size is huge (more than 1.5 million records, growing every day), so reading the full dataset back from S3 every run is not the best approach.
  2. Creating a temp directory. I cannot do this either. Suppose I have two locations and I read from one directory, say loc1, and write the transformed data to another directory, say loc2; but when my job runs again the next day it should read from loc2, which I cannot see happening in my case (a rough sketch of this staging hand-off follows below).
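For concreteness, a single-run variant of that second idea, writing the aggregate to a staging prefix and copying it back so every run keeps reading from the one transformedPath, would look roughly like the sketch below (stagingPath is a hypothetical name; finalDF and transformedPath are the same variables as in the code above):

        # Hypothetical sketch: park the result where the overwrite cannot touch it.
        stagingPath = transformedPath.rstrip("/") + "_staging"

        finalDF.write.mode("overwrite").parquet(stagingPath)

        # Re-reading from the staging prefix makes the final overwrite safe,
        # because the data being written no longer depends on transformedPath.
        spark.read.parquet(stagingPath).write.mode("overwrite").parquet(transformedPath)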

Any help is much appreciated. What would be the best approach in my case?
