I am running an AWS Glue PySpark job that reads the raw S3 path where the data has been loaded from Redshift and applies some transformations on top of it. Below is my code:
from pyspark.sql.functions import col, current_timestamp, sum, max, min

# Complete dataset. In case of an incremental run, the data is loaded from a
# different dataframe instead, i.e. data = incrLoad (which holds only the incremental records)
data = spark.read.parquet(rawPath)
.
.  # some transformations
.
app8 = app7.withColumn("edl_created_at", current_timestamp())  # final line of transformation

if incrementalLoad == str(0):
    app8.write.mode("overwrite").parquet(transformedPath)  # loc1
    print(":::::Transformed data written for full load::::::")
elif incrementalLoad == str(1):
    app8.write.mode("append").parquet(transformedPath)  # loc1
    print(":::Incremental transformed data has been written::::::::")

transformedData = spark.read.parquet(transformedPath)
print("::::::Transformed data has been read back:::::")

finalDF = transformedData.groupBy(col("mobilenumber")).agg(
    sum(col("times_app_uninstalled")).alias("times_app_uninstalled"),
    sum(col("times_uninstall_l30d")).alias("times_uninstall_l30d"),
    sum(col("times_uninstall_l60d")).alias("times_uninstall_l60d"),
    sum(col("times_uninstall_l90d")).alias("times_uninstall_l90d"),
    sum(col("times_uninstall_l180d")).alias("times_uninstall_l180d"),
    sum(col("times_uninstall_l270d")).alias("times_uninstall_l270d"),
    sum(col("times_uninstall_l365d")).alias("times_uninstall_l365d"),
    max(col("latest_uninstall_date")).alias("latest_uninstall_date"),
    min(col("first_uninstall_date")).alias("first_uninstall_date"))

finalDF.write.mode("overwrite").parquet(transformedPath)  # loc1 -- same path that was read above
Here incrementalLoad == 0 indicates a full load and 1 indicates an incremental load of transformed data. For the full load I read the complete dataset, and app8 is the final transformed dataframe that gets written to S3. In the incremental case I run the transformations only on the incremental raw dataset that was loaded and, as can be seen in the elif branch, append the transformed data to the existing transformed path. Later I read that same path back, do some aggregations, and try to write the result to the same path, which gives me the error below:
No such file or directory
This is because of Spark's lazy evaluation: when the write in overwrite mode is executed, Spark deletes the target directory first and only then tries to read from it to compute the aggregation, so the source files are already gone.
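To make the failure concrete, here is a minimal sketch of the pattern that breaks; the bucket and paths are placeholders for my actual S3 locations, and spark is the session the Glue job already provides:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

spark = SparkSession.builder.getOrCreate()  # in the Glue job this session already exists

# Lazy read: no files are actually opened yet.
transformedData = spark.read.parquet("s3://my-bucket/transformed/")  # placeholder path
aggregated = transformedData.groupBy("mobilenumber").agg(
    sum(col("times_app_uninstalled")).alias("times_app_uninstalled"))

# Overwrite to the SAME path: Spark wipes the directory before the aggregation
# has read anything from it, so the job fails with "No such file or directory".
aggregated.write.mode("overwrite").parquet("s3://my-bucket/transformed/")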
To avoid this I thought of two solutions:
- Storing the raw data (complete + incremental) in one place and then running the transformations over all of it. That would work as expected, but the data is huge (more than 1.5 million records) and grows every day, so re-reading everything from S3 is not the best approach.
- Creating a temp directory. I cannot do this either. Suppose I have two locations: I read from one directory, say loc1, and write the transformed data to another directory, say loc2. But when my job runs again the next day it should read from loc2, and I cannot see how to make that happen in my case (see the sketch after this list).
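Roughly, this is what I mean by the second option. The previousOutput marker below is hypothetical: it assumes something outside the job (a Glue job parameter, a small state file, etc.) remembers which directory the previous run wrote to, and that is exactly the part I do not see how to manage across daily runs:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

spark = SparkSession.builder.getOrCreate()  # provided by the Glue job in practice

loc1 = "s3://my-bucket/transformed_a/"  # placeholder paths
loc2 = "s3://my-bucket/transformed_b/"

previousOutput = loc1  # hypothetical: wherever yesterday's run wrote its output
readPath, writePath = (loc1, loc2) if previousOutput == loc1 else (loc2, loc1)

transformedData = spark.read.parquet(readPath)
finalDF = transformedData.groupBy("mobilenumber").agg(
    sum(col("times_app_uninstalled")).alias("times_app_uninstalled"))

# Safe because the write target differs from the path being read; the next run
# would then have to treat writePath as its read location.
finalDF.write.mode("overwrite").parquet(writePath)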
What would be the best approach in my case? Any help is much appreciated.