I have the below code where I need to reuse the flag from the previous day. So I am running the loop. I can't use the offset here as once I know the flag from the previous day then only I can use it for today. So, this loop runs 1000 times and after this whenever I try to do some operations on 'data_wt_flag1', it takes too much time, and after a while that results in a "spark driver stopped unexpectedly" error. I believe it is due to a memory issue. Is there a better way to write this logic? As I mentioned I can't use offset.
DateList=data.select("Date").distinct().orderBy('AsOfDate').rdd.flatMap(lambda x: x).collect()
Flag_list=[]
data_wt_flag1=spark.createDataFrame(data = [],schema = StructType([]))
for daily_date in DateList:
print(daily_date)
Temp_data_daily=data.filter(col("Date").isin(daily_date))
Temp_data_daily=Temp_data_daily.withColumn('lag_1',when(col("identifier").isin(Flag_list),1).otherwise(0))
Temp_data_daily=Temp_data_daily.withColumn("condition_1", when(((col('col_1')==1) & ((col('col_2')==1) | (col('lag_1')==1))),1).otherwise(0))
Flag_list=Temp_data_daily.filter(col('condition_1')==1).select("identifier").distinct().rdd.flatMap(lambda x: x).collect()
data_wt_flag1=data_wt_flag1.unionByName(Temp_data_daily,allowMissingColumns=True)
Logic of the code in Word:
If (col_1==1 and (col_2==1 or yesterday(condition_1)==1)) then today(condition_1)=1 otherwise 0.
So for the first date in the data, yesterday(condition_1) will be 0 for all identifiers so I am passing the null flag_list initially in a loop and then it will keep changing in every iteration and will be used to flag the identifier in the next iteration thus creating the lag_condition_1
Below is the sample data. I have only shown the columns which are required.
Identifier | Date | col_1 | col_2 |
---|---|---|---|
ABC | 2023-08-20 | 1 | 1 |
GHI | 2023-08-20 | 0 | 0 |
ABC | 2023-08-21 | 1 | 0 |
GHI | 2023-08-21 | 1 | 0 |
ABC | 2023-08-22 | 1 | 0 |
GHI | 2023-08-22 | 1 | 0 |
ABC | 2023-08-23 | 1 | 0 |
GHI | 2023-08-23 | 0 | 0 |
Below Table show the desired output.
Identifier | Date | col_1 | col_2 | lag_condition_1 | condition_1 |
---|---|---|---|---|---|
ABC | 2023-08-20 | 1 | 1 | 0 | 1 |
GHI | 2023-08-20 | 0 | 0 | 0 | 0 |
ABC | 2023-08-21 | 1 | 0 | 1 | 1 |
GHI | 2023-08-21 | 1 | 0 | 0 | 0 |
ABC | 2023-08-22 | 1 | 0 | 1 | 1 |
GHI | 2023-08-22 | 1 | 0 | 0 | 0 |
ABC | 2023-08-23 | 1 | 0 | 1 | 1 |
GHI | 2023-08-23 | 0 | 0 | 0 | 0 |
Here, for the first date, all lag_1 are zeros as I am passing an empty list. Then for the second date, we will have lag_1=1 for ABC and MNO as it has condition_1=1 in the previous date.