I have two dataframes in PySpark. I am trying to compare one dataframe with another to see whether each value lies within a range.
The structure is as follows: df is a large Spark dataframe holding the values to check, and dfcompare is a small dataframe that holds, for each column of df, the allowed range (the upper limit in its first row and the lower limit in its second row). The output I am looking for is a single row with, for each column, the count of values that fall outside that range.
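To make the structure concrete, here is a tiny made-up illustration (the column names and numbers are invented, and I show the bounds table as pandas since that is how my code indexes it):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# values to check (the real table is much bigger)
df = spark.createDataFrame([(5, 40), (12, 55), (7, 90)], ["colA", "colB"])

# allowed range per column: first row = upper limit, second row = lower limit
dfcompare = pd.DataFrame({"colA": [10, 1], "colB": [80, 20]})

# desired output: count of out-of-range values per column
#             colA  colB
# outofRange     1     1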
The code I currently have is below:
from pyspark.sql import functions as F

def cal_OTRC(spark_df, compare_df):
    bounds = compare_df.fillna(0).astype(int)  # bounds per column: first row = upper limit, second row = lower limit
    # one pass over spark_df: for each column, count the values above the upper limit or below the lower limit
    return spark_df.agg(*(F.count(F.when((F.col(c) > int(bounds[c].iloc[0])) | (F.col(c) < int(bounds[c].iloc[1])), c)).alias(c)
                          for c in spark_df.columns))
out_of_range_count = cal_OTRC(df, dfcompare).to_koalas().rename(index={0: 'outofRange'})
However, this works fine for small tables but is very slow for big ones. Are there any improvements that could make it run faster on big tables?