
I have two dataframes in pyspark. I am trying to compare one dataframe with the other to see if the values lie within a range.

Below are examples of the dataframes.

Dataframe df:

+-----+------+
|empid|salary|
+-----+------+
|    1|    50|
|    2|   600|
|    3|   300|
|    4|   400|
|    5|  1000|
+-----+------+

Dataframe dfcompare (row 0 holds each column's max, row 1 its min):

+-----+------+
|empid|salary|
+-----+------+
|    3|   400|
|    1|   100|
+-----+------+

Output I am looking for:

            empid  salary
outofRange      2       3

The code I currently have is below:


from pyspark.sql import functions as F

def cal_OTRC(spark_df):
    # dfcompare is a pandas-like frame; row 0 holds each column's max, row 1 its min
    compare = dfcompare.fillna(0)
    # count, per column, the values above the max or below the min
    return spark_df.agg(*(F.count(F.when((F.col(c) > compare[c].astype(int).values.tolist()[0])
                                         | (F.col(c) < compare[c].astype(int).values.tolist()[1]), c)).alias(c)
                          for c in spark_df.columns))

out_of_range_count = cal_OTRC(df).to_koalas().rename(index={0: 'outofRange'})

However, this code works for small tables, but for big tables it is very slow. Are there any improvements that can be made so it runs faster on big tables?

1 Answer


You could translate dfcompare into a dict with one min and max entry per column.

compare = {'empid': {'min': 1, 'max': 3}, 'salary': {'min': 100, 'max': 400}}
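If dfcompare is, as the question's code suggests, a small pandas-like frame whose first row holds each column's max and second row its min, one way to build this dict could be the following sketch (the row ordering and the pandas-style API are assumptions):

# Sketch: build {column: {'min': ..., 'max': ...}} from a pandas-like dfcompare;
# assumes row 0 holds each column's max and row 1 its min.
cmp = dfcompare.fillna(0).astype(int)
compare = {c: {'min': int(cmp[c].iloc[1]), 'max': int(cmp[c].iloc[0])} for c in cmp.columns}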

Then a column expression can calculate, per column, whether a value is out of range:

df = <load data>
df = df.select("*", *[F.when((F.col(c) < compare[c]['min']) | (F.col(c) > compare[c]['max']), 1)
                      .otherwise(0).alias("out_{}".format(c)) for c in compare])

This produces a 0 or 1 per row and column:

+-----+------+---------+----------+
|empid|salary|out_empid|out_salary|
+-----+------+---------+----------+
|    1|    50|        0|         1|
|    2|   600|        0|         1|
|    3|   300|        0|         0|
|    4|   400|        1|         0|
|    5|  1000|        1|         1|
+-----+------+---------+----------+

Finally, summing over the out_ columns gives:

df = df.select( *[F.sum("out_{}".format(c)) for c in compare])
+--------------+---------------+
|sum(out_empid)|sum(out_salary)|
+--------------+---------------+
|             2|              3|
+--------------+---------------+
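The two steps could also be fused into a single aggregation, so the intermediate out_ columns are never materialized; a sketch of the same logic, which may help on big tables:

# Sketch: count out-of-range values per column in one pass, without extra columns
df = df.agg(*[F.sum(F.when((F.col(c) < compare[c]['min']) | (F.col(c) > compare[c]['max']), 1)
                    .otherwise(0)).alias("out_{}".format(c)) for c in compare])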
werner
  • How do I convert the pyspark dataframe compare to a dict? I am reading data from Hive databases into a pyspark dataframe. In this scenario, how do I put it into a dictionary to run through the above solution? – abhishek gaikwad Mar 28 '21 at 15:39
  • @abhishekgaikwad what is the structure of the Hive table? – werner Mar 28 '21 at 15:45
  • It is not fixed, as I am passing the table to this logic as a parameter to get the out-of-range values for my table. So my table will be changing. Once I read any Hive table into a dataframe, how do I convert that pyspark dataframe to a dict? – abhishek gaikwad Mar 28 '21 at 15:53
  • @abhishekgaikwad you could check [this question](https://stackoverflow.com/q/49432167/2129801) and its various answers. It might or might not help you with the transformation. But without knowing the structure of the table it's difficult to say how to transform it. It will certainly require some coding (see the sketch below). – werner Mar 28 '21 at 15:57
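For example, if the Hive table is read into a small Spark DataFrame whose first row holds each column's max and second row its min (an assumption; the real table layout is unknown here), collecting it to the driver could look like:

# Sketch: dfcompare is assumed tiny (one max row, one min row), so collect() is cheap
rows = dfcompare.fillna(0).collect()
compare = {c: {'min': int(rows[1][c]), 'max': int(rows[0][c])} for c in dfcompare.columns}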