
I am trying to create multiple columns dynamically, based on a filter condition, after comparing two DataFrames with the code below.

source_df
+---+-----+-----+------+
|key|val11|val12|date  |
+---+-----+-----+------+
|abc|  1.1| john|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+

dest_df
+---+-----+-----+------+
|key|val11|val12|date  |
+---+-----+-----+------+
|abc|  2.1| jack|2-3-21|
|def|  3.0| dani|2-2-21|
+---+-----+-----+------+
from pyspark.sql import functions as F

columns = source_df.columns[1:]
joined_df = source_df.join(dest_df, 'key', 'full')

for column in columns:
    column_name = "difference_in_" + str(column)
    report = joined_df\
        .filter(source_df[column] != dest_df[column])\
        .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column],
                                          F.lit(',dst:'), dest_df[column], F.lit(']')))


The output I expect is:

#Expected
+---+-------------------+-------------------+
|key|difference_in_val11|difference_in_val12|
+---+-------------------+-------------------+
|abc|  [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-------------------+-------------------+

But I get a result with only one column:

#Actual
+---+-------------------+
|key|difference_in_val12|
+---+-------------------+
|abc|[src:john,dst:jack]|
+---+-------------------+

How can I dynamically generate multiple columns based on a filter condition?

cloud_hari

2 Answers


DataFrames are immutable objects, so each iteration needs to build on the DataFrame produced by the previous one; in your loop, every iteration starts from joined_df again, so only the last column's result survives. Something like below -

from pyspark.sql import functions as F

columns = source_df.columns[1:]
joined_df = source_df.join(dest_df, 'key', 'full')

for column in columns:
    column_name = "difference_in_" + str(column)
    if column != columns[-1]:
        report = joined_df\
            .filter(source_df[column] != dest_df[column])\
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column],
                                              F.lit(',dst:'), dest_df[column], F.lit(']')))
    else:
        report1 = report\
            .filter(source_df[column] != dest_df[column])\
            .withColumn(column_name, F.concat(F.lit('[src:'), source_df[column],
                                              F.lit(',dst:'), dest_df[column], F.lit(']')))

report1.show()
# report.show()

Output -

+---+-----+-----+-----+-----+-------------------+-------------------+
|key|val11|val12|val11|val12|difference_in_val11|difference_in_val12|
+---+-----+-----+-----+-----+-------------------+-------------------+
|abc|  1.1| john|  2.1| jack|  [src:1.1,dst:2.1]|[src:john,dst:jack]|
+---+-----+-----+-----+-----+-------------------+-------------------+
Dipanjan Mallick
  • Hi @DKNY, this works only for val11 and val12, but if the DataFrame has two or more additional columns (e.g. val3, val4), it won't work. Is there an alternative solution to create the columns dynamically? – cloud_hari Mar 14 '22 at 05:48

You could also do this with a union of the two DataFrames, collecting a list only where the collect_set size is greater than 1; this avoids joining the DataFrames:

from pyspark.sql import functions as F

cols = source_df.drop("key").columns

output = (source_df.withColumn("ref", F.lit("src:"))
          .unionByName(dest_df.withColumn("ref", F.lit("dst:")))
          .groupBy("key")
          .agg(*[F.when(F.size(F.collect_set(i)) > 1,
                        F.collect_list(F.concat("ref", i))).alias(i)
                 for i in cols])
          .dropna(subset=cols, how='all'))

output.show()

+---+------------------+--------------------+
|key|             val11|               val12|
+---+------------------+--------------------+
|abc|[src:1.1, dst:2.1]|[src:john, dst:jack]|
+---+------------------+--------------------+
anky
  • Hi @anky, agreed with the above answer, but sometimes I see a schema difference between the two DataFrames, and in that case the union throws an error since it expects both schemas to match. Is there an alternative solution to this? – cloud_hari Mar 14 '22 at 05:49
  • @cloud_hari Unfortunately we cannot compare different datatypes, so I can't think of any alternatives; maybe cast to matching datatypes before the comparison – anky Mar 14 '22 at 07:48
  • Hi @anky, ok, got it. By the way, I tried the above solution with an additional column, e.g. date, but it returns null for it. I am updating the question with this additional column: `|abc|[src:1.1, dst:2.1]|[src:john, dst:jack]| null |` – cloud_hari Mar 14 '22 at 08:15
  • @cloud_hari You get nulls because there are no differences in those columns. What would you expect in such cases? It's not recommended to drop the entire row if one column has no difference. For example, change the second line of dest_df to `|def| 3.0| danial|2-2-21|` and let me know your expected output – anky Mar 14 '22 at 09:33
  • Hi @anky, but I need to show the difference against each date respectively. – cloud_hari Mar 14 '22 at 16:33