
This is my PySpark DataFrame. Rows 2 and 3 are duplicates on every column except Date2 and Date2_chek; I would like to keep the row where Date2 is 2019-05-15 and delete the other.

+---------+----------+---------+------------+-------+----------+----------+----------+
|Regnummer|gid       |user_id  |  user_pnr  |Leasing|     Date1|     Date2|Date2_chek|
+---------+----------+---------+------------+-------+----------+----------+----------+
|   XXX295|         2| 20000000|123123123233|      0|2019-03-08|      null|2022-12-07|
|   XXX295|         2|  3232323|222222200171|      0|2019-04-27|      null|2022-12-07|
|   XXX295|         2|  3232323|222222200171|      0|2019-04-27|2019-05-15|2019-05-15|
|   XXX295|         2|  9898988|        null|      0|2015-08-15|2015-12-22|2015-12-22|
|   XXX295|         2|  1234123|        null|      0|2015-12-23|2019-03-03|2019-03-03|
|   XXX295|         2|  3434344|192131231223|      0|2019-03-04|2019-03-07|2019-03-07|
+---------+----------+---------+------------+-------+----------+----------+----------+

I am expecting the results below: while removing the duplicates, I want to prioritize rows with non-null Date2 values over rows where Date2 is null.

+---------+----------+---------+------------+-------+----------+----------+----------+
|Regnummer|gid       |user_id  |  user_pnr  |Leasing|     Date1|     Date2|Date2_chek|
+---------+----------+---------+------------+-------+----------+----------+----------+
|   XXX295|         2| 20000000|123123123233|      0|2019-03-08|      null|2022-12-07|
|   XXX295|         2|  3232323|222222200171|      0|2019-04-27|2019-05-15|2019-05-15|
|   XXX295|         2|  9898988|        null|      0|2015-08-15|2015-12-22|2015-12-22|
|   XXX295|         2|  1234123|        null|      0|2015-12-23|2019-03-03|2019-03-03|
|   XXX295|         2|  3434344|192131231223|      0|2019-03-04|2019-03-07|2019-03-07|
+---------+----------+---------+------------+-------+----------+----------+----------+
