This is my PySpark DataFrame. Rows 2 and 3 are duplicates if we exclude the column Date2_chek; I would like to keep the row whose Date2 is 2019-05-15 and delete the other one.
+---------+----------+---------+------------+-------+----------+----------+----------+
|Regnummer|gid |user_id | user_pnr |Leasing| Date1| Date2|Date2_chek|
+---------+----------+---------+------------+-------+----------+----------+----------+
| XXX295| 2| 20000000|123123123233| 0|2019-03-08| null|2022-12-07|
| XXX295| 2| 3232323|222222200171| 0|2019-04-27| null|2022-12-07|
| XXX295| 2| 3232323|222222200171| 0|2019-04-27|2019-05-15|2019-05-15|
| XXX295| 2| 9898988| null| 0|2015-08-15|2015-12-22|2015-12-22|
| XXX295| 2| 1234123| null| 0|2015-12-23|2019-03-03|2019-03-03|
| XXX295| 2| 3434344|192131231223| 0|2019-03-04|2019-03-07|2019-03-07|
+---------+----------+---------+------------+-------+----------+----------+----------+
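For reference, the DataFrame can be rebuilt from the table above with something like this (all columns are kept as Python strings/ints for simplicity; the real types may differ):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data reconstructed from the table above.
data = [
    ("XXX295", 2, 20000000, "123123123233", 0, "2019-03-08", None,         "2022-12-07"),
    ("XXX295", 2, 3232323,  "222222200171", 0, "2019-04-27", None,         "2022-12-07"),
    ("XXX295", 2, 3232323,  "222222200171", 0, "2019-04-27", "2019-05-15", "2019-05-15"),
    ("XXX295", 2, 9898988,  None,           0, "2015-08-15", "2015-12-22", "2015-12-22"),
    ("XXX295", 2, 1234123,  None,           0, "2015-12-23", "2019-03-03", "2019-03-03"),
    ("XXX295", 2, 3434344,  "192131231223", 0, "2019-03-04", "2019-03-07", "2019-03-07"),
]
cols = ["Regnummer", "gid", "user_id", "user_pnr", "Leasing", "Date1", "Date2", "Date2_chek"]
df = spark.createDataFrame(data, cols)
```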
I am expecting the results below; when removing duplicates, I want to prioritize rows with non-null values over rows with null values.
+---------+----------+---------+------------+-------+----------+----------+----------+
|Regnummer|gid |user_id | user_pnr |Leasing| Date1| Date2|Date2_chek|
+---------+----------+---------+------------+-------+----------+----------+----------+
| XXX295| 2| 20000000|123123123233| 0|2019-03-08| null|2022-12-07|
| XXX295| 2| 3232323|222222200171| 0|2019-04-27|2019-05-15|2019-05-15|
| XXX295| 2| 9898988| null| 0|2015-08-15|2015-12-22|2015-12-22|
| XXX295| 2| 1234123| null| 0|2015-12-23|2019-03-03|2019-03-03|
| XXX295| 2| 3434344|192131231223| 0|2019-03-04|2019-03-07|2019-03-07|
+---------+----------+---------+------------+-------+----------+----------+----------+
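Is a window function along these lines the right direction? A minimal sketch of what I have in mind, assuming the duplicate key is every column except Date2 and Date2_chek, and relying on desc_nulls_last to sort the non-null Date2 first within each group:

```python
from pyspark.sql import Window
import pyspark.sql.functions as F

# Partition by every column except Date2/Date2_chek so that rows 2 and 3
# fall into the same group, then sort non-null Date2 values first.
key_cols = ["Regnummer", "gid", "user_id", "user_pnr", "Leasing", "Date1"]
w = Window.partitionBy(*key_cols).orderBy(F.col("Date2").desc_nulls_last())

# Keep only the first row of each group, i.e. the one with a non-null Date2
# when the group contains both a null and a non-null value.
deduped = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)
deduped.show()
```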