I have a pyspark dataframe like this:
+-----+---+-----+
| id| name|state|
+-----+---+-----+
|111| null| CT|
|222|name1| CT|
|222|name2| CT|
|333|name3| CT|
|333|name4| CT|
|333| null| CT|
+---+-----+-----+
For a given ID, I would like to keep that record even though column "name" is null if its a ID is not repeated, but if the ID is repeated, then I would like to check on name column and make sure it does not contain duplicates within that ID, and also remove if "name" is null ONLY for repeated IDs. Below is the desired output:
+-----+---+-----+
| id| name|state|
+-----+---+-----+
|111| null| CT|
|222|name1| CT|
|222|name2| CT|
|333|name3| CT|
|333|name4| CT|
+---+-----+-----+
How can I achieve this in PySpark?