I have a table where each row represents a person. This table contains a lot of duplicates that I want to get rid of. I want to deduplicate based on `name` and `age` only. However, the information in the other columns can be spread across different rows for the same employee. For example:
name | age | height | eye_color | weight |
---|---|---|---|---|
John | 32 | null | green | null |
John | 32 | null | null | 75 |
John | 32 | 180 | null | null |
John | 32 | null | null | 74 |
In this example, the expected output would be:
name | age | height | eye_color | weight |
---|---|---|---|---|
John | 32 | 180 | green | 75 |
Note that it doesn't matter whether the weight ends up as 75 or 74; the order is not important for my use case. I just want to fill in as many `null` values as possible while removing duplicates.
There is a uniqueness constraint on some columns, so simply updating all rows in a group with the desired values and then keeping one row per group is unfortunately not an option, i.e. updating the table to look like this:
name | age | height | eye_color | weight |
---|---|---|---|---|
John | 32 | 180 | green | 75 |
John | 32 | 180 | green | 75 |
John | 32 | 180 | green | 75 |
John | 32 | 180 | green | 75 |
before deduplicating is not possible.
If `age` or `name` is `null` for an employee, that row shouldn't be deduplicated at all.
A similar question on Stack Overflow is Remove duplicates with less null values, but the answers there only keep the row with the fewest `null` values, so it doesn't really solve my problem. However, maybe something can be done with aggregation, as shown in PostgreSQL: get first non null value per group, but I couldn't get anything to work so far.
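For reference, the kind of aggregation I have in mind looks roughly like the sketch below. It uses SQLite via Python only so the example is self-contained and runnable; the table and column names come from the example above, and `MAX()` skips `null` values the same way in PostgreSQL. This only produces the merged result set; it does not yet handle deleting the old rows under the uniqueness constraint, which is the part I'm stuck on.

```python
import sqlite3

# In-memory database with the example data from the question.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        name TEXT, age INTEGER, height INTEGER,
        eye_color TEXT, weight INTEGER
    )
""")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?, ?)",
    [
        ("John", 32, None, "green", None),
        ("John", 32, None, None, 75),
        ("John", 32, 180, None, None),
        ("John", 32, None, None, 74),
    ],
)

# MAX() ignores NULLs, so grouping by (name, age) collapses each group
# into one row that keeps a non-null value per column when one exists
# (75 wins over 74 here, which is fine for my use case).
# Rows where name or age is NULL are passed through untouched.
rows = conn.execute("""
    SELECT name, age, MAX(height), MAX(eye_color), MAX(weight)
    FROM person
    WHERE name IS NOT NULL AND age IS NOT NULL
    GROUP BY name, age
    UNION ALL
    SELECT * FROM person
    WHERE name IS NULL OR age IS NULL
""").fetchall()

print(rows)  # [('John', 32, 180, 'green', 75)]
```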
Any idea?
Thanks.