0

I have a table with each row representing a person. In this table there are a lot of duplicates that I want to get rid of. I want to deduplicate based on name and age only. However, the information in columns can be spread between different rows for the same employee. For example:

name age height eye_color weight
John 32 null green null
John 32 null null 75
John 32 180 null null
John 32 null null 74

In this example, the expected output would be:

name age height eye_color weight
John 32 180 green 75

Note that it doesn't matter if the weight is 75 or 74, the order is not important for my use case, I just want to fill as much null as possible, while removing duplicates.

There is a unicity constraint on some columns, so simply updating all rows with the desired values and then keeping one row per group is not an option unfortunately, ie updating the table to look like this:

name age height eye_color weight
John 32 180 green 75
John 32 180 green 75
John 32 180 green 75
John 32 180 green 75

before deduplicating is not possible.

If age or name is null for an employee, it shouldn't be deduplicated at all.

A similar question on stackoverflow was Remove duplicates with less null values but they only keep the row with the least amount of null so it's not really solving my problem.

However, maybe there is something to do with some aggregation as shown in PostgreSQL: get first non null value per group but I couldn't get anything to work for the moment.

Any idea?

Thanks.

victorfink
  • 343
  • 4
  • 17

1 Answers1

1

Simple group by combined with max as an aggregation function should do the trick

SELECT
 name,
 age,
 max(height),
 max(eye_color),
 max(weight)
FROM
 employees
GROUP BY
 name, age
WHERE
 name is not null and age is not null;

Here we always get biggest value so any other than null should be present if available.

For deduplicate data in your table, one way could be insert desired result into a temporary table, delete old data from employee table and insert data back from temporary table:

create table temporary_employee (
 name varchar,
 age integer,
 height integer,
 eye_color varchar,
 weight integer
);

insert into temporary_employee (name, age, height, eye_color, weight) 
(
SELECT
 name,
 age,
 max(height),
 max(eye_color),
 max(weight)
FROM
 employees
GROUP BY
 name, age
WHERE
 name is not null and age is not null
);

DELETE FROM employees;

INSERT INTO employees (name, age, height, eye_color, weight) (
SELECT name, age, height, eye_color, weight FROM temporary_employee);

DROP table temporary_employee;

In some database engines there is insert command with has an argument to overwrite all data, but I didn't find such paramter in PostgreSQL.

Another option could be add a temporary column into your table, insert wanted data to the table and delete old data.

ALTER TABLE employee ADD COLUMN newdata bool;

--Insert wanted data to the employee table and mark it as newdata
INSERT INTO employees (name, age, height, eye_color, weight, newdata) 
(
SELECT
 name,
 age,
 max(height),
 max(eye_color),
 max(weight),
 't' as newdata
FROM
 employees
GROUP BY
 name, age, newdata
WHERE
 name is not null and age is not null
);

-- Delete old data from the table
DELETE FROM employees WHERE newdata != 't';

---Remove temporary column
ALTER TABLE employees DROP COLUMN newdata;
ex4
  • 2,289
  • 1
  • 14
  • 21
  • Thank you very much, it seems to work! But how can I make the change to my table with this command (I want my table to be like the result of this request)? I also forgot something in my question: how would I ignore `name` and `age` that are `null`? – victorfink May 12 '22 at 14:37
  • I updated answer how you can ignore rows where age or name is null. – ex4 May 12 '22 at 14:49
  • How to update your table according to the result is harder. Have to think about it a bit. Or someone smarter than me can answer that :) Would it be enough for you to create view or new table with this dedublicated data? – ex4 May 12 '22 at 14:52
  • Thanks. Unfortunately no it wouldn't work for me. I need to have this table as clean as possible (and not a view of it, nor a new one). – victorfink May 12 '22 at 15:09
  • But can you empty the table temporarily like in my example above? Or are you able to add a temporary column into `employee` table? – ex4 May 12 '22 at 15:31
  • I don't feel comfortable emptying the table, but adding a temporary column might do the trick for me. I will test the solution and accept the answer if everything's good :) – victorfink May 13 '22 at 06:56
  • Thanks, I'm currently trying to do use the temporary table, however, how can I only delete the rows that are duplicated? You used `DELETE FROM employees;` but it will delete the whole table right? I was thinking of something like `DELETE FROM employees WHERE ...` and then put all rows that are in a `GROUP BY` of at least 2 elements (sorry I'm not good in SQL so I'm struggling) (don't bother with the temporary column as it won't work with my unicity constraint on some columns if I'm not mistaken) – victorfink May 13 '22 at 15:30