6

I have a very large Redshift database. The records do not have unique keys or ids. I'd like to remove all of the duplicates with the most efficient query possible.

Other Stack Overflow questions about typical SQL databases suggested copying the table and skipping the duplicates during that process, but that seems suboptimal for a giant Redshift database.

Any better solutions out there?

johncorser
  • one of the answers to http://stackoverflow.com/questions/15170701/amazon-redshift-keys-are-not-enforced-how-to-prevent-duplicate-data shows how to identify the duplicates, delete them, then insert one instance of each back to the main table. – mc110 Jul 01 '14 at 14:13
  • Unfortunately this isn't exactly the same, he had primary keys that weren't enforced, whereas I have no primary key in this table. I have to remove duplicate records based on the fact that the values for all of the attributes for each record are identical. This probably wouldn't be very difficult for someone with sharper sql skills than I have. – johncorser Jul 01 '14 at 14:22
  • can't you do something similar to disq's answer where you group by all rows to find the duplicates and insert those into a temp table (you will only get one instance of each due to grouping), delete matching rows in the main table, then insert back from the temp table? – mc110 Jul 01 '14 at 14:31
  • 1
    @mc110 - Due to how Redshift manages its data blocks that would likely both dirty the table and be relatively slow. – MatBailie Jul 01 '14 at 14:33

1 Answer

12

One thing to bear in mind with Redshift is that deleted records are only "soft" deleted until a VACUUM is run.
- They remain in the table, marked as to-be-ignored
- They're only physically removed by a VACUUM

However, a VACUUM on a large table with deletes scattered through it is very often slower than a "Deep Copy". (Duplicate the data into another table, using GROUP BY or DISTINCT to eliminate the duplicates, then either TRUNCATE the original table and re-insert the data, or drop the original table and rename the new one.)

This is the general rationale for why you may actually benefit from what feels like the "slow" process.


Also, if two rows really are identical then there is, by definition, no way to uniquely identify one of them. That being the case, you can't differentiate the copy to be kept from the copies to be deleted.

One "trick" in other RDBMS is to use ROW_NUMBER() inside of a Common Table Expression and then delete from that CTE. (With the CTE creating the unique identifiers, allowing you to identify individual rows to be kept or deleted.) Unfortunately Redshift doesn't currently support deleting from a CTE.

Until that changes, a Deep Copy (copying to a separate table while using GROUP BY or DISTINCT) is your only option.

Even then, the Deep Copy option may still be the better fit for Redshift, even if deleting from a CTE does eventually become possible.


EDIT :

Correction:

If any row in a Redshift Table has been deleted, any subsequent VACUUM will reprocess the entire table (regardless of where the deleted rows are, or how many deleted rows there are).

(It's more sophisticated when VACUUMing following an INSERT, but downright ugly following a DELETE.)

I've also noticed that a Deep Copy uses less disk space than a VACUUM. (Which only came to my attention when we ran out of disk space...)
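
For illustration, the VACUUM in question is just the plain command below (a minimal sketch, assuming the table is named blah as in the code example further down); DELETE ONLY skips the re-sort if all you need is to reclaim the deleted rows:

-- Full VACUUM: re-sorts the table and reclaims space from deleted rows.
-- After a large DELETE this reprocesses the entire table.
VACUUM blah;

-- Reclaim deleted rows only, without the re-sort phase.
VACUUM DELETE ONLY blah;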


EDIT :

Code Example:

CREATE TABLE blah_temp (
  <Exactly the same DDL as the original table, especially Distribution and Sort keys>
)
;

INSERT INTO
  blah_temp
SELECT DISTINCT
  *
FROM
  blah
;

DROP TABLE blah;

ALTER TABLE blah_temp RENAME TO blah;

Or...

CREATE TABLE blah_temp (
  <Exactly the same DDL as the original table, especially Distribution and Sort keys>
)
;

INSERT INTO
  blah_temp
SELECT
  *
FROM
  blah
GROUP BY
  a, b, c, d, e, f, g, etc
;

TRUNCATE TABLE blah;

INSERT INTO
  blah
SELECT
  *
FROM
  blah_temp
;

DROP TABLE blah_temp;
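
Either way, a quick sanity check afterwards (using the same hypothetical column list as above) should return no rows once the duplicates are gone:

SELECT
  a, b, c, d, e, f, g,  -- list every column of the table here
  COUNT(*) AS copies
FROM
  blah
GROUP BY
  a, b, c, d, e, f, g
HAVING
  COUNT(*) > 1
;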


Related Link: https://docs.aws.amazon.com/redshift/latest/dg/performing-a-deep-copy.html

MatBailie
  • Would you mind editing your post to include a generic code snippet/example that accomplishes this (using group by or distinct)? – johncorser Jul 01 '14 at 14:46
  • 2
    You can do it without DDL. You can also re-define sortkeys and distkey while you're at it: `create table tableName_unique sortkey (col1,col2) distkey (col1) as select distinct * from tableName;` – Eugene Kulabuhov Sep 07 '16 at 22:07
  • I have a table in redshift with 87 rows - if I try any of the above operations I get `Invalid operation: Intermediate result row exceeds database block size;` complaint. What should I do in this situation? – Ogaday Sep 09 '16 at 16:00
  • 1
    @Ogaday - Raise a support ticket - that appears to be a bug. – MatBailie Sep 10 '16 at 12:04
  • @MatBailie Oh interesting - it's not due to memory constraints? Has a ticket already been filed? – Ogaday Sep 12 '16 at 11:25
  • Thanks, I made a new table and reduced the width of each column (used to be 65000 character varchars) and now I can `select distinct *` – Ogaday Sep 16 '16 at 07:45
  • Looks like AWS moved the documentation for deep copy: https://docs.aws.amazon.com/redshift/latest/dg/performing-a-deep-copy.html – nofinator Feb 23 '17 at 18:31