I have a table in CockroachDB. I populated the data before applying the constraint to set the primary key, and because some INSERT statements failed during the data-loading phase, some rows were mistakenly loaded into the table more than once.

The constraint I want to apply is:

CREATE UNIQUE INDEX ON "MyDB"."MyTable" ("Row_ID");

But as the duplicate data is already loaded into the table, I get the following error:

pq: multiple primary keys for table "MyTable" are not allowed

I checked whether there actually are any duplicate rows with the following query:

SELECT  
    Row_ID,
    COUNT(Row_ID) as id
FROM    
    MyTable
GROUP BY 
    Row_ID
HAVING  
    COUNT(Row_ID) > 1;

And the query showed there are some duplicate rows.

What is the best way to remove the duplicate rows in CockroachDB?

Cyrus

2 Answers

If the duplicate rows are an exact match, you can create a new table containing only the distinct records.

Then delete all rows from the old table and repopulate it from the table created in the last step.

To create the table:

create table MyWorkData as select distinct * from MyTable;

To empty MyTable:

delete from MyTable;

To repopulate MyTable (thank you @benesch for correcting the statement):

INSERT INTO MyTable SELECT * FROM MyWorkData;

Finally, delete the working table.
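
The final step isn't shown above; assuming the working table is no longer needed, dropping it would look like:

```sql
-- Remove the temporary working table once MyTable has been repopulated.
DROP TABLE MyWorkData;
```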

Juan
  • Thanks, I tried it. Each time I run the query create table MyWorkData as select distinct * from MyTable;, the process gets killed. I don't know why; the server doesn't give any other reason, and based on the output of "free -m" I don't think it is a memory issue. – Cyrus Aug 14 '17 at 20:16
  • You can also create the table and populate it with a 'select into' in two steps. – Juan Aug 14 '17 at 20:26
  • @Cyrus, you can get a more detailed failure report in the log files. Please do file an issue or come chat with us on Gitter! http://gitter.im/cockroachdb/cockroach – benesch Aug 15 '17 at 13:59
  • @Juan, CockroachDB doesn't support `SELECT... INTO`. You'd need to run `INSERT INTO MyTable SELECT * FROM MyWorkData` instead. – benesch Aug 15 '17 at 14:04
  • @benesch thank you for the correction. I've updated the answer. – Juan Aug 15 '17 at 14:08
If you don't care about which duplicated row you keep, run:

DELETE FROM MyTable WHERE rowid NOT IN (
    SELECT MIN(rowid) FROM MyTable GROUP BY Row_ID
);

For any duplicates, that query will delete all but the row that was roughly created first.†

Note that rowid is not the same as your Row_ID column. rowid is an internal CockroachDB column that is automatically created for tables with no primary key, and is guaranteed to be unique for each row in the table.
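
For reference, a table declared without a primary key behaves roughly as if it had been created like this. This is a sketch of the documented behavior, not your actual schema; the column val is a hypothetical stand-in for the table's real columns:

```sql
-- Sketch: what CockroachDB does implicitly when a table has no primary key.
CREATE TABLE NoPkExample (
    val STRING,
    rowid INT NOT NULL DEFAULT unique_rowid(),
    PRIMARY KEY (rowid)
);
```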

† rowid stores <timestamp><node-id>, so it sorts roughly according to insertion time, but rows inserted near-simultaneously by nodes with skewed clocks may not sort in insertion order.
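
After the DELETE, re-running the duplicate check from the question should return no rows, and the unique index from the question can then be created. A sketch:

```sql
-- Should now return zero rows.
SELECT Row_ID, COUNT(*) FROM MyTable GROUP BY Row_ID HAVING COUNT(*) > 1;

-- With the duplicates gone, the original constraint succeeds.
CREATE UNIQUE INDEX ON "MyDB"."MyTable" ("Row_ID");
```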

benesch
  • So I believe the general recommendation would be to use a subquery wherever a join is needed with v1 of CockroachDB? – fortm Aug 24 '17 at 11:45