I have a table in CockroachDB. I populated the data before applying the constraint to set the primary key, and because some INSERT statements failed during the data-loading phase, some rows were mistakenly loaded into the table more than once.

The constraint I want to apply is:

CREATE UNIQUE INDEX ON "MyDB"."MyTable" ("Row_ID");

But as the duplicate data is already loaded into the table, I get the following error:

pq: multiple primary keys for table "MyTable" are not allowed

I checked whether there actually are any duplicate rows with the following query:

SELECT  
    Row_ID,
    COUNT(Row_ID) as id
FROM    
    MyTable
GROUP BY 
    Row_ID
HAVING  
    COUNT(Row_ID) > 1;

And the query showed there are some duplicate rows.

What is the best way to remove the duplicate rows in CockroachDB?

Cyrus

2 Answers

If the duplicate rows are an exact match, you can create a new table containing only the distinct records.

Then delete all rows from the old table and repopulate it from the table created in the last step.

To create the table:

create table MyWorkData as select distinct * from MyTable;

To empty MyTable:

delete from MyTable;

To repopulate MyTable (thank you @benesch for correcting the statement):

INSERT INTO MyTable SELECT * FROM MyWorkData;

Finally, delete the working table.
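
The final step isn't shown above; assuming the working table is no longer needed, dropping it would look like:

```sql
-- Remove the temporary working table once MyTable has been repopulated.
DROP TABLE MyWorkData;
```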

Juan
  • Thanks, I tried it. Each time I run the query create table MyWorkData as select distinct * from MyTable;, the process gets killed. I don't know why; the server doesn't give any other reason, and based on the output of "free -m" I don't think it is a memory issue. – Cyrus Aug 14 '17 at 20:16
  • You can also create the table and populate it with a 'select into' in two steps. – Juan Aug 14 '17 at 20:26
  • @Cyrus, you can get a more detailed failure report in the log files. Please do file an issue or come chat with us on Gitter! http://gitter.im/cockroachdb/cockroach – benesch Aug 15 '17 at 13:59
  • @Juan, CockroachDB doesn't support `SELECT... INTO`. You'd need to run `INSERT INTO MyTable SELECT * FROM MyWorkData` instead. – benesch Aug 15 '17 at 14:04
  • @benesch thank you for the correction. I've updated the answer. – Juan Aug 15 '17 at 14:08
If you don't care about which duplicated row you keep, run:

DELETE FROM MyTable WHERE rowid NOT IN (
    SELECT MIN(rowid) FROM MyTable GROUP BY Row_ID
);

For any duplicates, that query will delete all but the row that was roughly created first.†

Note that rowid is not the same as your Row_ID column. rowid is an internal CockroachDB column that is automatically created for tables with no primary key, and is guaranteed to be unique for each row in the table.
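
For reference, a table declared without a primary key behaves roughly as if it had been created like this. This is a sketch of the documented behavior, not your actual schema; the column val is a hypothetical stand-in for the table's real columns:

```sql
-- Sketch: what CockroachDB does implicitly when a table has no primary key.
CREATE TABLE NoPkExample (
    val STRING,
    rowid INT NOT NULL DEFAULT unique_rowid(),
    PRIMARY KEY (rowid)
);
```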

† rowid stores <timestamp><node-id>, so it sorts roughly according to insertion time, but rows inserted near-simultaneously by nodes with skewed clocks may not sort in insertion order.
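
After the DELETE, re-running the duplicate check from the question should return no rows, and the unique index from the question can then be created. A sketch:

```sql
-- Should now return zero rows.
SELECT Row_ID, COUNT(*) FROM MyTable GROUP BY Row_ID HAVING COUNT(*) > 1;

-- With the duplicates gone, the original constraint succeeds.
CREATE UNIQUE INDEX ON "MyDB"."MyTable" ("Row_ID");
```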

benesch
  • So I believe the general recommendation would be to use a subquery wherever a join is needed with v1 of CockroachDB? – fortm Aug 24 '17 at 11:45