Why doesn't Kudu fail when inserting duplicate primary key?

Question

From Impala documentation:

In most relational databases, if you try to insert a row that has already been inserted, the insertion will fail because the primary key would be duplicated. Impala, however, will not fail the query. Instead, it will generate a warning, but continue to execute the remainder of the insert statement.

Why does Impala/Kudu act like that? Please note that the insert won't update the value (there is an upsert command for that), it will just fail silently.

Is there a way to be aware that I'm inserting a duplicate primary key?

I'm not sure "why" but I don't see it as a problem. If you anticipate duplicate unique primary keys then either choose new primary keys or use an UPSERT. Other than that, react to warnings so that you can take that next step (change the keys or switch to upsert). — JNevill, May 31 '19 at 15:21

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

This is because kudu itself will not throw any exception (only raise warning) and hence impala will (rightly) assume the task succeeded.

As to why Kudu chose to do it this way we can only speculate.

This is just my opinion. Kudu (and Impala) is designed for analytical work-load instead of transactional work-load. Which usually involves batch processing of large amounts of data. It would be undesirable to for the application to fail because of small number of records with duplicate keys.

Thus default behaviour inserts all records with non-duplicate keys and skip all the duplicate keys. This can be changed by using upsert which replaces replaces duplicates.

According to Imapala documentation

If an INSERT statement attempts to insert a row with the same values for the primary key columns as an existing row, that row is discarded and the insert operation continues. When rows are discarded due to duplicate primary keys, the statement finishes with a warning, not an error. (This is a change from early releases of Kudu where the default was to return in error in such cases, and the syntax INSERT IGNORE was required to make the statement succeed. The IGNORE clause is no longer part of the INSERT syntax.)

For situations where you prefer to replace rows with duplicate primary key values, rather than discarding the new data, you can use the UPSERT statement instead of INSERT. UPSERT inserts rows that are entirely new, and for rows that match an existing primary key in the table, the non-primary-key columns are updated to reflect the values in the "upserted" data.

If you really want to store new rows, not replace existing ones, but cannot do so because of the primary key uniqueness constraint, consider recreating the table with additional columns included in the primary key.

They chose to do this because if an insert statement has multiple values clauses (inserting multiple lines in one statement), Kudu processes the clauses on multiple threads. If one of them fails, picking the ones that are already done and rolling them back would be a challenge they did not wish to take on. (Sorry but can't remember where I read this.) — zslim, Oct 01 '21 at 11:05

Why doesn't Kudu fail when inserting duplicate primary key?

1 Answers1