0

The ACID properties in Hive allow to delete rows from a table using the following syntax :

DELETE FROM table 
WHERE id IN (SELECT id FROM raw_table)

But what's the best solution to delete rows when the primary_key is composed of several columns ?

I have tried the following with an EXISTS :

DELETE FROM table 
WHERE EXISTS (SELECT id1, id2 FROM raw_table 
              WHERE raw_table.id1 = table.id1 AND raw_table.id2 = table.id2) 

Or the following (concatenating all the columns, not sure if this is valid) :

DELETE FROM table 
WHERE CONCAT(id1, id2) IN (SELECT CONCAT(id1, id2) FROM raw_table)

Do you have any advice on what is the best solution ?

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Jean.W
  • 11
  • 4

1 Answers1

0

The solution using exist is valid. Also, your solution concatenating values is valid but depending on the values that you could find you could be deleting data that you don't want, For example

id1: 01
id2: 1

you will be deleting rows with 011, but it also match with

id1: 0
id2: 11

which is not expected. I would suggest add a delimiter between the ids.

DELETE FROM table 
WHERE CONCAT(id1, id2) IN (SELECT CONCAT(CONCAT(id1,"-"), id2) FROM raw_table)

Both solutions should perform only 1 Job with mapper and reduce phase, therefore the execution plan and performance should be almost the same

Regards!

hlagos
  • 7,690
  • 3
  • 23
  • 41
  • Thanks a lot for your answer. I actually went with the first approach using the EXISTS, do you think it will be more time consuming ? – Jean.W Aug 08 '17 at 11:46