I have a scenario: a utility regularly pushes a data file from a source system into HDFS at a defined location (the file has the same name every time; the old file is deleted first and then the updated/latest file is loaded). That data file then needs to be pushed further into a Hive table.
I am looking for the best approach to achieve this:
1) Create an external table in Hive pointing to the HDFS location where the utility pushes the data. The Hive table data will then refresh by itself as soon as the utility replaces the file (see the DDL sketch below).
2) Create an internal/managed table in Hive with the data kept in the Hive warehouse. A shell script configured as a cron job then compares, at regular intervals, the timestamp of the file in the Hive warehouse with the file at the location where the utility pushes/updates the data; if the timestamps differ, the script deletes the file from the warehouse and copies the latest file into it (see the shell-script sketch below).
3) Every time, create a new Hive table with a different name, load the latest data file into it, drop the old table and finally rename the new table to the old table name (see the HQL sketch below).
4) Any other approach?
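
For approach 1, this is roughly the DDL I have in mind; the database/table name, columns, delimiter and HDFS path below are only placeholders, and I am assuming the utility drops the file into a directory of its own, since LOCATION must point to a directory:

hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS my_db.utility_data (
  col1 STRING,
  col2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/utility/landing';
"

As I understand it, every query would then read whatever file is currently sitting in that directory.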
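
For approach 2, a rough sketch of the shell script I would put in cron; paths and the schedule are placeholders, and I compare modification times with "newer than" rather than "not equal", since the warehouse copy will always get its own timestamp at copy time:

#!/bin/bash
# Refresh the warehouse copy when the utility's file is newer (placeholder paths).
# Example crontab entry (every 15 minutes is just an assumption):
#   */15 * * * * /path/to/refresh_warehouse_copy.sh
SRC=/data/utility/landing/datafile.csv
DEST=/user/hive/warehouse/my_db.db/utility_data/datafile.csv

src_ts=$(hdfs dfs -stat %Y "$SRC")
dest_ts=$(hdfs dfs -stat %Y "$DEST" 2>/dev/null || echo 0)

if [ "$src_ts" -gt "$dest_ts" ]; then
  hdfs dfs -rm -f "$DEST"          # delete the old copy from the warehouse
  hdfs dfs -cp -f "$SRC" "$DEST"   # copy the latest file into the table directory
fi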
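
For approach 3, roughly this sequence of HQL (names and paths are again placeholders); note that LOAD DATA INPATH moves the file out of the landing location rather than copying it:

hive -e "
CREATE TABLE my_db.utility_data_new (col1 STRING, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/data/utility/landing/datafile.csv'
  INTO TABLE my_db.utility_data_new;

DROP TABLE IF EXISTS my_db.utility_data;
ALTER TABLE my_db.utility_data_new RENAME TO my_db.utility_data;
"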
I need expert comments on which approach seems to be the best. I am also worried about the scenario where I am performing a data file change using any of these approaches while somebody is accessing the table at the same time. Is there any way to avoid such situations?
Note: the data load into HDFS/Hive is not happening through Sqoop as an incremental load. Also, every updated file contains the whole of the old/updated data along with the new data, so the size of this file may grow over time.
Regards,
Bhupesh