I have a scenario: a utility regularly pushes a data file from a source system into HDFS at a defined location (the file has the same name every time; the old file is deleted first and then the updated/latest file is loaded). That data file then needs to be pushed further into a Hive table.
I am looking for the best approach to achieve this:
1) Create an external table in Hive pointing to the HDFS location where the utility pushes the data. The Hive table data will then refresh by itself as soon as the utility replaces the file (see the DDL sketch below).
2) Create an internal/managed table in Hive with the data kept in the Hive warehouse. A shell script configured as a cron job then compares, at regular intervals, the timestamp of the file in the Hive warehouse with the file at the location where the utility pushes/updates the data; if the timestamps differ, the script deletes the file from the warehouse and copies the latest file into it (see the shell-script sketch below).
3) Every time, create a new Hive table with a different name, load the latest data file into it, drop the old table and finally rename the new table to the old table name (see the HQL sketch below).
4) Any other approach?
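
For approach 1, this is roughly the DDL I have in mind; the database/table name, columns, delimiter and HDFS path below are only placeholders, and I am assuming the utility drops the file into a directory of its own, since LOCATION must point to a directory:

hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS my_db.utility_data (
  col1 STRING,
  col2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/utility/landing';
"

As I understand it, every query would then read whatever file is currently sitting in that directory.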
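
For approach 2, a rough sketch of the shell script I would put in cron; paths and the schedule are placeholders, and I compare modification times with "newer than" rather than "not equal", since the warehouse copy will always get its own timestamp at copy time:

#!/bin/bash
# Refresh the warehouse copy when the utility's file is newer (placeholder paths).
# Example crontab entry (every 15 minutes is just an assumption):
#   */15 * * * * /path/to/refresh_warehouse_copy.sh
SRC=/data/utility/landing/datafile.csv
DEST=/user/hive/warehouse/my_db.db/utility_data/datafile.csv

src_ts=$(hdfs dfs -stat %Y "$SRC")
dest_ts=$(hdfs dfs -stat %Y "$DEST" 2>/dev/null || echo 0)

if [ "$src_ts" -gt "$dest_ts" ]; then
  hdfs dfs -rm -f "$DEST"          # delete the old copy from the warehouse
  hdfs dfs -cp -f "$SRC" "$DEST"   # copy the latest file into the table directory
fi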
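
For approach 3, roughly this sequence of HQL (names and paths are again placeholders); note that LOAD DATA INPATH moves the file out of the landing location rather than copying it:

hive -e "
CREATE TABLE my_db.utility_data_new (col1 STRING, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/data/utility/landing/datafile.csv'
  INTO TABLE my_db.utility_data_new;

DROP TABLE IF EXISTS my_db.utility_data;
ALTER TABLE my_db.utility_data_new RENAME TO my_db.utility_data;
"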
I need expert comments on which approach seems to be the best. I am also worried about the scenario where I am performing a data file change using any of these approaches while somebody is accessing the table at the same time. Is there any way to avoid such situations?
Note: the data load into HDFS/Hive is not happening through Sqoop as an incremental load. Also, every updated file contains the whole of the old/updated data along with the new data, so the size of this file may grow over time.
Regards,
Bhupesh