
I have an external table, partitioned on two columns (date and ref_src), stored in HDFS:

/apps/var/db_name/table_name/date=1901/ref_src=src_a
/apps/var/db_name/table_name/date=1901/ref_src=src_b
/apps/var/db_name/table_name/date=1902/ref_src=src_a
/apps/var/db_name/table_name/date=1902/ref_src=src_b
/apps/var/db_name/table_name/date=1903/ref_src=src_a
/apps/var/db_name/table_name/date=1903/ref_src=src_b
/apps/var/db_name/table_name/date=1903/ref_src=src_c

Fields in table_name:

date|ref_src|col_a|col_b|

Now, based on a new requirement, I have to create two new columns, col_c and col_d. I plan to run a program that computes their values and overwrites the same table with the result. My worry is: what if something goes wrong while my program is running, so that the table data gets corrupted or deleted, or my program itself has a bug?

So, my main question is: how do I back up the table's HDFS data and its partition details? Is taking a copy of the complete directory enough as a backup, or is there anything else I need to take care of? My concern is mainly about prod data.

Rocky1989

1 Answer


You can take a copy of the folders to be used as a backup. If your data is stored in (for instance) Parquet format, you can always restore those folders and create another external table on top of them. That external table can also be used to feed the original table while assigning default values to the new fields, doing something like:

INSERT INTO table_with_new_cols SELECT r.*, 'col_c value' AS col_c, 'col_d value' AS col_d FROM restored_external_table r

That query works (avoiding duplicates) provided that table_with_new_cols is empty.
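The backup copy itself can be as simple as duplicating the table's root folder in HDFS. A minimal sketch, using the paths from the question and a hypothetical `table_name_bak` target (the backup location is an assumption):

```shell
# Recursive copy of the table's root folder as a backup
# (the _bak path is a hypothetical name, not from the question).
hdfs dfs -cp /apps/var/db_name/table_name /apps/var/db_name/table_name_bak

# For large production tables, distcp runs the same copy as a
# distributed MapReduce job instead of a single-client copy:
hadoop distcp /apps/var/db_name/table_name /apps/var/db_name/table_name_bak
```

Because the copy preserves the `date=.../ref_src=...` subfolder layout, an external table created on top of the backup location will pick up the partitions after an `MSCK REPAIR TABLE` on it.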

Oscar Lopez M.
  • I didn't get what exactly you are saying. – Rocky1989 Jun 30 '20 at 14:36
  • That's no problem @SureshGudimetla. The situation I explained is: 1) you can make a copy of the current folders and use them as backup. 2) If you need to add new columns you can just add those two columns with the ALTER TABLE ... ADD COLUMNS command. 3) If you need to recreate the table, just create it with the schema you want, and to reinsert data there you can use the insert I mentioned in my answer. Please let me know if it's clearer now. If not, I can explain in more detail. Thank you – Oscar Lopez M. Jul 01 '20 at 17:04
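  • Point 2 above, sketched in Hive (the column types are assumptions, since the question doesn't state them):

    ```sql
    -- Appends col_c and col_d to the end of the non-partition column list;
    -- existing rows read NULL for them until the data is rewritten.
    ALTER TABLE table_name ADD COLUMNS (col_c STRING, col_d STRING);
    ```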
  • Thanks for your clarification, **1)** I understood your point and it worked for my other scenario of creating new partitioned columns. Thank you. **2)** But one point I have to update is: the new columns I'm adding to the table are not partition columns. So, what I have thought is that on the same partitioned data, I will overwrite my new data with the new columns and do an MSCK repair. Will that help? Is there any issue you see with this point? Added my 3rd point in another comment. – Rocky1989 Jul 01 '20 at 18:10
  • **3)** When I'm overwriting my data, if any issue happens, I can place the backup data back as it is, without doing any other changes, so that my prod data is not altered. So, is there anything else I need to take care of here? – Rocky1989 Jul 01 '20 at 18:10
  • Yes, you can place the files back as is. It could happen that the Hive metastore would need to be refreshed. For that you can use MSCK REPAIR TABLE yourtable if you see that the data is not shown when you query. That happens when the Hive metastore is not up to date with the metadata of the files or partitions (https://stackoverflow.com/questions/50832059/msck-repair-hive-external-tables) – Oscar Lopez M. Jul 01 '20 at 18:36
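  • The restore path described above, as a sketch (the `table_name_bak` backup location is a hypothetical name, not from the question):

    ```shell
    # Remove the corrupted data and put the backed-up folders back in place.
    hdfs dfs -rm -r /apps/var/db_name/table_name
    hdfs dfs -cp /apps/var/db_name/table_name_bak /apps/var/db_name/table_name
    ```

    Afterwards, run `MSCK REPAIR TABLE table_name;` in Hive so the metastore re-discovers the `date=.../ref_src=...` partitions if any of them went missing from its metadata.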