I have a Hive external table X in HDFS. Files exported from an RDBMS keep arriving in the folder location of table X.

Last week a new column was added in the RDBMS, and the files arrived in the external table's folder with the new column's data.

I know that I should add a new column to the Hive external table in this case.

But how do I prevent files with new columns from arriving in my external folder?

Or at least, how do I recognize that a new column is coming?

AKC
See if this link is useful: https://developer.ibm.com/hadoop/blog/2014/11/07/big-sql-data-ingestion-statements-load-hadoop-statement-best-practices/ – Ravindra babu, Jan 27 '16 at 03:36

2 Answers

You could write a Bash script that checks the number of columns in today's file (I'm assuming your files are partitioned by date).

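expected_n_columns=10
DATE=$(date +%Y-%m-%d)
# Count the commas in the first line of today's files; columns = commas + 1
n_commas=$(hadoop fs -cat /X/$DATE/* | head -n 1 | tr -cd , | wc -c)
n_columns=$((n_commas + 1))
echo "today's file had $n_columns columns."
if [ "$n_columns" -ne "$expected_n_columns" ]; then
    echo "expected $expected_n_columns columns; the schema may have changed." >&2
fi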
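
For the "prevent" half of the question, one option is to land incoming files in a staging directory and move them into the table's folder only after the same column count check passes. The following is a rough sketch of that idea, not a tested pipeline: the function names, the staging path `/staging/X`, and the `$file` variable are all hypothetical, and it assumes plain comma-separated files with no quoted fields that contain commas.

```shell
#!/usr/bin/env bash
# Sketch only: print the number of comma-separated fields in the first
# line read from stdin. Assumes no quoted fields containing commas.
count_columns() {
    awk -F',' 'NR == 1 { print NF; exit }'
}

# Return 0 if the first line on stdin has exactly $1 columns, non-zero otherwise.
has_expected_columns() {
    [ "$(count_columns)" = "$1" ]
}

# Hypothetical usage; /staging/X, /X and $file are assumptions:
# if hadoop fs -cat "/staging/X/$file" | has_expected_columns 10; then
#     hadoop fs -mv "/staging/X/$file" "/X/$file"
# else
#     echo "column count changed in $file, leaving it in staging" >&2
# fi
```

Files that fail the check stay in staging, so the external table never sees them until the Hive schema has been updated.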
maxymoo

Either the source team needs to tell you about the changes they are making, or, if your enterprise has change control and a review board, you need to be part of the change control review for the applications you get data from.

If change control is not possible, your data integration process needs to check the source database tables for changes and notify you when it finds any.
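
One way such a check could look is to keep a snapshot of each table's column names and compare it with the current list on every run. This is only a sketch: `schema_changes` is a hypothetical helper, both arguments are files with one column name per line, and how the current list is produced (for example, from the RDBMS catalog) is left as an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: compare a saved column list ($1) with the current one ($2).
# Prints added columns prefixed with "+ " and removed ones with "- ".
# Returns 0 when the lists match and non-zero when anything changed.
schema_changes() {
    # Lines present in the current list but not in the snapshot.
    added="$(grep -Fxv -f "$1" "$2" || true)"
    # Lines present in the snapshot but not in the current list.
    removed="$(grep -Fxv -f "$2" "$1" || true)"
    if [ -n "$added" ]; then
        printf '%s\n' "$added" | sed 's/^/+ /'
    fi
    if [ -n "$removed" ]; then
        printf '%s\n' "$removed" | sed 's/^/- /'
    fi
    [ -z "$added$removed" ]
}

# Hypothetical usage; both file names are assumptions:
# schema_changes columns.snapshot columns.current || echo "schema changed" >&2
```

The non-zero return makes it easy to hook the check into whatever notification step the integration process already has.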

If you do not need the new columns, extract the data from the source with an explicit column list ("select <columns> from <table>") rather than "select *"; the data for new columns will then not be fetched.
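
As a small illustration of pinning the column list, the extract query can be generated from a fixed set of names, so a column added at the source never reaches the exported files. The helper, table, and column names below are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: build a SELECT with an explicit column list.
# $1 is the table name, the remaining arguments are the column names.
build_select() {
    table="$1"
    shift
    # Join the column names with ", " and trim the trailing separator.
    cols="$(printf '%s, ' "$@")"
    printf 'SELECT %s FROM %s\n' "${cols%, }" "$table"
}

build_select customers id name city
# prints: SELECT id, name, city FROM customers
```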

Finally, it depends on how you want to tackle it. There is no out-of-the-box solution; this is a typical data integration problem that needs a custom solution in line with your organization's practices.

Durga Viswanath Gadiraju