Dealing with Master Data Updates in Hadoop

Question

I've started on an analytics project. The use case is to understand customer buying patterns, and the data sources include web logs and relational databases (which hold the product master and customer master). The relational database team and the Hadoop team are entirely separate. During the architecture discussion, the plan discussed was that the master data (Product, Customer) would be a one-time load, that incremental updates would arrive through a daily Sqoop import from Oracle to HDFS, and that Hive would be used to generate a current view (with all the latest product changes). We started with the product details.

The product master is approximately 10 GB on the Oracle side.
The daily increment varies from 5 MB to 100 MB.

Based on my understanding, creating such small files would put a load on the NameNode in the long run.

Has anybody come across such a solution, and how are you handling it?

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

I don't see any problem yet. If you start with one big file and add one file each day, you will end up with only a few hundred files after a year (on the order of 1000 even if each daily import writes a few part files), which isn't a problem (at least not for the NameNode).
Still, it's not optimal to hold small files in HDFS, no matter how many there are.
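
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per file or block object and a 128 MB block size; both figures are assumptions rather than values from the question, and the function names are made up for illustration.

```python
import math

# Assumptions (not from the question): rule-of-thumb heap cost per
# NameNode object and a 128 MB HDFS block size.
BYTES_PER_OBJECT = 150
BLOCK_SIZE_MB = 128


def blocks_for(size_mb):
    """Number of HDFS blocks a file of the given size occupies (at least 1)."""
    return max(1, math.ceil(size_mb / BLOCK_SIZE_MB))


def namenode_bytes(initial_mb, daily_mb, days, files_per_day=1):
    """Rough NameNode heap used by one initial file plus the daily increments."""
    daily_files = days * files_per_day
    files = 1 + daily_files
    blocks = blocks_for(initial_mb) + daily_files * blocks_for(daily_mb / files_per_day)
    return (files + blocks) * BYTES_PER_OBJECT


# 10 GB initial load, 100 MB per day (the upper bound quoted), one year.
one_file_per_day = namenode_bytes(10 * 1024, 100, 365)
# Same volume, but each daily import split into four part files.
four_files_per_day = namenode_bytes(10 * 1024, 100, 365, files_per_day=4)
print(one_file_per_day, four_files_per_day)
```

Under these assumptions both estimates stay well below a megabyte of heap, which fits the point that the file count alone is not a NameNode concern at this scale.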
I'd recommend you take an application-level approach to this and merge the files after enough time has passed. For example:

Create monthly partitions on your table (product master). Each day, insert the new file into the table; after the month has ended, run an INSERT OVERWRITE to write the data back into the same partition, so that the daily files are rewritten as a merged set.
If applying the updates is not a simple insert and involves more complex logic, the solution might be to create the master table, copy the incremental data to an HDFS location, and create an external table on that location.
Then combine the two tables with a UNION ALL in a view, and create a loading process that moves the data from the HDFS location into the master table once in a while, whenever that is possible.
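
The master table plus incremental table plus view arrangement can be sketched as follows. This is only an illustration: SQLite (through Python's `sqlite3` module) stands in for Hive, and the table names, column names, and the `updated_at` column used to pick the latest version of a row are all hypothetical. In Hive the periodic load would more likely be an INSERT OVERWRITE than the delete-and-reinsert shown here.

```python
import sqlite3

COLUMNS = "product_id, name, price, updated_at"


def build_demo():
    """Create a master table, an incremental table, and the combining views."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(f"""
        CREATE TABLE product_master (product_id INTEGER, name TEXT, price REAL, updated_at TEXT);
        CREATE TABLE product_incremental (product_id INTEGER, name TEXT, price REAL, updated_at TEXT);

        -- Every version of every row: master plus not-yet-merged increments.
        CREATE VIEW product_all AS
            SELECT {COLUMNS} FROM product_master
            UNION ALL
            SELECT {COLUMNS} FROM product_incremental;

        -- Current view: keep only the newest version of each product.
        CREATE VIEW product_current AS
            SELECT a.product_id, a.name, a.price, a.updated_at
            FROM product_all a
            JOIN (SELECT product_id, MAX(updated_at) AS max_ts
                  FROM product_all GROUP BY product_id) latest
              ON a.product_id = latest.product_id
             AND a.updated_at = latest.max_ts;
    """)
    conn.executemany(
        "INSERT INTO product_master VALUES (?, ?, ?, ?)",
        [(1, "Widget", 9.99, "2020-01-01"), (2, "Gadget", 19.99, "2020-01-01")],
    )
    conn.executemany(
        "INSERT INTO product_incremental VALUES (?, ?, ?, ?)",
        [(2, "Gadget", 17.99, "2020-06-01"), (3, "Gizmo", 4.5, "2020-06-02")],
    )
    conn.commit()
    return conn


def current_rows(conn):
    """Read the current view, one row per product."""
    return conn.execute(
        "SELECT product_id, name, price FROM product_current ORDER BY product_id"
    ).fetchall()


def compact(conn):
    """Periodic load: fold the increments into the master, then clear them."""
    conn.executescript(f"""
        BEGIN;
        CREATE TEMP TABLE snapshot AS SELECT {COLUMNS} FROM product_current;
        DELETE FROM product_master;
        INSERT INTO product_master SELECT {COLUMNS} FROM snapshot;
        DELETE FROM product_incremental;
        DROP TABLE snapshot;
        COMMIT;
    """)


conn = build_demo()
print(current_rows(conn))   # view over master + increments
compact(conn)
print(current_rows(conn))   # same rows, now served from the master alone
```

Readers query only the current view, so the periodic load can run whenever it is convenient without changing what they see. Note that this sketch would return duplicates if two versions of a product shared the same `updated_at` value; a real job would need a tie-breaker.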

If you do run into NameNode contention caused by too many small files, you can read about ways of solving the "small files problem" here.
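
As a file-level picture of the merge idea (many daily files in one monthly directory collapsed into a single file), here is a minimal sketch. It uses the local filesystem as a stand-in for HDFS, the directory layout and file names are hypothetical, and it assumes every daily file ends with a newline so that plain concatenation keeps rows intact.

```python
import tempfile
from pathlib import Path


def compact_partition(partition_dir, merged_name="part-merged"):
    """Concatenate all files in a partition directory into one file."""
    partition_dir = Path(partition_dir)
    parts = sorted(p for p in partition_dir.iterdir() if p.is_file())
    # Write the merged data outside the partition first, so a failure
    # halfway through leaves the original daily files untouched.
    staging = partition_dir.parent / (partition_dir.name + ".staging")
    with staging.open("wb") as out:
        for part in parts:
            out.write(part.read_bytes())
    for part in parts:
        part.unlink()
    target = partition_dir / merged_name
    staging.replace(target)
    return target


def demo():
    """Build a three-day partition, compact it, and report the result."""
    with tempfile.TemporaryDirectory() as root:
        month = Path(root) / "month=2020-06"
        month.mkdir()
        for day in ("01", "02", "03"):
            (month / f"2020-06-{day}.csv").write_text(f"row-from-{day}\n")
        merged = compact_partition(month)
        return [p.name for p in month.iterdir()], merged.read_text()


names, text = demo()
print(names)
print(text, end="")
```

On a real cluster the same effect would normally come from rewriting the partition through Hive rather than from touching the files directly.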
