I've started on an analytics project. The use cases are to understand customer buying patterns, and the data sources include web logs and relational databases (which hold the product master and customer master). The relational database team and the Hadoop team are entirely separate. The plan that came out of the architecture discussion was that the master data (product, customer) would be a one-time load, that incremental updates would arrive through a daily Sqoop import from Oracle to HDFS, and that Hive would be used to generate a current view (with all the latest product changes). We started with the product details.
- The product master is approximately 10 GB on the Oracle side.
- The daily increment varies from 5 MB to 100 MB.
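
To make the "current view" requirement concrete: the logic amounts to keeping the most recent row per product key across the one-time load and the daily increments. Below is a minimal Python sketch of that reconciliation, not the actual Hive job. The field names `product_id` and `last_modified` and the sample rows are assumptions for illustration only; the real schema may differ.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]


def current_view(base: Iterable[Record],
                 increments: Iterable[Record],
                 key: str = "product_id",
                 ts: str = "last_modified") -> List[Record]:
    """Return one row per key, keeping the row with the greatest timestamp."""
    latest: Dict[object, Record] = {}
    # Base rows are visited first, then the increments.
    for row in list(base) + list(increments):
        k = row[key]
        # Keep the row if the key is new or this row is at least as recent.
        if k not in latest or row[ts] >= latest[k][ts]:
            latest[k] = row
    return sorted(latest.values(), key=lambda r: r[key])


# Hypothetical sample data: a base load plus one daily increment.
base_rows = [
    {"product_id": 1, "name": "Widget", "last_modified": "2024-01-01"},
    {"product_id": 2, "name": "Gadget", "last_modified": "2024-01-01"},
]
increment_rows = [
    {"product_id": 1, "name": "Widget v2", "last_modified": "2024-01-05"},
    {"product_id": 3, "name": "Gizmo", "last_modified": "2024-01-05"},
]

view = current_view(base_rows, increment_rows)
```

In Hive the same idea would typically be expressed as a query over the union of the base and incremental tables that picks the latest row per key.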
Based on my understanding, creating such small files every day would put a load on the NameNode in the long run.
Has anybody come across such a solution, and how are you handling it?
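
For anyone who wants to size the small-file concern, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption rather than a measurement from this cluster: roughly 150 bytes of NameNode heap per file or block object (a commonly quoted rule of thumb), a 128 MB HDFS block size, four output files per daily import, and a three-year retention period.

```python
import math

# Assumption: rule-of-thumb NameNode heap cost per file or block object.
BYTES_PER_OBJECT = 150


def namenode_objects(days: int, files_per_day: int,
                     avg_file_mb: float, block_mb: int = 128) -> int:
    """Count file and block objects created by daily imports."""
    # A file smaller than one block still occupies one block entry.
    blocks_per_file = max(1, math.ceil(avg_file_mb / block_mb))
    # Each file contributes one file object plus its block objects.
    return days * files_per_day * (1 + blocks_per_file)


def namenode_bytes(objects: int) -> int:
    """Estimate heap used by the given number of objects."""
    return objects * BYTES_PER_OBJECT


# Hypothetical scenario: 3 years of daily loads, 4 files per load,
# worst-case 100 MB increment split evenly across the 4 files.
objects = namenode_objects(days=3 * 365, files_per_day=4, avg_file_mb=25.0)
heap_bytes = namenode_bytes(objects)
print(f"{objects} objects, about {heap_bytes / 1024 / 1024:.2f} MiB of heap")
```

The numbers are only as good as the assumptions, so the parameters are worth adjusting to the real file counts before drawing conclusions.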