
We are creating a POC on the Hadoop framework with Cloudera CDH. We want to load data from multiple clients into Hive tables.

As of now, we have a separate database for each client on SQL Server. This infrastructure will remain the same for OLTP; Hadoop will be used for OLAP. We have some primary dimension tables that are the same for every client, and all client databases have the exact same schema, so these tables contain the same primary key values. Until now this was fine, because each client had a separate database. Now we are trying to load multiple clients' data into the same data container (Hive tables), so if we load the data directly into Hive from the multiple SQL Server databases through a Sqoop job, we will end up with multiple rows having the same primary key value. I am thinking of using a surrogate key in the Hive tables, but Hive does not support auto-increment; it can, however, be achieved with a UDF.
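Roughly what I have in mind is something like the following: a composite key built from a client identifier plus the source primary key, rather than a true auto-increment. This is only a sketch; the `dw` database, the `t_user_staging` table and the column names are placeholders, and the staging table is assumed to be populated by the Sqoop job.

```bash
# Sketch only -- dw, t_user_staging and the column names are placeholders.
# Instead of an auto-increment value, build the surrogate key by combining a
# client identifier with the original SQL Server primary key.
hive -e "
CREATE TABLE IF NOT EXISTS dw.t_user (
  user_sk   STRING,  -- surrogate key: '<client_id>-<source primary key>'
  user_id   INT,     -- original SQL Server primary key, kept for traceability
  client_id STRING,
  user_name STRING
);

INSERT INTO TABLE dw.t_user
SELECT concat(client_id, '-', CAST(user_id AS STRING)),
       user_id,
       client_id,
       user_name
FROM   dw.t_user_staging;
"
```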

We don't want to modify the SQL Server data, as it is live production data.

a. What is the standard/generic way to load multiple clients' data into the Hadoop ecosystem?

b. How can the primary key of a SQL Server table be easily mapped to a Hive table?

c. How can we ensure that one client is never able to see another client's data?

Thanks

  • Do you really expect to have a "primary key" in Hive? And run an R-OLAP query with multiple joins on multiple dimensions??? Good luck... – Samson Scharfrichter Jan 27 '16 at 19:02
  • *"All client database has schema"* - do you mean *the same exact schema*? – Samson Scharfrichter Jan 27 '16 at 19:04
  • Do you have any real requirement to merge the data from different clients in the same DB and the same tables? In that case, did you consider using specific partition(s) for each client in every fact table? Otherwise, why not create a dedicated Fact DB per client plus a common DB for common Dimensions? (A rough sketch of both layouts follows after these comments.) – Samson Scharfrichter Jan 27 '16 at 19:07
  • @SamsonScharfrichter All client databases have the same schema. Suppose the two clients are C1 and C2. Each SQL Server database has the same T_user table, and a user can have 1 as its primary key in both C1 and C2. If we store this data in one Hive table, we can end up inserting 1 for two different clients' users. Is there any standard solution or framework to map SQL Server keys to Hive tables? – 107 Jan 28 '16 at 07:35
  • The standard solution is to **do the work of an IT architect**... – Samson Scharfrichter Jan 28 '16 at 07:52
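For illustration only, the two layouts suggested in the comments might look roughly like this in Hive (all database and table names here are invented):

```bash
# Hypothetical names throughout -- this only illustrates the two layouts above.

# Layout 1: a dedicated database per client, plus a common database for shared dimensions.
hive -e "
CREATE DATABASE IF NOT EXISTS common_dims;
CREATE DATABASE IF NOT EXISTS client_c1;
CREATE DATABASE IF NOT EXISTS client_c2;
"

# Layout 2: shared fact tables, with one partition per client (and per load date).
hive -e "
CREATE TABLE IF NOT EXISTS dw.fact_orders (
  order_id     INT,    -- primary key from the source SQL Server database
  user_id      INT,
  order_amount DOUBLE
)
PARTITIONED BY (client_id STRING, load_date STRING);
"
```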

1 Answer


@Praveen: Use mappers to avoid long load times when moving each client's data to the Hadoop servers; the client data holds the primary keys in this case, so Sqoop can split on them. Make good use of partitions for each client, together with a date partition. You have to set up a TDE zone for the HDFS file location before you start the Sqoop import. (*TDE: Transparent Data Encryption zone, a best practice for keeping your clients' data in a secured zone.)
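A minimal sketch of that flow, assuming hypothetical key, path, host, table, and credential names (the encryption zone has to be created on an empty directory before any data lands in it; adjust the table layout and locations to your own design):

```bash
# All names (key, paths, host, database, credentials, dates) are hypothetical.

# 1. Create an encryption key and an HDFS encryption zone (TDE) for this client,
#    so the imported files are encrypted at rest.
hadoop key create client_c1_key
hdfs dfs -mkdir -p /data/dw/client_c1
hdfs crypto -createZone -keyName client_c1_key -path /data/dw/client_c1

# 2. Sqoop the client's table into that zone.
sqoop import \
  --connect 'jdbc:sqlserver://sqlhost:1433;databaseName=ClientC1' \
  --username etl_user -P \
  --table T_user \
  --target-dir /data/dw/client_c1/t_user/load_date=2016-01-28 \
  --fields-terminated-by ','

# 3. Expose the imported files to Hive as one partition of a shared external table,
#    partitioned by client and by load date.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS dw.t_user_raw (
  user_id   INT,
  user_name STRING
)
PARTITIONED BY (client_id STRING, load_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/dw/t_user_raw';

ALTER TABLE dw.t_user_raw ADD IF NOT EXISTS
PARTITION (client_id='C1', load_date='2016-01-28')
LOCATION '/data/dw/client_c1/t_user/load_date=2016-01-28';
"
```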

  • Could you describe what the mapper is? Are you talking about mappers and reducers? Any pointers about mappers and how to use them to manage multiple clients' data? – 107 Jan 28 '16 at 11:22
  • You can use multiple mappers if you have a primary key in your source table. Use the '-m' argument to pass the number of mappers. The mappers transfer your table data in parallel: based on the number of mappers, the rows are split into ranges and imported as distributed blocks (parallel computing). There is no custom reducer involved in a Sqoop data import; the identity reducer does its default job internally. Look up the '-m' parameter usage in the Sqoop documents (see the sketch below). – Kiran teja Avvaru Jan 29 '16 at 11:46
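For example, a hypothetical import that uses those options (connection details made up; 8 is just an illustrative mapper count):

```bash
# Hypothetical example of the -m / --split-by options described above.
# Sqoop splits T_user into 8 ranges on its primary key column and runs
# 8 map tasks in parallel; a Sqoop import is a map-only job.
sqoop import \
  --connect 'jdbc:sqlserver://sqlhost:1433;databaseName=ClientC1' \
  --username etl_user -P \
  --table T_user \
  --split-by user_id \
  -m 8
```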