
I want to have separate data nodes for different projects (I don't want to share data nodes between two projects).

I can see an option in Cloudera Manager to have two different clusters under a shared Cloudera Manager. So can I have a separate cluster of just data nodes and share the HDFS and YARN master services with the earlier cluster? The YARN/HDFS master services would have to keep two separate sets of fsimages/edit logs, and the ResourceManager too, I guess (or whatever master-server configuration I have on my master nodes would be shared with the new cluster's data nodes and the client processes installed on them).

Is it possible? Has anyone done this before, and how is the performance? I am referring to this document: [Cloudera documentation for multi-cluster using one CM](http://www.cloudera.com/documentation/archive/manager/4-x/4-5-1/Cloudera-Manager-Enterprise-Edition-User-Guide/cmeeug_topic_6.html)

Also, can we set some rule for HDFS to store/use a particular set of data nodes for a particular set of data/directories only, so that separation can be achieved?

Thanks in advance.

Yogesh

1 Answer


The document you reference is about managing multiple independent clusters using one Cloudera Manager installation. I don't believe what you're looking to do is possible; it's not the way Hadoop is designed to work. Multi-tenancy on Hadoop is becoming much easier in the upcoming 5.7 and 5.8 releases of CM and CDH. Even if you did manage it, and I'm not sure that you could, the performance would be pretty bad.

The typical thought process is to run your daemons (MR, Hive, Impala) as close to the data as you possibly can. If your concern is having different data nodes for different clients, you can solve that easily, without trying to mesh clusters, with quotas and good security in the form of Kerberos and Sentry. What services are you most interested in running? YARN itself is just a resource manager, so I'm guessing you're looking right now at MapReduce and HDFS. Do you plan to do any analytics? You'd want to use Hive or Impala for that.
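For example, a minimal sketch of quota-based separation per project (the paths, groups, and sizes below are placeholders, not anything from your cluster):

```
# Create one HDFS directory per project (paths are hypothetical).
hdfs dfs -mkdir -p /projects/alpha /projects/beta

# Lock each directory down to its own group (owners/groups are placeholders).
hdfs dfs -chown -R alpha_svc:alpha_users /projects/alpha
hdfs dfs -chown -R beta_svc:beta_users /projects/beta
hdfs dfs -chmod -R 770 /projects/alpha /projects/beta

# Cap the raw space (replication included) each project may consume.
hdfs dfsadmin -setSpaceQuota 10t /projects/alpha
hdfs dfsadmin -setSpaceQuota 5t /projects/beta

# Optionally cap the number of files and directories as well.
hdfs dfsadmin -setQuota 1000000 /projects/alpha

# Verify what's set.
hdfs dfs -count -q /projects/alpha /projects/beta
```

With Kerberos and Sentry layered on top, each project authenticates as its own principal and can only touch its own directories, which gets you tenant separation without splitting the cluster.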

JasonS
  • Planning to use MapReduce and Spark only. I just want my newly added data nodes to utilize the existing YARN/HDFS NN services so that we don't have to invest in 2-3 new server nodes (NN, SNN, YARN RM, edge node, etc.), and support activity will be smoother. The existing projects are utilizing the RAM/CPU resources completely, so I just wanted to check whether it's possible to avoid sharing load with the already overloaded cluster data nodes (CPU and RAM utilization are at 90% on the existing data nodes, so it makes sense to have a separate processing flow with separate DNs if possible; one way to do that with YARN node labels is sketched after these comments). – Yogesh May 12 '16 at 07:16
  • Also, can we set some rule for HDFS to store/use a particular set of data nodes for a particular set of data/directories only, so that separation can be achieved? – Yogesh May 12 '16 at 07:36
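If the main goal in the comments is to keep the new project's processing off the already overloaded workers, YARN node labels are one way to pin a queue's containers to the newly added nodes, assuming a YARN version that supports them (node labels landed upstream in Hadoop 2.6 and matured in later releases). A hedged sketch, where the label, host, and queue names are placeholders:

```
# Register a label for the new project's worker nodes
# ("projB" and all host names below are placeholders).
yarn rmadmin -addToClusterNodeLabels "projB(exclusive=true)"

# Assign the newly added workers to that label.
yarn rmadmin -replaceLabelsOnNode "worker-new-01=projB worker-new-02=projB"

# In capacity-scheduler.xml, give a dedicated queue access to the
# labeled nodes, then submit the new project's jobs to that queue:
#   yarn.scheduler.capacity.root.queues = default,projB
#   yarn.scheduler.capacity.root.projB.accessible-node-labels = projB
#   yarn.scheduler.capacity.root.projB.accessible-node-labels.projB.capacity = 100
yarn rmadmin -refreshQueues
```

Note that this isolates compute only: HDFS still spreads a directory's blocks across all data nodes in the cluster, so there is no supported rule that pins a directory's data to a specific subset of data nodes.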