
Say I start a cluster on Amazon Elastic MapReduce with one master node instance, 2 core node instances and 15 task node instances.

I think I uploaded around 1 TB of data into HBase using MapReduce jobs and incremental uploads.

Now -

  1. How do I find the table size and the size of each region after splitting (in bytes)? Normally on CDH I would run hadoop fs -du /hbase, but there is no /hbase directory on my master node.

  2. I am also curious to know how region server allocation will work. Even if I have 100 regions, if I have only 1 master node, does that mean all the I/O will be throttled?

Thanks and regards

Run2

1 Answer


Did you start up an HBase cluster on Amazon AWS using Elastic MapReduce, or just a Hadoop cluster?

  1. "hadoop fs -du /hbase" does work for me on HBase-on-EMR. Can you double-check?
  2. If you haven't pre-split regions, HBase will take care of splitting for you. As for I/O throttling, have a look at the HBase documentation and videos: when a client needs to read from or write to HBase, it looks up region locations in -ROOT- and .META., caches them, and then contacts the region servers directly instead of going through the master.
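
For the size question, here is a minimal Python sketch that summarizes the output of "hadoop fs -du /hbase" per entry. It assumes the first column of each line is a size in bytes and the last column is the path; the exact column layout differs between Hadoop versions, so check it against your own output. The host name and table name in the sample are hypothetical.

```python
import posixpath


def parse_du(output):
    """Map each entry listed by `hadoop fs -du` to its size in bytes."""
    sizes = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines and headers such as "Found 3 items".
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        # Assumption: first column is bytes, last column is the path.
        name = posixpath.basename(fields[-1].rstrip("/"))
        sizes[name] = int(fields[0])
    return sizes


def human(n):
    """Format a byte count using binary units (1 KB = 1024 B)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024 or unit == "TB":
            return f"{n:.1f} {unit}"
        n /= 1024


# Hypothetical sample; on a real cluster, capture the text with e.g.
# subprocess.run(["hadoop", "fs", "-du", "/hbase"],
#                capture_output=True, text=True).stdout
sample = """Found 3 items
1099511627776  hdfs://master:9000/hbase/mytable
2048  hdfs://master:9000/hbase/-ROOT-
4096  hdfs://master:9000/hbase/.META.
"""

if __name__ == "__main__":
    for name, size in sorted(parse_du(sample).items()):
        print(f"{name}: {human(size)}")
```

In the HBase layout that still has -ROOT- and .META., running the same du command on a table's directory (for example /hbase/mytable) lists one subdirectory per region, so the same parser should give per-region sizes.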
Suman
  • Suman, sorry, I saw this answer only after a long period of inactivity on StackExchange. My question was about the allocation of region servers in EMR. The client knows the data node, true, but in the case I described there are many more regions than data nodes, so you cannot run as many region servers as there are regions on the data nodes. So how do region servers get allocated? Does the master node run one or more region server processes? – Run2 May 10 '14 at 11:15