
The command is:

count 'tableName'

It is very slow at getting the total row count of the whole table.

My situation is:

  • I have one master and two slaves; each node has 16 CPUs and 16 GB of memory.

  • My table has only one column family with two columns: title and content.

  • The title column holds at most 100 bytes; the content column may hold up to 5 MB.

  • Right now the table has 1550 rows, and every time I count the rows it takes about 2 minutes.

I'm very curious why HBase is so slow at this operation; I suspect it's even slower than MySQL. Is Cassandra faster than HBase at these operations?


2 Answers


First of all, you have a very small amount of data. At that volume, IMO, using NoSQL gives you no advantage. Your test is not appropriate for judging the performance of HBase and Cassandra; both have their own use cases and sweet spots.

The count command in the HBase shell runs a single-threaded Java client that scans the table to count rows. Still, I am surprised that it takes 2 minutes to count 1550 rows. If you would like to count faster (for a bigger dataset), you should run HBase's RowCounter MapReduce job:

bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter tableName
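
If I recall the usage correctly (RowCounter <tablename> [<column1> <column2>...]), RowCounter also accepts optional column arguments, so you can restrict the scan to the small title column instead of shipping the big content cells; here cf is a placeholder for your column family name:

bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter tableName cf:title

Note that with a column restriction, rows missing that column will not be counted.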

Anil Gupta
  • Thank you so much Anil. One more question, please: when the MapReduce job runs via RowCounter, does it query the HDFS files directly or query the RegionServers? Thanks – Jack Apr 28 '15 at 23:01
  • No, it's not reading HDFS directly. The RowCounter MapReduce job uses HBase's Java API to read the data of the HBase table. – Anil Gupta Apr 29 '15 at 04:15
  • But if it talks directly with HBase, why can't I find any trace of HConnection in the class's source code? Also, the key type is ImmutableBytesWritable, which is very Hadoop-style. I can imagine that connecting to the RegionServers would take advantage of HBase's cache layer, while connecting directly to HDFS would reduce bandwidth overhead. What do you think? Thanks a lot! – Jack May 01 '15 at 00:04
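
For readers following this thread, below is a minimal sketch of a RowCounter-style job, assuming an HBase 1.x-style client API; the table name "tableName" is a placeholder. It may help resolve the question above: TableMapReduceUtil wires a client Scan into the job (the HBase TableInputFormat opens the connection internally, which is presumably why no explicit HConnection appears in RowCounter's own source), and the ImmutableBytesWritable/Result pair Jack noticed is simply the key/value type that TableInputFormat hands to every mapper. So the rows flow through the RegionServers, not from HFiles on HDFS.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class SimpleRowCounter {

      // Input key/value types (ImmutableBytesWritable row key, Result row
      // contents) come from HBase's TableInputFormat, i.e. the client API.
      static class CounterMapper extends TableMapper<ImmutableBytesWritable, Result> {
        enum Counters { ROWS }

        @Override
        protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
            throws IOException, InterruptedException {
          // One map() call per row: just bump a job counter, emit nothing.
          context.getCounter(Counters.ROWS).increment(1);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "simple-row-counter");
        job.setJarByClass(SimpleRowCounter.class);

        // Only fetch the first cell of each row: enough to count it.
        Scan scan = new Scan();
        scan.setFilter(new FirstKeyOnlyFilter());

        // Wires the scan to the mapper via the HBase client, one map task
        // per region, so each RegionServer serves its own rows.
        TableMapReduceUtil.initTableMapperJob(
            "tableName", scan, CounterMapper.class,
            ImmutableBytesWritable.class, Result.class, job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }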

First of all, keep in mind that to make use of data locality your "slaves" (better known as RegionServers) must also have the DataNode role; not doing so is a performance killer.

For performance reasons, HBase does not maintain a live counter of rows. To perform a count, the HBase shell client needs to retrieve all the data, which means that if your average row has 5 MB of data, the client will retrieve 5 MB × 1550 rows from the RegionServers just to count them, which is a lot.

To speed it up you have 2 options:

  • If you need realtime responses, you can maintain your own live row counter using HBase atomic counters: each time you insert a row you increment the counter, and each time you delete one you decrement it. The counter can even live in the same table; just use another column family to store it (see the sketch after this list).

  • If you don't need realtime, run the distributed RowCounter MapReduce job (source code), forcing the scan to use only the smallest column family and column available so it avoids reading the big rows: each RegionServer will read its locally stored data and no network I/O will be required for the row contents. In this case you may need to add a new column with a small value to your rows if you don't have one yet (a boolean is your best option); see the command after this list.
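
A minimal sketch of the first option, assuming the HBase 1.x client API; the names "tableName", "meta", and "rowcount" are placeholders chosen for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LiveRowCounter {
      // Separate column family dedicated to bookkeeping, as suggested above.
      private static final byte[] COUNTER_CF  = Bytes.toBytes("meta");
      private static final byte[] COUNTER_COL = Bytes.toBytes("rowcount");
      private static final byte[] COUNTER_ROW = Bytes.toBytes("__counter__");

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("tableName"))) {

          // Call with +1 after every insert; use -1 after every delete.
          // The increment is atomic on the RegionServer, so concurrent
          // writers stay consistent without client-side locking.
          table.incrementColumnValue(COUNTER_ROW, COUNTER_CF, COUNTER_COL, 1L);

          // Reading the count back is a single-row Get: realtime, O(1).
          Result r = table.get(new Get(COUNTER_ROW));
          long count = Bytes.toLong(r.getValue(COUNTER_CF, COUNTER_COL));
          System.out.println("row count: " + count);
        }
      }
    }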
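
And a sketch of the second option; "cf" and "flag" are placeholders for your column family and the small boolean column you would add:

bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter tableName cf:flag

Because only that tiny cell is fetched per row, the 5 MB content cells never have to leave the RegionServers.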

Rubén Moraleda
  • Thank you so much Rubén, your answer is very helpful! I wish StackOverflow allowed selecting multiple answers; I would accept yours as well :) – Jack Apr 28 '15 at 21:15