
I am using Spring + DataNucleus JDO + HBase. HBase is running in fully distributed mode on two nodes, and I am facing serious performance issues.

My webapp is essentially a pinger: it keeps pinging URLs and stores their responses, so it runs multiple threads that INSERT into the db. I have observed that once the number of concurrent writes exceeds around 20, the inserts start taking a very long time (some take even 1000 secs). When this happens, READs start failing too and my webapp cannot extract any data from the db (it hangs). I am not much of a NoSQL db guy, so I do not know where to start looking for the performance problem.

My major configuration settings are:

  • ZooKeeper quorum size: 1
  • HBase regionservers: 2
  • Data nodes: 2
  • hbase.zookeeper.property.maxClientCnxns: 400
  • Replication factor: 3
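For reference, the non-default settings would look roughly like this in hbase-site.xml and hdfs-site.xml (a sketch of the relevant entries, assuming the standard Hadoop/HBase config file layout, not my exact files):

```xml
<!-- hbase-site.xml -->
<property>
  <name>hbase.zookeeper.property.maxClientCnxns</name>
  <value>400</value>
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```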

Do I need to increase the heap size for HBase? Should a high WRITE throughput affect READs like this?

Am I doing something wrong with the configuration? At this point it seems writing to a flat file would be faster than writing to HBase. This is my last shot at HBase; please help.

Akash Bhunchal

2 Answers


The big problem I see is that you are running HBase on 2 nodes with a replication factor of 3 (in effect just 2, as there are only 2 nodes to replicate to). This means all writes must be replicated to both nodes. HBase really needs at least 5 or so nodes to get going.

It sounds like you are filling up your first region and it is splitting; during the split, once the MemStore fills up, your writes will start blocking. You should look into creating your table pre-split into multiple regions, which will give you an even distribution of writes.

I recommend taking a look at the HBase book's chapter on performance, specifically the part on pre-splitting tables.
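As an illustration, here is a minimal sketch of pre-splitting with the Java client (the table and family names are made up, and the split points assume hex-encoded, UUID-like rowkeys):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Hypothetical table holding one row per ping response.
        HTableDescriptor desc = new HTableDescriptor("pings");
        desc.addFamily(new HColumnDescriptor("response"));

        // Start with four regions instead of one region that has to split
        // under load; the boundaries assume hex-prefixed rowkeys.
        byte[][] splits = {
            Bytes.toBytes("4"),
            Bytes.toBytes("8"),
            Bytes.toBytes("c")
        };
        admin.createTable(desc, splits);
    }
}
```

Note that with sequential rowkeys this would not help, since all new rows would still land in the last region; even splits only pay off when the keys themselves are spread out.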

You should also use compression, and make sure you get native compression working (gzip, LZO, or Snappy); don't use the pure-Java compression or you'll be really, really slow. The link above discusses that a bit.
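Compression is set per column family when the table is created; a sketch using the same made-up names as above (and assuming native LZO is installed on every regionserver):

```java
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class CompressionExample {
    // Builds a table descriptor whose family stores its data LZO-compressed.
    static HTableDescriptor compressedTable() {
        HTableDescriptor desc = new HTableDescriptor("pings");          // hypothetical table
        HColumnDescriptor family = new HColumnDescriptor("response");   // hypothetical family
        family.setCompressionType(Compression.Algorithm.LZO);           // requires the native codec
        desc.addFamily(family);
        return desc;
    }
}
```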

cftarnas
  • @cftarnas It seems I have to do a lot to make this production worthy. One question though: do I need to do any optimization on Hadoop too, besides HBase? I have done the minimal things like increasing the ulimit. I currently cannot run HBase on more than two nodes; would decreasing the replication factor to 1 help? – Akash Bhunchal Aug 30 '11 at 15:26
  • If this is just a dev/test install then yes - definitely go to a replication factor of 1. In production you will want (need) more nodes and a replication factor of 3. I also cannot stress enough how important it is to [pre-split your tables](http://ofps.oreilly.com/titles/9781449396107/performance.html#perfsplitcompactpresplit) when you create them. Also consider compression and increasing your region size. – cftarnas Aug 30 '11 at 15:57
  • Thanks for the link. I actually did most of the things mentioned in there and am now able to get good READ performance. The main issue with my app hanging (while reading) was the default value (10) of hbase.regionserver.handler.count on my regionservers (see the hbase-site.xml sketch after this thread). When close to 30 writes were happening, I was not able to READ. I have not done pre-splitting of tables because my biggest table is only about 19 MB and HBase splits a region by default only when it reaches 256 MB. Do you still recommend pre-splitting, given that my WRITE throughput is still very low? – Akash Bhunchal Sep 02 '11 at 03:54
  • I have observed that WRITEs take anywhere between 7 secs and 63 secs on my biggest (~19 MB) table. Another thing I have observed is that the time taken to WRITE increases linearly from 7 secs to 63 secs, then falls back to 7 secs, and keeps oscillating in the same manner. Is this OK? I currently have only one regionserver and one region for my biggest table (the other tables are insignificant, < 1 MB). Would increasing the number of regionservers and pre-splitting the table increase my WRITE throughput? Even writes to my smallest tables take a lot of time (between 7 and 63 secs). – Akash Bhunchal Sep 02 '11 at 04:03
  • Are the keys you are inserting sequential? If not, then yes - creating two splits will allow both of your nodes to work. If they are sequential (and considering your data it sounds like they could be), then you should look into changing your rowkeys. Also take a look at [OpenTSDB](http://opentsdb.net/) - it sounds like it might be very useful for what you are doing. – cftarnas Sep 02 '11 at 04:04
  • What are your regionservers doing during those writes (I/O wait?) Can you post some of your regionserver logs to pastebin? You could also ask over on the [hbase user mailing list](http://hbase.apache.org/mail-lists.html); the gurus there are quite helpful. – cftarnas Sep 02 '11 at 04:07
  • OpenTSDB seems to be exactly what I am looking for :D My keys are not sequential (I am using the NATIVE id generator strategy, which produces UUID-like keys). I think it makes sense to split my tables for WRITEs. Would adding a new regionserver help, given that in my deployment the other regionserver would reside on a separate physical machine (on a different network), since my deployment is on Amazon EC2? And if I create a new regionserver and then split my table, would the regions be created evenly distributed across the two regionservers? Can I control the creation of regions across regionservers? – Akash Bhunchal Sep 02 '11 at 04:46
  • Yes - more regionservers help. You actually do want your regionservers on separate machines, although different networks are not good. Possibly I misunderstand you - are you saying some of your nodes are on EC2 and some are not? If they are all on EC2, just do your best to put them all in the same zone. Generally you cannot control exactly where regions go; they get automatically balanced across all available regionservers. – cftarnas Sep 02 '11 at 05:08
  • All my nodes are on EC2; it's just that they are not in the same geographical location, i.e. one is on the east coast of the US and the other on the west coast. I will try with one more regionserver and a table split. Thanks for the help!!! – Akash Bhunchal Sep 02 '11 at 07:21
  • Having your HBase nodes in two different EC2 regions is definitely causing some of your slowdown - that is quite a bit of network latency between nodes! What are the reasons you have it set up that way? Is there any way they could all be in the same region and availability zone? – cftarnas Sep 02 '11 at 16:45
  • My deployment needed to have machines running in different geographical locations (as I do URL monitoring). I wanted to play around with HBase and could not afford a third machine for that :D Does the usual deployment of such distributed dbs happen on the same physical network? – Akash Bhunchal Sep 05 '11 at 17:14
  • That deployment really handicaps HBase. HBase (like all other BigTable-style systems) is meant to be run as a cluster: it shards data across all nodes depending on rowkey, and data gets distributed even further by HDFS replication. What was your goal in distributing your nodes? – cftarnas Sep 06 '11 at 06:52
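For reference, the handler-count fix mentioned in this thread is a regionserver-side setting in hbase-site.xml; a sketch (the value 50 is illustrative; the default at the time was 10, as noted above, and the regionservers must be restarted for the change to take effect):

```xml
<!-- hbase-site.xml on each regionserver -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>50</value>
</property>
```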

If you're going to write to HBase from multiple threads, you need to make sure you are reusing your HBaseConfiguration as much as possible. Otherwise each thread makes a new connection, and ZooKeeper will eventually stop offering connections until old ones close.

A quick solution is to have a singleton hand the configuration to your HTable objects. This guarantees that the same configuration is used everywhere and will minimize your concurrent connections.
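A minimal sketch of that singleton (the class name is made up; the point is that every HTable is constructed from the same Configuration, so they share the underlying ZooKeeper connection):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public final class HBaseConfigHolder {
    // One Configuration per JVM; HTables built from the same Configuration
    // share their connection to ZooKeeper/HBase instead of opening new ones.
    private static final Configuration CONF = HBaseConfiguration.create();

    private HBaseConfigHolder() {}

    public static Configuration get() {
        return CONF;
    }
}
```

Each worker thread should still create its own HTable (HTable itself is not thread-safe), but every one of them passes HBaseConfigHolder.get() to the constructor, e.g. new HTable(HBaseConfigHolder.get(), "pings").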

Tony
  • I am using DataNucleus as the ORM and am making use of the PersistenceManagerFactory. That is the level of abstraction I am working at; I am not accessing HTable and the associated HBase objects directly. I guess the PMF would be reusing the connections, but I could not find a way to specify connection pooling with DataNucleus for HBase (unlike for an RDBMS). Is connection pooling possible at the level of abstraction I am working at? – Akash Bhunchal Aug 30 '11 at 15:40
  • I'm really not familiar with DataNucleus, so I can't comment on how the connections are being handled. However, you can go to the web console for the HBase master (something like ipofhbasemaster:60010) and see the ZK dump. That will list all active connections to ZooKeeper. If the count exceeds 400 (your limit), new connections will be denied. – Tony Aug 30 '11 at 16:52