
I am saving 14.5 million records to HBase. Each row has 20+ columns. I first tried inserting 0.7 million records, which went very smoothly and finished in 1.7 minutes.

Then I tried to insert the actual, full data set of 14.5 million records. When I insert all of it at once, it takes a long time; it ran for 1.5 hours.

Spark is my programming model. I tried both saveAsNewAPIHadoopDataset with TableOutputFormat and Cloudera's hbase-spark bulkPut.
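Roughly what my TableOutputFormat write path looks like (a simplified sketch; the table name my_table, column family cf, and the (rowkey, column, value) triples are placeholders, not my real schema):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object HBaseWriteSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-write"))

    // Standard TableOutputFormat setup; hbase-site.xml is expected on the classpath.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
    val job = Job.getInstance(hbaseConf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // Placeholder input: (rowkey, column qualifier, value) triples.
    val records = sc.parallelize(Seq(("row1", "c1", "v1"), ("row2", "c2", "v2")))

    // Turn each record into a Put keyed by its rowkey and write the whole RDD.
    val puts = records.map { case (rowKey, col, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(col), Bytes.toBytes(value))
      (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put)
    }

    puts.saveAsNewAPIHadoopDataset(job.getConfiguration)
  }
}
```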

Both seem to perform about the same. I am running on an 8-node cluster with 8 region servers and a single column family. I have assigned a 4 GB heap to both the region servers and the master.

I am not sure if I am missing anything, or if HBase really chokes on a huge insert all at once.

Please share your thoughts. I am also planning to install the Phoenix layer so that I can use the DataFrame abstraction directly over HBase data and save DataFrames straight to HBase.
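What I have in mind for Phoenix is the phoenix-spark plugin's DataFrame save, roughly like this sketch (the table name, ZooKeeper quorum, and the DataFrame itself are placeholders):

```scala
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.{SparkConf, SparkContext}

object PhoenixSaveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("phoenix-save"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Placeholder DataFrame; column names must match the Phoenix table's columns.
    val df = sc.parallelize(Seq((1L, "a"), (2L, "b"))).toDF("ID", "COL1")

    df.write
      .format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite)            // the phoenix-spark save path uses Overwrite
      .option("table", "OUTPUT_TABLE")     // placeholder Phoenix table name
      .option("zkUrl", "zk-host:2181")     // placeholder ZooKeeper quorum
      .save()
  }
}
```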

I am still struggling to understand how HBase can choke on just 14.5 million records. The data is only around 9 GB.

Srini

2 Answers


Maybe you did not pre-split your table, so HBase uses only one region server to write the data?

Please check the table's split count. If it has only one split, you can split it after you insert 1 million records, truncate the table, and then insert all of your data; truncating a table does not change the split count, it only deletes the data. Since you have 8 nodes, you need at least 8 splits in your table.
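A minimal sketch of creating a pre-split table from the HBase client API (table and column family names are placeholders; UniformSplit is the same split algorithm RegionSplitter uses):

```scala
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.util.RegionSplitter
import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}

object PreSplitTable {
  def main(args: Array[String]): Unit = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val admin = conn.getAdmin

    // Placeholder table and column family names.
    val desc = new HTableDescriptor(TableName.valueOf("my_table"))
    desc.addFamily(new HColumnDescriptor("cf"))

    // 7 split points -> 8 regions spread uniformly over the byte range,
    // so writes can go to all 8 region servers from the start.
    val splitPoints = new RegionSplitter.UniformSplit().split(8)
    admin.createTable(desc, splitPoints)

    admin.close()
    conn.close()
  }
}
```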

halil
  • Hi Halil, I have split using the below command: hbase org.apache.hadoop.hbase.util.RegionSplitter table_name UniformSplit -c 8 -f column_family. But still, most of the requests go through only one region server when I check the HBase Master UI. – Srini Aug 03 '16 at 10:59
  • If all your requests are going to a single node, maybe it is hotspotting. Do your row keys all start the same, or are they slightly different? – Alexi Coard Aug 03 '16 at 11:22
  • What is your rowkey format? If it starts with a timestamp, it causes hotspotting and you should change it. – halil Aug 03 '16 at 13:16
  • Thanks halil, I figured it out. The rowkey starting characters are similar; I presumed the data would have enough randomness, but it did not. I set the pre-split the other way, using the characters directly, and configured the rowkey properly. – Srini Aug 03 '16 at 14:13
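For anyone hitting the same hotspotting issue, one common approach (not necessarily what Srini did) is to salt the rowkey so that keys with a common prefix still spread across the pre-split regions. A rough sketch, where the bucket count and key layout are assumptions:

```scala
import org.apache.hadoop.hbase.util.Bytes

object SaltedKeys {
  // Hypothetical helper: prefix the natural key with a one-byte salt derived
  // from its hash, spread over the 0x00-0xFF range so the salts line up with
  // UniformSplit region boundaries (8 buckets -> salts 0x00, 0x20, ..., 0xE0).
  def saltedRowKey(naturalKey: String, buckets: Int = 8): Array[Byte] = {
    val salt = (((naturalKey.hashCode & Int.MaxValue) % buckets) * (256 / buckets)).toByte
    Array(salt) ++ Bytes.toBytes(naturalKey)
  }

  def main(args: Array[String]): Unit = {
    // Keys sharing a prefix get different salts and land in different regions.
    Seq("user_0001", "user_0002", "user_0003").foreach { k =>
      println(Bytes.toStringBinary(saltedRowKey(k)))
    }
  }
}
```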

Have you thought about Splice Machine?

https://github.com/splicemachine/spliceengine

It can import around 100K records per node per second into HBase, and it has a really simple bulk import command:

http://doc.splicemachine.com/Administrators/ImportingData.html

It uses Spark internally for imports, compactions, and large queries.

One other thing to think about is how you are storing the data in HBase. Storing each column separately can take up a lot of space.

Good luck...

John Leach