I am saving 14.5 million records to HBase. Each row has 20+ columns. I first tried inserting 0.7 million records, which went very smoothly and finished in 1.7 minutes.
Then I tried to insert the actual, full data set of 14.5 million records. When I insert all of it at once, it takes a long time: the job ran for 1.5 hours.
Spark is my programming model. I tried both saveAsNewAPIHadoopDataset with TableOutputFormat and Cloudera's hbase-spark bulkPut.
Both seem to perform about the same. I am running on an 8-node cluster with 8 region servers and a single column family. I have assigned a 4 GB heap to both the region servers and the master.
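For reference, the two write paths I tried look roughly like the sketch below. The table name "my_table", column family "cf", the Record fields, and the input path are placeholders, not my real schema:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder record; my real rows have 20+ columns.
case class Record(rowKey: String, col1: String)

object HBaseLoad {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-load"))
    val records = sc.textFile("hdfs:///path/to/input")     // placeholder input
      .map { line => val f = line.split(","); Record(f(0), f(1)) }

    val hbaseConf = HBaseConfiguration.create()

    // Path 1: TableOutputFormat + saveAsNewAPIHadoopDataset
    hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
    val job = Job.getInstance(hbaseConf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
    records.map { r =>
      val put = new Put(Bytes.toBytes(r.rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes(r.col1))
      (new ImmutableBytesWritable, put)
    }.saveAsNewAPIHadoopDataset(job.getConfiguration)

    // Path 2: hbase-spark bulkPut
    val hbaseContext = new HBaseContext(sc, hbaseConf)
    hbaseContext.bulkPut[Record](records, TableName.valueOf("my_table"), r => {
      val put = new Put(Bytes.toBytes(r.rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes(r.col1))
      put
    })

    sc.stop()
  }
}
```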
I am not sure if I am missing something, or if HBase really chokes on a huge insert done all at once.
Please share your thoughts. I am also planning to install a Phoenix layer, so that I can use the DataFrame abstraction directly over HBase data and save the DataFrame straight to HBase.
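If I go the Phoenix route, my understanding is that the phoenix-spark connector would let me write a DataFrame directly, along these lines. The table name, columns, and ZooKeeper quorum below are placeholders, and I assume the target table already exists in Phoenix:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("phoenix-save").getOrCreate()
import spark.implicits._

// Placeholder DataFrame; column names must match the Phoenix table's columns.
val df = Seq(("row1", "v1"), ("row2", "v2")).toDF("ID", "COL1")

df.write
  .format("org.apache.phoenix.spark")
  .mode("overwrite")                 // phoenix-spark expects SaveMode.Overwrite; it issues UPSERTs
  .option("table", "MY_TABLE")       // placeholder table name
  .option("zkUrl", "zkhost:2181")    // placeholder ZooKeeper quorum
  .save()
```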
I am still struggling to understand how HBase can choke on just 14.5 million records. The data is only around 9 GB.