I am benchmarking HBase and Spark performance and am stuck at a point where Spark takes more than 10 minutes to read and aggregate 1 million rows out of the 97 million present. Details of the configuration and data are below.
Hardware:
Total: 4 servers
1 master
3 slaves
Each server's configuration:
CPU: Intel i5 processor
RAM: 16 GB
HDD: 1 TB (not SSD)
Software:
HBase - 1.2.1
Spark - 1.6.2
Phoenix - 4.8.1
Hadoop - 2.6
Data size:
Total regions: 20
Total data size: 25 GB
Total rows: 97 million
We have applied the performance-tuning settings I gathered from the HBase reference guide and other online sources; the important entries from hbase-site.xml are below:
<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value>
</property>
<property>
  <name>hbase.client.scanner.caching</name>
  <value>10000</value>
</property>
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>30</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.3</value>
</property>
<property>
  <name>hbase.storescanner.parallel.seek.enable</name>
  <value>true</value>
</property>
<property>
  <name>hbase.storescanner.parallel.seek.threads</name>
  <value>20</value>
</property>
<property>
  <name>hbase.regionserver.wal.codec</name>
  <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
</property>
<property>
  <name>phoenix.query.timeoutMs</name>
  <value>7200000</value>
</property>
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>10737418240</value>
</property>
<property>
  <name>hbase.thrift.connection.max-idletime</name>
  <value>1800000</value>
</property>
<property>
  <name>hbase.client.write.buffer</name>
  <value>20971520</value>
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>7200000</value>
</property>
<property>
  <name>phoenix.schema.dropMetaData</name>
  <value>false</value>
</property>
<property>
  <name>phoenix.query.keepAliveMs</name>
  <value>7200000</value>
</property>
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>7200000</value>
</property>
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>7200000</value>
</property>
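(Note: settings such as hbase.client.scanner.caching, hbase.rpc.timeout, and hbase.client.scanner.timeout.period are client-side, so they only take effect if the configuration the Spark job builds also carries them, not just the region servers' hbase-site.xml. A minimal sketch of mirroring them programmatically, assuming the job constructs its own HBaseConfiguration:)

import org.apache.hadoop.hbase.HBaseConfiguration

// Illustrative snippet: make the client-side tuning visible to the Spark
// job itself. HBaseConfiguration.create() only picks up hbase-site.xml if
// that file is on the driver/executor classpath, so set the values here
// explicitly when it is not.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.client.scanner.caching", "10000")
hbaseConf.set("hbase.rpc.timeout", "7200000")
hbaseConf.set("hbase.client.scanner.timeout.period", "7200000")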
I would like to know what else can be tuned to improve the query response time, which is currently more than 10 minutes.
The query is given below:
select catColumn, sum(amount) from "BigTable" where timeStamp between <startTime> and <endTime> group by catColumn;
However, the consolidation (the sum and group-by) is done in Spark, using DataFrame functions.
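For reference, here is a minimal sketch of how that kind of read-plus-aggregation can be expressed with the phoenix-spark plugin (not my exact job; the zkUrl and the startTime/endTime bounds are placeholders, and since Phoenix upper-cases unquoted identifiers, the column names may need adjusting):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

object BigTableAggregation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BigTableAggregation"))
    val sqlContext = new SQLContext(sc)

    // Placeholder time bounds; substitute the real epoch-millis values.
    val startTime = 0L
    val endTime = Long.MaxValue

    // Load the Phoenix table as a DataFrame via the phoenix-spark plugin.
    // "master:2181" stands in for the real ZooKeeper quorum.
    val df = sqlContext.read
      .format("org.apache.phoenix.spark")
      .option("table", "BigTable")
      .option("zkUrl", "master:2181")
      .load()

    // Filter on the timestamp range, then aggregate per category in Spark.
    val result = df
      .filter(df("timeStamp").between(startTime, endTime))
      .groupBy("catColumn")
      .agg(sum("amount"))

    result.show()
  }
}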