I am benchmarking HBase and Spark performance and am stuck at a point where Spark takes more than 10 minutes to read and aggregate 1 million rows out of the 97 million present. Details of the configuration and data are below.
Hardware:
Total: 4 servers
1 master
3 slaves
Each server's configuration:
CPU: Intel i5 processor
RAM: 16 GB
HDD: 1 TB (not SSD)
Software:
HBase - 1.2.1
Spark - 1.6.2
Phoenix - 4.8.1
Hadoop - 2.6
Data size:
Total regions: 20
Total data size: 25 GB
Total rows: 97 million
We have applied the performance-tuning settings I gathered from the HBase reference guide and other online sources; the important entries from hbase-site.xml are below:
<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value>
</property>
<property>
  <name>hbase.client.scanner.caching</name>
  <value>10000</value>
</property>
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>30</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.3</value>
</property>
<property>
  <name>hbase.storescanner.parallel.seek.enable</name>
  <value>true</value>
</property>
<property>
  <name>hbase.storescanner.parallel.seek.threads</name>
  <value>20</value>
</property>
<property>
  <name>hbase.regionserver.wal.codec</name>
  <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
</property>
<property>
  <name>phoenix.query.timeoutMs</name>
  <value>7200000</value>
</property>
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>10737418240</value>
</property>
<property>
  <name>hbase.thrift.connection.max-idletime</name>
  <value>1800000</value>
</property>
<property>
  <name>hbase.client.write.buffer</name>
  <value>20971520</value>
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>7200000</value>
</property>
<property>
  <name>phoenix.schema.dropMetaData</name>
  <value>false</value>
</property>
<property>
  <name>phoenix.query.keepAliveMs</name>
  <value>7200000</value>
</property>
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>7200000</value>
</property>
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>7200000</value>
</property>
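(Note: settings such as hbase.client.scanner.caching, hbase.rpc.timeout, and hbase.client.scanner.timeout.period are client-side, so they only take effect if the configuration the Spark job builds also carries them, not just the region servers' hbase-site.xml. A minimal sketch of mirroring them programmatically, assuming the job constructs its own HBaseConfiguration:)

import org.apache.hadoop.hbase.HBaseConfiguration

// Illustrative snippet: make the client-side tuning visible to the Spark
// job itself. HBaseConfiguration.create() only picks up hbase-site.xml if
// that file is on the driver/executor classpath, so set the values here
// explicitly when it is not.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.client.scanner.caching", "10000")
hbaseConf.set("hbase.rpc.timeout", "7200000")
hbaseConf.set("hbase.client.scanner.timeout.period", "7200000")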
I would like to know what else can be tuned to improve the query response time, which is currently more than 10 minutes.
The query is given below:
select catColumn, sum(amount) from "BigTable" where timeStamp between <startTime> and <endTime> group by catColumn;
However, the consolidation (the sum and group-by) is done in Spark, using DataFrame functions.
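For reference, here is a minimal sketch of how that kind of read-plus-aggregation can be expressed with the phoenix-spark plugin (not my exact job; the zkUrl and the startTime/endTime bounds are placeholders, and since Phoenix upper-cases unquoted identifiers, the column names may need adjusting):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

object BigTableAggregation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BigTableAggregation"))
    val sqlContext = new SQLContext(sc)

    // Placeholder time bounds; substitute the real epoch-millis values.
    val startTime = 0L
    val endTime = Long.MaxValue

    // Load the Phoenix table as a DataFrame via the phoenix-spark plugin.
    // "master:2181" stands in for the real ZooKeeper quorum.
    val df = sqlContext.read
      .format("org.apache.phoenix.spark")
      .option("table", "BigTable")
      .option("zkUrl", "master:2181")
      .load()

    // Filter on the timestamp range, then aggregate per category in Spark.
    val result = df
      .filter(df("timeStamp").between(startTime, endTime))
      .groupBy("catColumn")
      .agg(sum("amount"))

    result.show()
  }
}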