0

I am using apache hbase (version 1.0.0) and phoenix (version 4.6) deployed through cloudera. Since my aggregations with group by query is slow, I want to try out disabling the block cache for a particular hbase table. I tried a couple of approaches but couldn't succeed. I am verifying if the block cache is enabled/disabled through the hbase shell - describe 'my_table' command.

  1. Set 'hfile.block.cache.size' property from cloudera console to 0
  2. alter 'myTable', CONFIGURATION => {NAME => 'myColumnFamily', BLOCKCACHE => 'false'}

Even after the above two steps, I get BLOCKCACHE => 'true' in the describe command output

COLUMN FAMILIES DESCRIPTION
{NAME => '0', DATA_BLOCK_ENCODING => 'FAST_DIFF', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '1', MIN_VERSIONS => '0', TTL => 'FOREVER',
KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}

What could be going wrong? Any suggestions?

Thank you.

Andy Dufresne
  • 6,022
  • 7
  • 63
  • 113

3 Answers3

1

Instead of

alter 'myTable', CONFIGURATION => {NAME => 'myColumnFamily', BLOCKCACHE => 'false'}

Try doing:

alter 'myTable', {NAME => 'myColumnFamily', BLOCKCACHE => 'false'}
0

The minimum of phoenix.query.maxGlobalMemoryPercentage and phoenix.query.maxGlobalMemorySize control the max memory for intermediate results when doing group-by aggregation. Modifying this limit (default 15%) may improve aggregation performance.

kliew
  • 3,073
  • 1
  • 14
  • 25
  • Theoretically, does it make sense to have block cache enabled for aggregations and group by queries? Assuming it would lead to a full table scan, I think it would hamper the performance if block cache is enabled. Consider a rollup query like - SELECT SUM(UNIT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY COUNTRY; What do you think? – Andy Dufresne Mar 25 '16 at 08:01
0

I understand the problem more clearly now. Phoenix implements coprocessors in HBase to execute queries in parallel on HBase servers.

If the query plan consists of only a single table scan, there may be no need to cache (maxGlobalMemory cache). But Phoenix does not currently support ROLLUP. For a GROUP BY, Phoenix will store the intermediate Map (returned by the coprocessors) in memory or spill to disk (behaviour defined by phoenix.query.maxGlobalMemory) before producing the final result. It is better to have it in memory. The HBase block cache settings take effect as the coprocessor executes on each HBase node (since you only expect to read this table once, it makes sense to disable block cache for these column families to reduce garbage collection overhead and eviction of useful blocks).

Thus the two settings are complementary. You may wish to disable HBase block cache while tuning the maxGlobalMemory cache.

Regarding the settings not taking effect, http://hbase.apache.org/book.html#config.files

Currently, changes here will require a cluster restart for HBase to notice the change.

kliew
  • 3,073
  • 1
  • 14
  • 25
  • I found the way to disable to block cache. By executing the alter table query in phoenix I was able to do so. I tried the maxGlobalMemory setting but that didn't make any difference in the query timings. Btw, you mention about ROLLUP, How different is it to support ROLLUP clause vs this query "SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;" . This query is rolling up the sales by country right? – Andy Dufresne Mar 29 '16 at 05:04
  • I misunderstood your earlier mention about ROLLUP. There may be queries that generate the same query/execution plan as a ROLLUP statement, but Phoenix does not support the keyword ROLLUP in queries, which is what I was referring to. But I agree, disabling block cache seems appropriate. Did it improve performance in this case? – kliew Mar 29 '16 at 09:41