0

I am using Apache Phoenix to run some queries but their performance look bad compared to what I was expecting. As an example, considering a table like:

CREATE TABLE MY_SHORT_TABLE ( 
MPK BIGINT not null,
... 38 other columns ...
CONSTRAINT pk PRIMARY KEY (MPK, 4 other columns))
SALT_BUCKETS = 4;

which has arround 460000 lines, a query like:

select sum(MST.VALUES),
           MST.III, MST.BBB, MST.DDD, MST.FFF,
           MST.AAA, MST.CCC, MST.EEE, MST.HHH
      from 
           MY_SHORT_TABLE MST
      group by 
           MST.AAA, MST.BBB, MST.CCC, MST.DDD, 
           MST.EEE, MST.FFF, MST.HHH, MST.III

is taking arround 9-11 seconds to complete. In a table with a similar structure but with near 3 400 000 lines, it takes arroud 45 seconds to complete the query.

I have 5 hosts (1 Master and 4 RegionServer+PhoenixQS) in this cluster with 6 vCPU and 32GB RAM.

The configurations I am using at in this example are:

HBase RegionServer Maximum Memory=8192(8GB)
HBase Master Maximum Memory=8192(8GB)
Number of Handlers per RegionServer=30
Memstore Flush Size=128MB
Maximum Record Size=1MB
Maximum Region File Size=10GB
% of RegionServer Allocated to Read Buffers=40%
% of RegionServer Allocated to Write Buffers=40%

HBase RPC Timeout=6min
Zookeeper Session Timeout=6min
Phoenix Query Timeout=6min

Number of Fetched Rows when Scanning from Disk=1000
dfs.client.read.shortcircuit=true
dfs.client.read.shortcircuit.buffer.size=131072
phoenix.coprocessor.maxServerCacheTimeToLiveMs=30000

I am using HDP 2.4.0, so Phoenix 4.4.

The example query explain is the following:

+------------------------------------------+
|                   PLAN                   |
+------------------------------------------+
| CLIENT 8-CHUNK PARALLEL 8-WAY FULL SCAN OVER MY_SHORT_TABLE |
|     SERVER AGGREGATE INTO DISTINCT ROWS BY [AAA, BBB, CCC, DDD, EEE, FFF, HHH |
| CLIENT MERGE SORT                        |
+------------------------------------------+

Also, I have created an index as:

CREATE INDEX i1DENORM2T1 ON MY_SHORT_TABLE (HHH) 
INCLUDE ( AAA, BBB, CCC, DDD, EEE, FFF, HHH, VALUES ) ;

This index changes the query execution plan to:

+------------------------------------------+
|                   PLAN                   |
+------------------------------------------+
| CLIENT 4-CHUNK PARALLEL 4-WAY FULL SCAN OVER I1DENORM2T1 |
|     SERVER AGGREGATE INTO DISTINCT ROWS BY ["AAA", "BBB", "DDD", "EEE", "FFF", "HHH |
| CLIENT MERGE SORT                        |
+------------------------------------------+

However the performance do not match the expectations (arround 3-4 seconds).

What is wrong in the above configs or what should I change in order to get a better performance?

Thanks in advance.

ssobreiro
  • 11
  • 1
  • 6
  • you might want to look at query hints and `explain sql-query` to determine what's going on with the query. – Paul Bastide Jul 25 '17 at 13:08
  • Hi Paul. Regarding hints, the idea is to avoid them... it's a limitation I have. I will edit my question to add the explaination of the example query. Thanks – ssobreiro Jul 25 '17 at 16:29
  • interesting that you are using a mergesort instead of a hash-join, https://phoenix.apache.org/joins.html -- you are the extend of my knowledge on indexing in phoenix – Paul Bastide Jul 26 '17 at 11:45
  • Hi. Yes, somehow it is using the merge-sort even without the hint. I tried to do the exact same query adding a filter for a specific value in the, lets say, MST.AAA column. This MST.AAA is a PK column and the query plan changed to a range scan over the index (which makes sense). This way the query returned in 0.2 seconds! Is there any other way to get the data this fast without this filtering? – ssobreiro Jul 26 '17 at 13:34

0 Answers0