How to speed up a slow multi-join query in a small Vertica database (~120K total rows, 10 minutes)

Question

I'd love to get your help understanding why this join-heavy query takes about 10 minutes to run on a small database of seven tables totaling < 120K rows, and ideally get your suggestions on how to make it faster on our little cluster of four nodes. I've put supporting information here: https://gist.github.com/anonymous/8862796 (list of tables, list of fields by table, and table sizes), but following is the query and the EXPLAIN VERBOSE output. I ran ANALYZE_WORKLOAD() on this query and then followed its suggestion to run ANALYZE_STATISTICS on all tables. This resulted in no improvement. I then did its second suggestion of running the Database Designer, which resulted in even slower performance. I'd very much appreciate your help.

Profile information

Thanks for the tips below re: PROFILE. I ran it and put the results here: https://gist.github.com/anonymous/8935190 . It is 8K lines long, so perhaps I didn't run it correctly (details in the gist). Question: How do I start analyzing it?

Query backstory

The query is messy mainly because it's dynamically generated for each run of our machine learning research software, which has to walk paths graph-style through the E-R tables involved, applying various conditions. In this case the path is [rates, movie, rates, ml_user, rates, movie, rates]. Queries are progressively built up during the program's exploration of a solution space, which is why it (currently) lacks the optimizations that @wumpz and @Bohemian have kindly and rightly suggested below, e.g., eliminating subselects. This means I'm kinda' stuck with its current form for the short term :-/

 ------------------------------ 
 QUERY PLAN DESCRIPTION: 
 ------------------------------

 Opt Vertica Options
 --------------------
 PLAN_OUTPUT_SUPER_VERBOSE


 EXPLAIN VERBOSE
 SELECT relVarTable0.id AS id, relVarTable1.val, relVarTable2.val 
 FROM (SELECT id FROM rates) relVarTable0
 LEFT JOIN
 (SELECT rates1.id AS id, AVG(rates4.rating) AS val 
 FROM rates rates1, movie movie1, rates rates2, ml_user ml_user1, rates rates3, movie movie2, rates rates4 
 WHERE movie1.id = rates1.movie_id AND movie1.id = rates2.movie_id AND ml_user1.id = rates2.ml_user_id AND ml_user1.id = rates3.ml_user_id AND movie2.id = rates3.movie_id AND movie2.id = rates4.movie_id AND movie1.id <> movie2.id AND rates1.id <> rates2.id AND rates2.id <> rates3.id AND rates3.id <> rates4.id AND rates4.rating IS NOT NULL 
 GROUP BY rates1.id) relVarTable1
  ON relVarTable0.id = relVarTable1.id
 LEFT JOIN
 (SELECT rates1.id AS id, rates1.rating AS val 
 FROM rates rates1 
 WHERE rates1.rating IS NOT NULL ) relVarTable2
  ON relVarTable0.id = relVarTable2.id;

 Access Path:
 Sort Key: (V(1,1))
  LDISTRIB_UNSEGMENTED
 +-JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 4489.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 5441368.000000 Memory(B): 1209184.000000 Netwrk(B): 1209184.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 40] (PATH ID: 1) Inner (RESEGMENT)
 |  Join Cond: (relVarTable0.id = relVarTable2.id)
 |  Execute on: All Nodes
 |  Sort Key: (V(1,1))
 |  LDISTRIB_UNSEGMENTED
 | +-- Outer -> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 4197.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 1369200.000000 Memory(B): 0.000000 Netwrk(B): 604600.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 2) Outer (RESEGMENT)
 | |      Join Cond: (relVarTable0.id = relVarTable1.id)
 | |      Execute on: All Nodes
 | |      Sort Key: (V(1,1))
 | |      LDISTRIB_UNSEGMENTED
 | | +-- Outer -> SELECT [Cost: 20.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 3)
 | | |      Execute on: All Nodes
 | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | |      LDISTRIB_UNSEGMENTED
 | | | +---> STORAGE ACCESS for rates [Cost: 20.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 4)
 | | | |      Column Cost Aspects: [ Disk(B): 196608.000000 CPU(B): 0.000000 Memory(B): 604600.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | |      Projection: movielens_test.rates_b0
 | | | |      Materialize: rates.id
 | | | |      Execute on: All Nodes
 | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | |      LDISTRIB_SEGMENTED
 | | +-- Inner -> SELECT [Cost: 4067.000000, Rows: 10000.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 5)
 | | |      Execute on: All Nodes
 | | |      Sort Key: (rates.id)
 | | |      LDISTRIB_UNSEGMENTED
 | | | +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 4067.000000, Rows: 10000.000000 Disk(B): 0.000000 CPU(B): 6650600.000000 Memory(B): 640000.000000 Netwrk(B): 6890600.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 6)
 | | | |      Aggregates: sum_float(<SVAR>), count(<SVAR>)
 | | | |      Group By: rates1.id
 | | | |      Execute on: All Nodes
 | | | |      Sort Key: (rates.id)
 | | | |      LDISTRIB_SEGMENTED
 | | | | +---> JOIN HASH [Cost: 2869.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 12091944.000000 Memory(B): 3022960.000000 Netwrk(B): 1813776.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 88] (PATH ID: 7) Inner (RESEGMENT)
 | | | | |      Join Cond: (movie2.id = rates4.movie_id)
 | | | | |      Join Filter: (rates3.id <> rates4.id)
 | | | | |      Execute on: All Nodes
 | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | |      LDISTRIB_UNSEGMENTED
 | | | | | +-- Outer -> JOIN HASH [Cost: 2395.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 9110592.000000 Memory(B): 41592.000000 Netwrk(B): 4246064.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 64] (PATH ID: 8) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
 | | | | | |      Join Cond: (movie2.id = rates3.movie_id)
 | | | | | |      Join Filter: (movie1.id <> movie2.id)
 | | | | | |      Execute on: All Nodes
 | | | | | |      Runtime Filter: (SIP1(HashJoin): movie2.id)
 | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | +-- Outer -> JOIN HASH [Cost: 1625.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 10278200.000000 Memory(B): 3023000.000000 Netwrk(B): 1813800.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 56] (PATH ID: 9) Inner (RESEGMENT)
 | | | | | | |      Join Cond: (ml_user1.id = rates3.ml_user_id)
 | | | | | | |      Join Filter: (rates2.id <> rates3.id)
 | | | | | | |      Execute on: All Nodes
 | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | |      LDISTRIB_UNSEGMENTED
 | | | | | | | +-- Outer -> JOIN HASH [Cost: 1163.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 5582544.000000 Memory(B): 141144.000000 Netwrk(B): 2465448.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 40] (PATH ID: 10) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
 | | | | | | | |      Join Cond: (ml_user1.id = rates2.ml_user_id)
 | | | | | | | |      Execute on: All Nodes
 | | | | | | | |      Runtime Filter: (SIP2(HashJoin): ml_user1.id)
 | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | +-- Outer -> JOIN HASH [Cost: 711.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 8464400.000000 Memory(B): 2418400.000000 Netwrk(B): 1813800.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 32] (PATH ID: 11) Outer (RESEGMENT)(LOCAL ROUND ROBIN)
 | | | | | | | | |      Join Cond: (movie1.id = rates2.movie_id)
 | | | | | | | | |      Join Filter: (rates1.id <> rates2.id)
 | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | +-- Outer -> STORAGE ACCESS for rates2 [Cost: 59.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 12)
 | | | | | | | | | |      Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 0.000000 Memory(B): 1813800.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | | |      Projection: movielens_test.rates_b0
 | | | | | | | | | |      Materialize: rates2.id, rates2.ml_user_id, rates2.movie_id
 | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | +-- Inner -> JOIN HASH [Cost: 268.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 3064592.000000 Memory(B): 41592.000000 Netwrk(B): 1223064.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 13) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
 | | | | | | | | | |      Join Cond: (movie1.id = rates1.movie_id)
 | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | | +-- Outer -> STORAGE ACCESS for rates1 [Cost: 39.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 14)
 | | | | | | | | | | |      Column Cost Aspects: [ Disk(B): 393216.000000 CPU(B): 0.000000 Memory(B): 1209200.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | | | |      Projection: movielens_test.rates_b0
 | | | | | | | | | | |      Materialize: rates1.id, rates1.movie_id
 | | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | | +-- Inner -> STORAGE ACCESS for movie1 [Cost: 5.000000, Rows: 1733.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 15)
 | | | | | | | | | | |      Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 13864.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | | | |      Projection: movielens_test.movie_b0
 | | | | | | | | | | |      Materialize: movie1.id
 | | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | | |      Sort Key: (movie.id, movie.title, movie.year, movie.imdb_id, movie.rotten_tomatoes_id, movie.rotten_tomatoes_critic_score, movie.rotten_tomatoes_audience_score, movie.budget, movie.gross, movie.mpaa_rating, movie.runtime, movie.action, movie.adventure, movie.animation, movie.childrens, movie.comedy, movie.crime, movie.documentary, movie.drama, movie.fantasy, movie.film_noir, movie.horror, movie.musical, movie.mystery, movie.romance, movie.sci_fi, movie.thriller, movie.war, movie.western, movie.is_usa, movie.num_actors, movie.num_ratings)
 | | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | +-- Inner -> STORAGE ACCESS for ml_user1 [Cost: 5.000000, Rows: 5881.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 16)
 | | | | | | | | |      Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 47048.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | |      Projection: movielens_test.ml_user_b0
 | | | | | | | | |      Materialize: ml_user1.id
 | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | |      Sort Key: (ml_user.id, ml_user.gender, ml_user.age_range, ml_user.occupation, ml_user.zipcode, ml_user.num_ratings)
 | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | +-- Inner -> STORAGE ACCESS for rates3 [Cost: 59.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 17)
 | | | | | | | |      Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 0.000000 Memory(B): 1813800.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | |      Projection: movielens_test.rates_b0
 | | | | | | | |      Materialize: rates3.id, rates3.ml_user_id, rates3.movie_id
 | | | | | | | |      Execute on: All Nodes
 | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | +-- Inner -> STORAGE ACCESS for movie2 [Cost: 5.000000, Rows: 1733.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 18)
 | | | | | | |      Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 13864.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | |      Projection: movielens_test.movie_b0
 | | | | | | |      Materialize: movie2.id
 | | | | | | |      Execute on: All Nodes
 | | | | | | |      Sort Key: (movie.id, movie.title, movie.year, movie.imdb_id, movie.rotten_tomatoes_id, movie.rotten_tomatoes_critic_score, movie.rotten_tomatoes_audience_score, movie.budget, movie.gross, movie.mpaa_rating, movie.runtime, movie.action, movie.adventure, movie.animation, movie.childrens, movie.comedy, movie.crime, movie.documentary, movie.drama, movie.fantasy, movie.film_noir, movie.horror, movie.musical, movie.mystery, movie.romance, movie.sci_fi, movie.thriller, movie.war, movie.western, movie.is_usa, movie.num_actors, movie.num_ratings)
 | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | +-- Inner -> STORAGE ACCESS for rates4 [Cost: 60.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PUSHED GROUPING) Partial GroupBy: rates4.movie_id,rates4.id Partial Aggs: sum_float(<SVAR>),count(<SVAR>) (PATH ID: 19)
 | | | | | |      Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 196608.000000 Memory(B): 1813784.000212 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | |      Projection: movielens_test.rates_b0
 | | | | | |      Materialize: rates4.rating, rates4.id, rates4.movie_id
 | | | | | |      Filter: (rates4.rating IS NOT NULL)/* sel=0.999974 ndv= 500 */
 | | | | | |      Execute on: All Nodes
 | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | |      LDISTRIB_SEGMENTED
 | +-- Inner -> SELECT [Cost: 41.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 20)
 | |      Execute on: All Nodes
 | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | |      LDISTRIB_UNSEGMENTED
 | | +---> STORAGE ACCESS for rates1 [Cost: 41.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 21)
 | | |      Column Cost Aspects: [ Disk(B): 393216.000000 CPU(B): 196608.000000 Memory(B): 1209184.000212 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | |      Projection: movielens_test.rates_b0
 | | |      Materialize: rates1.rating, rates1.id
 | | |      Filter: (rates1.rating IS NOT NULL)/* sel=0.999974 ndv= 500 */
 | | |      Execute on: All Nodes
 | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | |      LDISTRIB_SEGMENTED


 ------------------------------

I think your problem comes from your use of subselects. Why don't you use e.g. "select .. from rates .." instead of "select .. from (select id from rates) ..."? You are making this statement very hard for your database system to optimize. — wumpz, Feb 10 '14 at 17:48
Have you run database designer on these tables and/or fed this query into DBD? You may also want to look into pre-join projections. — Kermit, Feb 10 '14 at 18:13
i see nothing heavy !! kkkk pass me the data load and i drop this query to a 1 sec in no time ! — Up_One, Feb 10 '14 at 20:21
Thanks all - I've added some history about why the query looks like it does. — Matthew Cornell, Feb 11 '14 at 12:56
@FreshPrinceOfSO Thank you. I ran Database Designer before I posted, which actually slowed down the query. I've started diving into understanding Vertica projections to see what's going on. — Matthew Cornell, Feb 11 '14 at 12:57
Your other option is to PROFILE the query. This will break down each step for you based on the execution plan paths and should help you identify the bottleneck. — Kermit, Feb 11 '14 at 13:00
Another comment: you may need to denormalize your schema to get better performance, since Vertica is not optimized for a fully relational schema. — Kermit, Feb 11 '14 at 13:06
FreshPrinceOfSO (and @Up_One below) Thanks for the pointer to PROFILE! I thought it was a general term meaning examining EXPLAIN output. I'll post the results. — Matthew Cornell, Feb 11 '14 at 13:10

score 7 · Answer 1 · edited Jul 07 '15 at 15:17

First, I see too many NO STATISTICS entries in your explain plan. That's a bad sign and you should fix it.

See the order of your tables in the join? A hash join was created and you're doing a full table scan on your biggest table. Fix this by arranging the hash join as (small table JOIN big table), not (big table JOIN small table).

- Run DBD
- Run analyze
- Run an explain on it and make sure projections are used to answer your query
- Check whether your movielens_test.rates table can be partitioned
- If you are running on a single node, MPP won't be used, and MPP is a big win
- Run profile on your query and post it
- Make sure the right data encodings are applied in your DDL, and that the order of your ORDER BY columns suits the predicates you use (I think DBD does this, but I always check)
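
The analyze, check, and profile steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive procedure: the schema and table names are taken from the plan in the question, the transaction_id and statement_id values are placeholders that must be replaced with the ones PROFILE prints, and the column and counter names assume the standard v_catalog.projection_columns and v_monitor.execution_engine_profiles system tables as they exist in recent Vertica versions.

```sql
-- 1. Refresh optimizer statistics for every table on the join path.
SELECT ANALYZE_STATISTICS('movielens_test.rates');
SELECT ANALYZE_STATISTICS('movielens_test.movie');
SELECT ANALYZE_STATISTICS('movielens_test.ml_user');

-- 2. Check that statistics actually landed on the projections the plan
--    uses, and review sort order and encodings at the same time.
SELECT projection_name,
       projection_column_name,
       sort_position,
       encoding_type,
       statistics_type,
       statistics_updated_timestamp
FROM v_catalog.projection_columns
WHERE table_schema = 'movielens_test'
  AND table_name IN ('rates', 'movie', 'ml_user')
ORDER BY projection_name, column_position;

-- 3. Re-run the slow statement with the PROFILE keyword in front of it:
--      PROFILE SELECT relVarTable0.id AS id, ... ;
--    Vertica prints a NOTICE with the transaction_id and statement_id.

-- 4. Summarize the profile per plan path instead of reading it raw.
--    The two ids below are PLACEHOLDERS, not real values.
SELECT path_id,
       operator_name,
       counter_name,
       SUM(counter_value) AS total
FROM v_monitor.execution_engine_profiles
WHERE transaction_id = 0   -- placeholder: use the id printed by PROFILE
  AND statement_id = 1     -- placeholder: use the id printed by PROFILE
  AND counter_name IN ('execution time (us)', 'rows produced')
GROUP BY path_id, operator_name, counter_name
ORDER BY counter_name, total DESC;
```

The path_id values in the last result correspond to the PATH ID labels in the EXPLAIN output, so the paths with the largest execution time or row counts are the ones to look at first.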

As a final note, something I always do:

Open your database log and watch it as you run your query. If data is being spilled to disk, that may be your problem: it means your sort data is larger than your allocated memory.

Another option is to create a pre-join projection on your first sub-query, but only if your data does not change much, as pre-join projections make data loading very slow.

answered Feb 10 '14 at 19:59 by Up_One · edited Jul 07 '15 at 15:17 by SuperBiasedMan

I will definitely follow your thoughtful steps. I ran ANALYZE_STATISTICS before I posted, which made no improvement. Like you, I thought to check the plan for that, but it still says NO STATISTICS. I don't understand that. I wonder if running statistics has to be done post-DBD. Re: partitioning, the queries need the whole table for the join, so I don't see the advantage. Re: Single-node, ours is four, not one. I'll check the encodings - also new to me. Re: log, do you mean files like vertica_odbc_conn_1.log? I have started playing with pre-join projections - good tip. – Matthew Cornell Feb 11 '14 at 12:58
You have the option of having DBD run the analyze for you when you run it. If you run DBD without analyze, make sure you run analyze after you create your projections! – Up_One Feb 11 '14 at 19:46
As for the log, I mean "catalog-path/database-name/node-name_catalog/vertica.log", so you can see its actions! – Up_One Feb 11 '14 at 19:51
Can you post the explain plan as well? – Up_One Mar 14 '14 at 18:43

score 1 · Answer 2 · answered Oct 06 '16 at 14:44

Materialize: rates3.id, rates3.ml_user_id, rates3.movie_id -- I see this materialize step happening on these columns.

The runtime filters are also not good; I can see many, for example:

Runtime Filter: (SIP2(HashJoin): ml_user1.id)

Try to use this column in your WHERE clause, and check which projections are being used by the query (you can check this in the projection_usage table).
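
One way to run the projection check mentioned above is sketched here. It assumes the standard v_monitor.projection_usage system table and the movielens_test schema name taken from the plan in the question; the grouping is only one possible way to summarize the rows.

```sql
-- Count how often each projection has been read by recent queries.
-- A projection that never appears here is not being picked by the optimizer.
SELECT anchor_table_name,
       projection_name,
       io_type,
       COUNT(*) AS uses
FROM v_monitor.projection_usage
WHERE anchor_table_schema = 'movielens_test'
GROUP BY anchor_table_name, projection_name, io_type
ORDER BY uses DESC;
```

To look at a single run, the same table can be filtered on the transaction_id and statement_id of that statement.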

If you are joining two tables, please ensure the projections are designed and segmented by hash on high-cardinality columns, and try to use those columns in your GROUP BY (if any).
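
As a rough illustration of a projection segmented by hash on a join key, the sketch below adds one extra projection on rates and a replicated copy of the small movie table. The projection names rates_by_movie and movie_replicated are hypothetical, the column list comes from the sort keys shown in the plan, and KSAFE 1 is an assumption inferred from the _b0 projection suffix. Note that movie_id has far fewer distinct values than id, so it is worth checking for skew before relying on it.

```sql
-- Hypothetical projection: rates segmented and sorted on the join key,
-- so rows with the same movie_id live on the same node.
CREATE PROJECTION movielens_test.rates_by_movie AS
SELECT id, ml_user_id, movie_id, rating
FROM movielens_test.rates
ORDER BY movie_id, id
SEGMENTED BY HASH(movie_id) ALL NODES KSAFE 1;  -- KSAFE 1 is an assumption

-- Hypothetical projection: a full copy of the small movie key column on
-- every node, so joining to it needs no resegmentation.
CREATE PROJECTION movielens_test.movie_replicated AS
SELECT id
FROM movielens_test.movie
ORDER BY id
UNSEGMENTED ALL NODES;

-- Populate the new projections, then refresh statistics.
SELECT REFRESH('movielens_test.rates, movielens_test.movie');
SELECT ANALYZE_STATISTICS('movielens_test.rates');
SELECT ANALYZE_STATISTICS('movielens_test.movie');
```

After this, re-running EXPLAIN shows whether the optimizer picks the new projections and whether the RESEGMENT annotations on the joins go away.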

If you see any RESEGMENTED_MANY_ROWS events, please use columns of the same data type in your predicate/WHERE clause, e.g. e.empid (number) = f.empid (number); the data type can be integer but can't be float (just an example).

When you run this query, please take the transaction id and statement id and run the query below:

select * from query_profiles where transaction_id = <transaction_id> and statement_id = <statement_id>;
