
I would like to do colocated joins on two tables in SnappyData, and to further speed up the join, would it help if I also created indexes on the joining columns of the two tables?

More specifically, the two tables will be quite large. Ideally, the join would be a pair-wise partitioned join at the large scale, and within each pair of partitions an index nested loop join could be used instead of a naive nested loop join.

I wasn't able to find an example or tutorial for this, so any explanation or pointers would be greatly appreciated.

Thanks in advance!

Update:

The two tables are large in terms of number of rows, but have very few columns (3-4 columns, all integer types):

`Table1(Col_A, Col_B), Table2(Col_B, Col_C)`,

and I would like to join Table1 & Table2 on Col_B to get result like

`Table3(Col_A, Col_B, Col_C)`,

thus I would prefer to horizontally partition the two tables (using row tables) on the joining column Col_B, instead of using column tables, and to use a colocated join to reduce data shuffling.

Even after partitioning, the partitions might still be too large, so I'm wondering whether I can create an index on Col_B within each partition independently and use it for an index join. It seems that in SnappyData I can only create an index on a whole column of a table, not on each partition independently.
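To make the question concrete, here is a sketch of the row-table setup I have in mind (column types are assumed; the syntax follows SnappyData's `USING row OPTIONS (...)` clause, and `CREATE INDEX` is declared at the table level, which is exactly the limitation I'm asking about):

```sql
-- Illustrative schema for the two tables described above.
CREATE TABLE Table1 (Col_A INT, Col_B INT)
  USING row OPTIONS (PARTITION_BY 'Col_B');

CREATE TABLE Table2 (Col_B INT, Col_C INT)
  USING row OPTIONS (PARTITION_BY 'Col_B', COLOCATE_WITH 'Table1');

-- Indexes can only be declared on the table as a whole,
-- not per partition:
CREATE INDEX idx_t1_b ON Table1 (Col_B);
CREATE INDEX idx_t2_b ON Table2 (Col_B);
```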

user3230153

1 Answer


You said the data is large, so it is best to use column tables. Have you tried creating the tables with the `PARTITION_BY`/`COLOCATE_WITH` clauses?

Example:

Table 1: `create table T1 (a1 int, a2 int, ...) using column options (partition_by 'a1')`

Table 2: `create table T2 (a1 int, ax int, ay int, ...) using column options (partition_by 'a1', colocate_with 'T1')`

Then, try your query and check the SQL physical plan in the UI (the join should NOT involve a shuffle). Let us know how it goes.
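You can also inspect the plan from SQL by prefixing the query with `EXPLAIN` (a sketch; table and column names match the example above, and the exact plan text depends on your SnappyData version):

```sql
-- Join on the partitioning column; with colocation the plan
-- should show no Exchange (shuffle) operator above the scans.
EXPLAIN SELECT t1.a2, t2.ax
FROM T1 t1 JOIN T2 t2 ON t1.a1 = t2.a1;
```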

jagsr