I am joining two tables with a large number of rows (currently 100M - 1B) in SnappyData, running on one server with 64 CPU cores and 512GB of memory, and would like to take advantage of the colocated join. However, the description in the docs seems to imply that the colocated join happens at the per-node level.
What I actually need is something like a per-bucket-level colocated join (i.e. a partitioned join), and I am not utilizing the server to the fullest: total CPU utilization is about 10% or less most of the time.
For the join I am using RowStore and SQL statements. I set up SnappyData using the single-node setup script (snappy-start-all.sh), with one lead, one server, and one locator, with some customization to use more memory and CPU cores:
conf/leads
localhost -locators=localhost:9999 -heap-size=200g -spark.executor.cores=30 -dir=/snappydata/lead1
conf/servers
localhost -locators=localhost:9999 -heap-size=200g -spark.executor.cores=30 -dir=/snappydata/server1
conf/locators
localhost -peer-discovery-port=9999 -heap-size=2048m -dir=/snappydata/locator1
I wasn't able to figure out how to solve this problem, and my questions are:
- How does the colocated join work exactly? Does it happen at the per-node level, the per-bucket level, or something in between? Is there anything I can tweak here besides setting the number of buckets?
- Would creating an index over the join columns help in addition to the colocated join? (A DDL sketch of what I mean by the bucket and index tweaks is below this list.)
- Is this configuration a good fit for my workload, or should I set up multiple servers to fully utilize the resources?
- If my setup looks OK, the low CPU utilization might be due to a skewed hash partitioning scheme. Where can I tweak the partitioning scheme to avoid skewed partitions?
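To make the bucket and index questions concrete, this is roughly the DDL I have in mind; the table and index names, the column list, and the bucket count of 128 are placeholders I made up for illustration, not values I have tested:

CREATE TABLE some_table (k1 INTEGER, k2 INTEGER, v1 INTEGER)
USING ROW OPTIONS (PARTITION_BY 'k1, k2', BUCKETS '128');

CREATE INDEX some_table_key_idx ON some_table (k1, k2);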
Any information or pointers on any of the above questions (sorry about asking so many questions in one post) would be greatly appreciated :)
UPDATE:
The two row tables look like this (all columns are of INTEGER type):
CREATE TABLE Table_A (key1 INTEGER, key2 INTEGER, value1 INTEGER, value2 INTEGER, value3 INTEGER)
USING ROW OPTIONS (PARTITION_BY 'key1, key2');

CREATE TABLE Table_B (key1 INTEGER, key2 INTEGER, value4 INTEGER, value5 INTEGER)
USING ROW OPTIONS (PARTITION_BY 'key1, key2', COLOCATE_WITH 'Table_A');
The join result will contain: Table_C(key1, key2, value1, value2, value3, value4, value5)
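The join itself is essentially of this shape (simplified):

SELECT a.key1, a.key2, a.value1, a.value2, a.value3, b.value4, b.value5
FROM Table_A a
JOIN Table_B b
ON a.key1 = b.key1 AND a.key2 = b.key2;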
key1 takes ~200 distinct values and key2 ~2M distinct values. The (key1, key2) distribution is skewed and not unique: a few pairs (<5%) are quite frequent, while the vast majority are not.
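If it helps to reason about the skew, a frequency count along these lines is what shows the hot (key1, key2) pairs (capped to the top rows in practice):

SELECT key1, key2, COUNT(*) AS cnt
FROM Table_A
GROUP BY key1, key2
ORDER BY cnt DESC;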