
I am joining two tables with a large number of rows (currently 100M - 1B) in SnappyData, configured on one server with 64 CPU cores and 512 GB of memory, and would like to utilize the colocated join. However, the description in the docs seems to imply that the colocated join happens at the per-node level.

What I actually need is something like a per-bucket-level colocated join (or partitioned join), and I am not utilizing the server to the fullest: total CPU utilization is about 10% or less most of the time.

For the join I am using the RowStore and SQL statements. I set up SnappyData using the single-node setup script (snappy-start-all.sh), with one lead, one server, and one locator, with a bit of customization to use more memory and CPU cores:

conf/leads

localhost -locators=localhost:9999 -heap-size=200g -spark.executor.cores=30 -dir=/snappydata/lead1

conf/servers

localhost -locators=localhost:9999  -heap-size=200g -spark.executor.cores=30 -dir=/snappydata/server1

conf/locators

localhost -peer-discovery-port=9999 -heap-size=2048m -dir=/snappydata/locator1

I wasn't able to figure out how to solve this problem, and my questions are:

  1. How does the colocated join work exactly? Does it happen at the per-node level, the per-bucket level, or somewhere in between? Is there anything I can tweak besides setting the number of buckets?
  2. Would creating an index over the join columns help in addition to the colocated join?
  3. Is this configuration a good fit for my workload, or should I set up multiple servers to fully utilize the resources?
  4. If my setup seems OK, the low CPU utilization might be due to a skewed hash partitioning scheme; where can I tweak the partitioning scheme to avoid skewed partitioning?

Any information or pointers on any of the above questions (sorry for asking so many questions in one post) would be greatly appreciated :)

UPDATE:

The schema of the two row tables looks like this (all columns are of Integer type):

CREATE TABLE Table_A (key1 INTEGER, key2 INTEGER, value1 INTEGER, value2 INTEGER, value3 INTEGER)
  USING ROW OPTIONS (PARTITION_BY 'key1,key2');

CREATE TABLE Table_B (key1 INTEGER, key2 INTEGER, value4 INTEGER, value5 INTEGER)
  USING ROW OPTIONS (PARTITION_BY 'key1,key2', COLOCATE_WITH 'Table_A');

The join result will contain: Table_C(key1, key2, value1, value2, value3, value4, value5)

key1 takes ~200 distinct values and key2 ~2M distinct values. The (key1, key2) pairs are skewed and not unique: a few (<5%) are quite frequent, while the vast majority are not.
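
For concreteness, the join is essentially an equijoin on both keys, i.e. something like:

SELECT a.key1, a.key2, a.value1, a.value2, a.value3, b.value4, b.value5
FROM Table_A a JOIN Table_B b
  ON a.key1 = b.key1 AND a.key2 = b.key2;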

user3230153

2 Answers


Queries that involve two or more partitioned row tables (with or without additional replicated tables) are supported only if they satisfy both of these colocation criteria:

  1. The query's WHERE clause has equijoin conditions on all of the partitioning columns for all of the partitioned tables in the query.
  2. All partitioned tables in the query are colocated.

These criteria allow SnappyData to distribute a join query to all of the servers that host the data. The query is executed concurrently on each local data store without having to move table data from one member to another to perform the join. Joins are performed on the local data set of each data store, and the main query member aggregates the results from each data store to obtain the result set.
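
For example, a query like the following over the tables from the question meets both criteria, since it has equijoin conditions on all partitioning columns (key1, key2) and Table_B is colocated with Table_A:

SELECT a.key1, a.key2, a.value1, b.value4
FROM Table_A a, Table_B b
WHERE a.key1 = b.key1 AND a.key2 = b.key2;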

Yogesh Mahajan
  • While the two points are good suggestions to improve query performance, these restrictions no longer apply in SnappyData (only in Pivotal GemFireXD). SnappyData supports full distributed joins regardless of data affinity. – jagsr Apr 19 '17 at 03:43
  1. How does the colocated join work exactly? Does it happen at the per-node level, the per-bucket level, or somewhere in between? Is there anything I can tweak besides setting the number of buckets?

The colocated join happens at the per-node level.

  2. Would creating an index over the join columns help in addition to the colocated join?

This depends on your join query.
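
If you do try one, a composite index on the join columns of one table is a cheap experiment (a sketch; the index name is arbitrary, and whether it helps depends on the plan the optimizer chooses):

CREATE INDEX idx_tableb_keys ON Table_B (key1, key2);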

  3. Is this configuration a good fit for my workload, or should I set up multiple servers to fully utilize the resources?

Setting up multiple servers should help utilize the resources better.
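
For example, on a single 64-core host you could run two or more servers by adding one line per server to conf/servers and splitting the heap and cores among them (a sketch; the exact sizing below is an assumption to be tuned):

localhost -locators=localhost:9999 -heap-size=100g -spark.executor.cores=15 -dir=/snappydata/server1
localhost -locators=localhost:9999 -heap-size=100g -spark.executor.cores=15 -dir=/snappydata/server2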

  4. If my setup seems OK, the low CPU utilization might be due to a skewed hash partitioning scheme; where can I tweak the partitioning scheme to avoid skewed partitioning?

You can start with this pointer: http://rowstore.docs.snappydata.io/docs/data_management/partitioning-about.html. You can also partition your data based on range, list, or expression, depending on the nature of your data. Go through these examples for how to achieve this: http://rowstore.docs.snappydata.io/docs/developers_guide/topics/create-partition-tables/PartitioningExamples.html
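
For instance, a range-partitioned variant of Table_A in RowStore DDL might look like the sketch below; the column to split on and the range boundaries are assumptions that would have to be tuned to your actual key distribution:

CREATE TABLE Table_A (
  key1 INTEGER, key2 INTEGER,
  value1 INTEGER, value2 INTEGER, value3 INTEGER
) PARTITION BY RANGE (key2) (
  VALUES BETWEEN 0 AND 500000,
  VALUES BETWEEN 500000 AND 1000000,
  VALUES BETWEEN 1000000 AND 2000000
);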

Yogesh Mahajan