
Let's say I have 20 SnappyData nodes.

And, I have a table like this:

example_timeseries_table

id int not null,
value varchar(128) not null,
time timestamp not null,
foo varchar(128) not null,
PARTITION BY COLUMN time

And, I make the query:

select sum(value) from example_timeseries_table where foo = 'xyz'

Does SnappyData call out to all 20 nodes, collect the results, and then return the response? If so, how could I improve the performance of this type of query? Should the table be designed differently, making "foo" the PARTITION column?
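For concreteness, here is a sketch of what the alternative design being asked about might look like, with foo as the partitioning column (the USING column / OPTIONS syntax follows SnappyData's DDL; the column types mirror the definition above):

```sql
-- Sketch: the same table, but partitioned on foo instead of time
CREATE TABLE example_timeseries_table (
  id INT NOT NULL,
  value VARCHAR(128) NOT NULL,
  time TIMESTAMP NOT NULL,
  foo VARCHAR(128) NOT NULL
) USING column OPTIONS (PARTITION_BY 'foo');
```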

Jason

2 Answers


SnappyData tries to distribute its data as uniformly as possible across all data nodes. The unit of distribution is the table bucket. So to answer your question: yes, SnappyData will call all 20 nodes to get the result. This achieves parallel processing and is good when the data size is big. However, you should configure the number of buckets for a table to a prime number near the total number of available processor cores in the cluster. (There is an existing JIRA ticket tracking the ability for this to happen automatically.)
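As a sketch, the bucket count can be set with the BUCKETS option at table creation time (23 below is just an illustrative prime chosen near a hypothetical core count, not a recommendation for this cluster):

```sql
-- Sketch: explicitly choosing a prime bucket count near the
-- total number of processor cores available in the cluster
CREATE TABLE example_timeseries_table (
  id INT NOT NULL,
  value VARCHAR(128) NOT NULL,
  time TIMESTAMP NOT NULL,
  foo VARCHAR(128) NOT NULL
) USING column OPTIONS (PARTITION_BY 'time', BUCKETS '23');
```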

Also, we are working on a feature that will do partition pruning when a partition column appears in the query predicate.

Having a partition column for a simple table scan might not be very useful. But if you plan to add join queries later, it will give a big boost to query performance, since we try to avoid a Spark shuffle when the join condition is on a partition column.
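For example, assuming example_timeseries_table were partitioned on foo, a second table partitioned on the same column can be colocated with it so the join avoids a shuffle (COLOCATE_WITH is SnappyData's option for this; the lookup table here is hypothetical):

```sql
-- Sketch: colocate a hypothetical lookup table on the same partition column
CREATE TABLE foo_lookup (
  foo VARCHAR(128) NOT NULL,
  label VARCHAR(128) NOT NULL
) USING column OPTIONS (PARTITION_BY 'foo',
                        COLOCATE_WITH 'example_timeseries_table');

-- Join on the partition column; the shuffle should be avoided
SELECT t.foo, count(*)
FROM example_timeseries_table t JOIN foo_lookup l ON t.foo = l.foo
GROUP BY t.foo;
```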

plamb

The full scan is the case today when using column tables; this is pretty much how the Spark Catalyst engine executes queries over DataFrames. Note two things:

  1. Use Row tables for such selective queries: Row tables support both partition pruning (the query is pruned to the node that manages 'foo') and indexing.
  2. Indexing for column tables is work in progress.
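The row-table alternative above can be sketched as follows, assuming the query filters on foo (USING row and CREATE INDEX follow SnappyData's DDL; the table and index names are illustrative):

```sql
-- Sketch: a row table partitioned on the filter column,
-- so the query is pruned to the node that manages 'foo'
CREATE TABLE example_timeseries_row (
  id INT NOT NULL,
  value VARCHAR(128) NOT NULL,
  time TIMESTAMP NOT NULL,
  foo VARCHAR(128) NOT NULL
) USING row OPTIONS (PARTITION_BY 'foo');

-- Row tables also support indexing on the filter column
CREATE INDEX idx_foo ON example_timeseries_row (foo);
```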
jagsr