1

I am using SnappyData with SQL to run some analysis; however, the job is slow because it involves join operations on very large input data.

I am considering partitioning the input data first, then running the jobs on the different partitions concurrently to speed up the process. But in the embedded mode I am using, my code receives the SnappySession passed in, and I can use bin/snappy-sql to query the tables, so I assume all SnappyData jobs share the same SnappySession (or the same table namespace, similar to a single database in PostgreSQL, as I understand it).

So I assume that if I submit my job using the same jar with different input arguments, the table namespace would be the same across the jobs, causing conflicts.

My question is: is it possible to have multiple SnappySessions (or multiple namespaces, like database names) that each run a series of operations independently, preferably within one SnappyData job so that I can avoid managing many jobs at the same time?
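For illustration, what I have in mind is something like the following, assuming SnappyData allows schema-qualified table names the way PostgreSQL uses separate databases (the schema and table names here are made up):

```sql
-- Hypothetical: give each run its own schema so table names don't collide
CREATE SCHEMA run1;
CREATE TABLE run1.input_data (id INT, val DOUBLE) USING column;

CREATE SCHEMA run2;
CREATE TABLE run2.input_data (id INT, val DOUBLE) USING column;
```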

user3230153
  • 123
  • 3
  • 11

1 Answer

1

I am not sure I follow the question. Maybe this will help:

When queries are submitted using snappy-sql, the shell uses JDBC to connect and run the query. Internally, Snappy will start a job and run concurrent tasks on each partition, depending on the query. And yes, this SQL session is internally associated with its own unique SnappySession (Spark session).

Or maybe you are trying to partition the data across many tables and start processing those tables independently but in parallel?
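If that is the goal, one approach (a rough sketch only, not tested against a cluster; the table and column names are made up) is to create one SnappySession per table from the shared SparkContext and run the per-table queries concurrently with Scala futures:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.SnappySession

// `session` is the SnappySession passed into the embedded job.
// Each Future creates its own SnappySession over the same SparkContext,
// so the parallel runs do not share session state.
val tables = Seq("data_part1", "data_part2", "data_part3")
val runs = tables.map { table =>
  Future {
    val s = new SnappySession(session.sparkContext)
    s.sql(
      s"SELECT a.id, count(*) FROM $table a JOIN dims d ON a.id = d.id GROUP BY a.id"
    ).collect()
  }
}
val results = Await.result(Future.sequence(runs), Duration.Inf)
```

Each query still runs its tasks in parallel across partitions; the futures just let the independent table-level jobs overlap instead of running one after another.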

jagsr
  • 535
  • 2
  • 6
  • Thanks very much for the reply. So it seems that creating multiple tables to split the data and running each SQL query in parallel is the way to go. I was hoping to reuse my existing code as-is, but I guess I'll need to modify my code now. – user3230153 Sep 22 '17 at 19:02