1

I am using SnappyData with SQL to run some analysis; however, the job is slow because it involves join operations on very large input data.

I am considering partitioning the input data first, then running the jobs on the different partitions concurrently to speed up the process. But in the embedded mode I am using, my code receives the SnappySession passed in, and I can use bin/snappy-sql to query the tables, so I assume all SnappyData jobs share the same SnappySession (or the same table namespace, similar to a single database in PostgreSQL, as I understand it).

So I assume that if I submit my job using the same jar with different input arguments, the table namespace would be the same across the jobs, causing conflicts.

My question is: is it possible to have multiple SnappySessions (or multiple namespaces, like database names) that each run a series of operations independently, preferably within one SnappyData job so that I can avoid managing many jobs at the same time?
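For illustration, what I have in mind is something like the following, assuming SnappyData allows schema-qualified table names the way PostgreSQL uses separate databases (the schema and table names here are made up):

```sql
-- Hypothetical: give each run its own schema so table names don't collide
CREATE SCHEMA run1;
CREATE TABLE run1.input_data (id INT, val DOUBLE) USING column;

CREATE SCHEMA run2;
CREATE TABLE run2.input_data (id INT, val DOUBLE) USING column;
```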

user3230153
  • 123
  • 3
  • 11

1 Answer

1

I am not sure I follow the question. Maybe this will help:

When queries are submitted using snappy-sql, the shell uses JDBC to connect and run the query. Internally, Snappy will start a job and run concurrent tasks on each partition, depending on the query. And yes, this SQL session is internally associated with its own unique SnappySession (Spark session).

Or maybe you are trying to partition the data across many tables and start processing those tables independently but in parallel?
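If that is the goal, one approach (a rough sketch only, not tested against a cluster; the table and column names are made up) is to create one SnappySession per table from the shared SparkContext and run the per-table queries concurrently with Scala futures:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.SnappySession

// `session` is the SnappySession passed into the embedded job.
// Each Future creates its own SnappySession over the same SparkContext,
// so the parallel runs do not share session state.
val tables = Seq("data_part1", "data_part2", "data_part3")
val runs = tables.map { table =>
  Future {
    val s = new SnappySession(session.sparkContext)
    s.sql(
      s"SELECT a.id, count(*) FROM $table a JOIN dims d ON a.id = d.id GROUP BY a.id"
    ).collect()
  }
}
val results = Await.result(Future.sequence(runs), Duration.Inf)
```

Each query still runs its tasks in parallel across partitions; the futures just let the independent table-level jobs overlap instead of running one after another.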

jagsr
  • 535
  • 2
  • 6
  • Thanks very much for the reply. So it seems that creating multiple tables to split the data and running each SQL query in parallel is the way to go. I was hoping to reuse my existing code as-is, but I guess I'll need to modify my code now. – user3230153 Sep 22 '17 at 19:02