
I'm using Spark SQL to query Cassandra tables. In Cassandra, I've partitioned my data by a time bucket and an id, so depending on the query I need to union multiple partitions with Spark SQL and do the aggregations/group-by on the union result, something like this:

Dataset<Row> unionResult = null;
for (/* each Cassandra partition */) {
    Dataset<Row> currentPartition = sqlContext.sql(....);
    unionResult = (unionResult == null) ? currentPartition : unionResult.union(currentPartition);
}

Increasing the input (the number of loaded partitions) increases the response time more than linearly, because the unions are performed sequentially.
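One plausible contributor to the more-than-linear growth is what a left-fold of union does to the query plan: folding n partitions one by one yields a plan nested n levels deep, while combining them pairwise (tournament-style) keeps the depth at about log2 n. A minimal runnable sketch of that difference, using plain integers to model plan depth instead of real Dataset objects (the BinaryOperator would be Dataset::union in the real job):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.BinaryOperator;

public class BalancedReduce {
    // Combine a list pairwise instead of folding left. With Spark Datasets,
    // op would be Dataset::union; here op computes the depth of the tree
    // that the combining order would produce.
    static <T> T reduceBalanced(List<T> items, BinaryOperator<T> op) {
        List<T> level = new ArrayList<>(items);
        while (level.size() > 1) {
            List<T> next = new ArrayList<>();
            for (int i = 0; i + 1 < level.size(); i += 2)
                next.add(op.apply(level.get(i), level.get(i + 1)));
            if (level.size() % 2 == 1)
                next.add(level.get(level.size() - 1)); // odd one out carries over
            level = next;
        }
        return level.get(0);
    }

    public static void main(String[] args) {
        // Each leaf has depth 0; a union node is one level deeper than
        // its deepest child.
        BinaryOperator<Integer> unionDepth = (a, b) -> 1 + Math.max(a, b);
        List<Integer> leaves = Collections.nCopies(16, 0);

        int sequential = 0;
        for (int leaf : leaves)
            sequential = unionDepth.apply(sequential, leaf);

        int balanced = reduceBalanced(leaves, unionDepth);

        System.out.println(sequential); // prints 16
        System.out.println(balanced);   // prints 4
    }
}
```

This only models plan depth; whether it translates into wall-clock savings depends on how Catalyst handles the nested unions, so treat it as a sketch of the idea rather than a measured fix.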

Because there is no harm in doing the unions in parallel, and I don't know how to force Spark to do them in parallel, right now I'm using a ThreadPool to asynchronously load all partitions in my application (which may cause OOM), and then I do the sort or simple group-by in Java (which makes me wonder why I'm using Spark at all).

The short question is: how do I force Spark SQL to load the Cassandra partitions in parallel while unioning them? I also don't want too many tasks in Spark; with my home-made async solution I use coalesce(1), so each task is very fast (mostly wait time on Cassandra).
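For reference, the home-made ThreadPool approach described above can be sketched as follows. This is a minimal stand-alone example, not the Spark code itself: plain Java lists stand in for Dataset<Row>, and loadPartition is a hypothetical placeholder for the per-partition sqlContext.sql(...) call, so the concurrency pattern can run and be inspected without a cluster:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class ParallelUnion {
    // Hypothetical stand-in for loading one Cassandra time bucket; in the
    // real job this would be sqlContext.sql(...) returning a Dataset<Row>.
    static List<Integer> loadPartition(int bucket) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 3; i++)
            rows.add(bucket * 10 + i);
        return rows;
    }

    public static List<Integer> loadAll(List<Integer> buckets, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Kick off every partition load concurrently...
            List<CompletableFuture<List<Integer>>> futures = buckets.stream()
                .map(b -> CompletableFuture.supplyAsync(() -> loadPartition(b), pool))
                .collect(Collectors.toList());
            // ...then union the results as each future completes.
            List<Integer> union = new ArrayList<>();
            for (CompletableFuture<List<Integer>> f : futures)
                union.addAll(f.join());
            return union;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> result = loadAll(List.of(1, 2, 3, 4), 4);
        System.out.println(result.size()); // prints 12 (4 buckets x 3 rows)
    }
}
```

Note that with real Datasets the sqlContext.sql(...) calls are lazy, so a thread pool mainly helps when each branch triggers an eager action (or job submission) per partition; that matches the OOM risk described above, since every branch's results are materialized in the driver.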

f.ald
  • Possible duplicate of [Spark union of multiple RDDs](https://stackoverflow.com/questions/33743978/spark-union-of-multiple-rdds) – 10465355 Nov 18 '18 at 10:40
  • It says "It is a matter of convenience.", but performance is my goal: I want to retrieve all Cassandra partitions simultaneously and union them asynchronously, since the order of the unions obviously does not matter – f.ald Nov 18 '18 at 13:00

0 Answers