
We are working on Qubole with Spark version 2.0.2.

We have a multi-step process in which every intermediate step writes its output to HDFS; this output is later consumed by the reporting layer.

For our use case, we want to avoid writing to HDFS, keep all the intermediate output as temporary tables in Spark, and write only the final reporting-layer output.
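For illustration, here is a minimal PySpark sketch of that pattern (the table, view, and database names are hypothetical), assuming each step can be expressed as a SQL transformation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reporting-pipeline").getOrCreate()

# Hypothetical input table; substitute your real source.
spark.table("source_db.events").createOrReplaceTempView("events")

# Each intermediate step becomes a temp view instead of an HDFS write.
spark.sql("""
    SELECT user_id, COUNT(*) AS event_cnt
    FROM events
    GROUP BY user_id
""").createOrReplaceTempView("step1")

# Only the final reporting-layer output is persisted.
spark.sql("""
    SELECT event_cnt, COUNT(*) AS users
    FROM step1
    GROUP BY event_cnt
""").write.mode("overwrite").saveAsTable("reporting.final_output")
```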

For this implementation, we wanted to use the Job Server provided by Qubole, but when we trigger multiple queries on the Job Server, it runs the jobs sequentially.
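One cause worth ruling out (an assumption on my part, not confirmed behavior of Qubole's Job Server): within a single Spark application, jobs submitted from the same thread always run one after another, and the default in-application scheduler is FIFO. A minimal sketch of enabling the FAIR scheduler on the shared context, with a hypothetical pool name:

```python
from pyspark.sql import SparkSession

# With the default FIFO mode, a large job at the head of the queue can
# hold back every job submitted after it; FAIR lets jobs share executors.
spark = (SparkSession.builder
         .appName("shared-context")
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

# Optional: route subsequent jobs from this thread to a named pool
# ("reporting" is hypothetical; pools are defined in fairscheduler.xml).
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "reporting")
```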

I have observed the same behavior on a Databricks cluster as well.

The cluster we are using has 30 r4.2xlarge nodes.

Does anyone have experience running multiple jobs using a job server?

The community's help will be greatly appreciated!

  • Read this document; I hope it's useful for you: https://github.com/spark-jobserver/spark-jobserver – Bhupesh May 02 '17 at 13:26
  • Thanks Bhupesh, but I cannot find anything about parallelism in this document. – Hardeep Saluja May 02 '17 at 13:59
  • @BhupeshKushwaha I also can't find any mention of parallelism in the Spark Job Server docs. – VB_ May 12 '17 at 12:38
  • @HardeepSaluja have you found any information about this? – VB_ May 12 '17 at 12:39
  • Maybe you could give this a try: https://github.com/streamlyio/streamly-spark-examples/blob/master/streamly-learning-spark/src/main/java/io/streamly/examples/SparkRunJobParallel.java – berrytchaks May 12 '17 at 15:43
  • Thanks for all your replies, but I am still not able to run parallel queries. I have written Python code to do multithreading, similar to what berrytchaks mentioned (his example is in Java). But I am worried: if I use multithreading and create hundreds of threads, will my Spark context support that? (See the thread-pool sketch after these comments.) – Hardeep Saluja May 16 '17 at 09:17
  • Which mode is used to run Spark? There's only a FIFO scheduler for standalone mode. – morsik Mar 26 '20 at 08:55
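Following up on the multithreading concern in the comments above: as a sketch (not a confirmed job-server feature), a bounded thread pool lets a few worker threads submit jobs concurrently on one shared context, so hundreds of threads are never needed. SparkContext is thread-safe for job submission; the queries below are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-queries").getOrCreate()

# Hypothetical queries; each .collect() triggers an independent Spark job.
queries = [
    "SELECT COUNT(*) FROM step1",
    "SELECT MAX(event_cnt) FROM step1",
]

def run(sql):
    # Each worker thread submits its own job on the shared SparkContext.
    return spark.sql(sql).collect()

# A bounded pool caps concurrency instead of spawning hundreds of threads;
# jobs from different threads can run in parallel if executors are free.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run, queries))
```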

0 Answers