I'm trying to execute a Spark program with spark-submit (in particular GATK Spark tools, so the command is not spark-submit itself, but something similar). This program accepts only a single input, so I'm trying to write some Java code in order to accept multiple inputs.
In particular, I'm trying to execute a spark-submit for each input, through the pipe function of JavaRDD:
// build one "par1|par2" record per input and hand each record to script.sh
JavaRDD<String> bashExec = ubams.map(ubam -> par1 + "|" + par2)
                                .pipe("/path/script.sh");
where par1 and par2 are parameters that will be passed to script.sh, which will parse them (splitting on "|") and use them to execute something similar to spark-submit.
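Roughly, script.sh does something like this (a simplified sketch; the tool name and flags below are placeholders for my actual invocation):

#!/bin/bash
# Each line received on stdin has the form "par1|par2";
# split on "|" and launch the Spark tool once per record.
while IFS='|' read -r par1 par2; do
    # placeholder invocation; the real command runs a GATK Spark tool
    gatk PrintReadsSpark --input "$par1" --output "$par2"
done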
Now, I don't expect to obtain a speedup compared to the execution of a single input, because I'm calling other Spark functions; I just want to distribute the workload of multiple inputs across different nodes and get execution time linear in the number of inputs.
For example, the GATK Spark tool took about 108 minutes with a single input, so with my code I would expect two similar inputs to take about 216 minutes.
I noticed that the code "works", or rather that I obtain the usual output on my terminal. However, after at least 15 hours the task still hadn't completed and was still executing.
So I'm asking: is this approach (executing spark-submit with the pipe function) fundamentally misguided, or are there probably other errors?
I hope I have explained my issue clearly.
P.S. I'm using a VM on Azure with 28 GB of memory and 4 execution threads.