The other two answers are correct. However, here's another approach using Scala Futures.
An elaborate explanation of this approach can be found here: http://www.russellspitzer.com/2017/02/27/Concurrency-In-Spark/
You will see that the output directories are indeed written at around the same time, rather than one after another.
import org.apache.spark.sql.DataFrame
import scala.concurrent.duration.Duration.Inf
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}

// Writes one data frame to its own output directory
// (format("avro") requires the spark-avro module on the classpath)
def write(df: DataFrame, i: Int): Unit =
  df.repartition(5).write.mode("overwrite").format("avro").save(s"/path/to/output_$i")
val dataframes = Seq(df, df2) // replace with your list of data frames
// This is the line that "executes" the writes in parallel
// Use dataframes.zipWithIndex.map to create a list of futures, one per write
// Use Future.sequence to compose all of these futures into a single future
// Use Await.result to wait until this "composite" future completes
Await.result(Future.sequence(dataframes.zipWithIndex.map { case (d, i) => Future(write(d, i)) }), Inf)
You can set a finite timeout instead of Inf, and you can also batch together sublists of data frames, if needed, to limit the parallelism, as in the sketch below.
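As an illustration, here is a minimal sketch of the batching idea, reusing the dataframes and write defined above; batchSize is a hypothetical knob, and 30.minutes is an arbitrary timeout. Each group of writes must finish before the next group starts, so at most batchSize writes run concurrently:

import scala.concurrent.duration._

val batchSize = 2 // hypothetical: at most this many writes run at once
dataframes.zipWithIndex.grouped(batchSize).foreach { batch =>
  // Block until the current batch of writes completes before launching the next
  Await.result(Future.sequence(batch.map { case (d, i) => Future(write(d, i)) }), 30.minutes)
}

Note that the implicit global execution context is itself backed by a thread pool sized to the number of available cores, so it already bounds how many write futures actually run at once on the driver.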