I have 10 DataFrames with the same schema that I'd like to combine into one DataFrame. Each DataFrame is constructed using sqlContext.sql("select ... from ...").cache, which means that, technically, the DataFrames are not actually computed until they are used.

So, if I run:

val df_final = df1.unionAll(df2).unionAll(df3).unionAll(df4) ...

will Spark calculate all these DataFrames in parallel or one by one (due to the dot operator)?

And also, while we're here - is there a more elegant way to perform a unionAll on several DataFrames than the one I listed above?

shakedzy
  • Regarding the last part, see http://stackoverflow.com/a/37612978/1560062. As for whether it happens "in parallel": that depends on what you mean by parallel, as well as on the available resources and data. – zero323 Aug 12 '16 at 13:21
  • @zero323 does it happen asynchronously and without blocking, assuming there are enough resources to handle it? – shakedzy Aug 12 '16 at 13:42
  • I think that Daniel pretty much answered this question :) – zero323 Aug 12 '16 at 14:52

1 Answer

unionAll is lazy. The example line in your question does not trigger any calculation, synchronous or asynchronous.

More generally, Spark is a distributed computation system. Each operation is itself made up of many tasks that are processed in parallel, so you usually don't have to worry about whether two operations can run in parallel. The cluster resources will be well utilized either way.
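As a rough sketch of both points, one common way to avoid the long chain is to put the DataFrames in a sequence and fold it with reduce. The object name UnionSketch and the helper unionMany are hypothetical names used only for illustration, not part of Spark, and the sketch assumes every DataFrame in the sequence has the same schema. Because unionAll is lazy, the helper only builds a bigger plan; nothing runs until an action is called on the result.

```scala
import org.apache.spark.sql.DataFrame

object UnionSketch {
  // Hypothetical helper (not a Spark API): combines a non-empty sequence of
  // same-schema DataFrames into one by applying unionAll pairwise.
  def unionMany(dfs: Seq[DataFrame]): DataFrame = {
    require(dfs.nonEmpty, "need at least one DataFrame to union")
    // reduce only chains lazy transformations; no Spark job is started here.
    dfs.reduce(_ unionAll _)
  }
}
```

With the names from the question, usage might look like val df_final = UnionSketch.unionMany(Seq(df1, df2, df3, df4)). The computation itself starts only when an action such as df_final.count() is called. On Spark 2.0 and later, union is the non-deprecated name for the same operation.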

Daniel Darabos