I've set up a simple Spark ML app with a pipeline of independent transformers that each add columns to a DataFrame of raw data. Since the transformers don't look at each other's output, I was hoping I could run them in parallel in a non-linear (DAG) pipeline. All I could find about this feature is this paragraph from the Spark ML guide:
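For concreteness, my setup looks roughly like this (the column names and the particular transformers are just placeholders for my real ones):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{Binarizer, Bucketizer}

// Two independent transformers: each one reads only raw columns
// and adds its own output column ("rawA"/"rawB" are placeholder names).
val binarizer = new Binarizer()
  .setInputCol("rawA")
  .setOutputCol("flagA")
  .setThreshold(0.5)

val bucketizer = new Bucketizer()
  .setInputCol("rawB")
  .setOutputCol("bucketB")
  .setSplits(Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity))

// Neither stage reads the other's output, so in principle
// they could run in parallel instead of strictly one after the other.
val pipeline = new Pipeline().setStages(Array(binarizer, bucketizer))
```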
> It is possible to create non-linear Pipelines as long as the data flow graph forms a Directed Acyclic Graph (DAG). This graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters). If the Pipeline forms a DAG, then the stages must be specified in topological order.
My understanding of the paragraph is that if I set the inputCol(s) and outputCol parameters for each transformer, and specify the stages in topological order when I create the pipeline, then the engine will use that information to build an execution DAG such that each stage can run as soon as its inputs are ready.
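In code, I read that as something like this (continuing the sketch above, with a third stage that depends on the first two):

```scala
import org.apache.spark.ml.feature.VectorAssembler

// This stage consumes the outputs of both independent stages above,
// so it must come after them in the stage array (topological order).
val assembler = new VectorAssembler()
  .setInputCols(Array("flagA", "bucketB"))
  .setOutputCol("features")

val dagPipeline = new Pipeline()
  .setStages(Array(binarizer, bucketizer, assembler))
```

If my understanding is right, the engine could run `binarizer` and `bucketizer` concurrently and only then run `assembler`.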
Some questions about that:
- Is my understanding correct?
- What happens if for one of the stages/transformers I don't specify an output column (e.g. the stage only filters some of the rows; see the first sketch after this list)? Will it assume that, for DAG-creation purposes, the stage changes all columns, so all subsequent stages have to wait for it?
- Likewise, what happens if for one of the stages I don't specify any inputCol(s)? Will the stage wait until all previous stages have completed?
- It seems I can specify multiple input columns but only one output column. What happens if a transformer adds two columns to a DataFrame (Spark itself has no problem with that; see the second sketch after this list)? Is there some way to let the DAG-creation engine know about it?
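To make the filtering case concrete, here is a minimal sketch of the kind of stage I mean in the second question (the class and column names are made up):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical stage that only drops rows: it has no inputCol/outputCol
// params at all, so nothing tells a DAG builder which columns it touches.
class RowFilter(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("rowFilter"))

  override def transform(ds: Dataset[_]): DataFrame =
    ds.filter(col("rawA").isNotNull).toDF()

  // The schema is unchanged -- only rows are removed.
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): RowFilter = defaultCopy(extra)
}
```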
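And here is a sketch of a transformer that adds two columns in one pass, for the last question (again, all the names are hypothetical):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical transformer that adds TWO columns at once; there is
// no pair of outputCol params with which to declare them both.
class TwoColumnAdder(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("twoColAdder"))

  override def transform(ds: Dataset[_]): DataFrame =
    ds.withColumn("sumAB", col("rawA") + col("rawB"))
      .withColumn("diffAB", col("rawA") - col("rawB"))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields ++ Seq(
      StructField("sumAB", DoubleType, nullable = true),
      StructField("diffAB", DoubleType, nullable = true)))

  override def copy(extra: ParamMap): TwoColumnAdder = defaultCopy(extra)
}
```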