I've set up a simple Spark ML app with a pipeline of independent transformers that each add columns to a DataFrame of raw data. Since the transformers don't look at each other's output, I was hoping I could run them in parallel in a non-linear (DAG) pipeline. All I could find about this feature is this paragraph from the Spark ML guide:
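For concreteness, my setup looks roughly like this (the column names and the particular transformers are just placeholders for my real ones):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{Binarizer, Bucketizer}

// Two independent transformers: each one reads only raw columns
// and adds its own output column ("rawA"/"rawB" are placeholder names).
val binarizer = new Binarizer()
  .setInputCol("rawA")
  .setOutputCol("flagA")
  .setThreshold(0.5)

val bucketizer = new Bucketizer()
  .setInputCol("rawB")
  .setOutputCol("bucketB")
  .setSplits(Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity))

// Neither stage reads the other's output, so in principle
// they could run in parallel instead of strictly one after the other.
val pipeline = new Pipeline().setStages(Array(binarizer, bucketizer))
```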
> It is possible to create non-linear Pipelines as long as the data flow graph forms a Directed Acyclic Graph (DAG). This graph is currently specified implicitly based on the input and output column names of each stage (generally specified as parameters). If the Pipeline forms a DAG, then the stages must be specified in topological order.
My understanding of the paragraph is that if I set the inputCol(s) and outputCol parameters for each transformer, and specify the stages in topological order when I create the pipeline, then the engine will use that information to build an execution DAG such that each stage can run as soon as its inputs are ready.
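In code, I read that as something like this (continuing the sketch above, with a third stage that depends on the first two):

```scala
import org.apache.spark.ml.feature.VectorAssembler

// This stage consumes the outputs of both independent stages above,
// so it must come after them in the stage array (topological order).
val assembler = new VectorAssembler()
  .setInputCols(Array("flagA", "bucketB"))
  .setOutputCol("features")

val dagPipeline = new Pipeline()
  .setStages(Array(binarizer, bucketizer, assembler))
```

If my understanding is right, the engine could run `binarizer` and `bucketizer` concurrently and only then run `assembler`.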
Some questions about that:
- Is my understanding correct?
- What happens if for one of the stages/transformers I don't specify an output column (e.g. the stage only filters some of the rows; see the first sketch after this list)? Will it assume that, for DAG-creation purposes, the stage changes all columns, so all subsequent stages have to wait for it?
- Likewise, what happens if for one of the stages I don't specify any inputCol(s)? Will the stage wait until all previous stages have completed?
- It seems I can specify multiple input columns but only one output column. What happens if a transformer adds two columns to a DataFrame (Spark itself has no problem with that; see the second sketch after this list)? Is there some way to let the DAG-creation engine know about it?
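To make the filtering case concrete, here is a minimal sketch of the kind of stage I mean in the second question (the class and column names are made up):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical stage that only drops rows: it has no inputCol/outputCol
// params at all, so nothing tells a DAG builder which columns it touches.
class RowFilter(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("rowFilter"))

  override def transform(ds: Dataset[_]): DataFrame =
    ds.filter(col("rawA").isNotNull).toDF()

  // The schema is unchanged -- only rows are removed.
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): RowFilter = defaultCopy(extra)
}
```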
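And here is a sketch of a transformer that adds two columns in one pass, for the last question (again, all the names are hypothetical):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical transformer that adds TWO columns at once; there is
// no pair of outputCol params with which to declare them both.
class TwoColumnAdder(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("twoColAdder"))

  override def transform(ds: Dataset[_]): DataFrame =
    ds.withColumn("sumAB", col("rawA") + col("rawB"))
      .withColumn("diffAB", col("rawA") - col("rawB"))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields ++ Seq(
      StructField("sumAB", DoubleType, nullable = true),
      StructField("diffAB", DoubleType, nullable = true)))

  override def copy(extra: ParamMap): TwoColumnAdder = defaultCopy(extra)
}
```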