I have built a Scala application on Spark v1.6.0 that combines various pieces of functionality: code for scanning a DataFrame for certain entries, code that performs certain computations on a DataFrame, code for creating an output, and so on.
At the moment the components are combined 'statically': in my code I call component X to do a computation, take the resulting data, and pass it as input to a method of component Y.
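To illustrate the current 'static' wiring, here is a simplified sketch; ComponentX, ComponentY, the paths and the column name are hypothetical stand-ins for my actual code:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical components standing in for my real ones
object ComponentX {
  def compute(input: DataFrame): DataFrame = input.filter("value > 0") // some computation
}

object ComponentY {
  def writeOutput(result: DataFrame): Unit = result.write.parquet("/tmp/output") // some output step
}

object StaticPipeline {
  // The flow is hard-wired: X's result is handed straight to Y
  def run(sqlContext: SQLContext): Unit = {
    val input    = sqlContext.read.parquet("/tmp/input")
    val computed = ComponentX.compute(input)
    ComponentY.writeOutput(computed)
  }
}
```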
I would like to make this more flexible, so that a user can simply specify a pipeline (possibly one where some steps run in parallel). I assume the workflows will be rather small and simple, as in the following picture:
However, I do not know how best to approach this problem:
- I could build the whole pipeline logic myself, which would probably mean quite some work and possibly some errors too...
- I have seen that Apache Spark comes with a Pipeline class in its ML package; however, if I understand correctly it does not support parallel execution (in the example above, the two ParquetReaders could read and process their data at the same time). A rough sketch of the kind of parallelism I mean follows this list.
- There is apparently the Luigi project, which might do exactly this; however, its page says Luigi is for long-running workflows, whereas I only need short-running ones, so Luigi might be overkill?
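To make it concrete, here is a rough, untested sketch of what I mean by parallel branches: two independent Parquet reads are kicked off from separate driver threads via Scala Futures and then joined before a downstream stage runs. The Stage trait, the paths and the object names are hypothetical, not taken from any existing library:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

import org.apache.spark.sql.{DataFrame, SQLContext}

object ParallelSketch {

  // A minimal notion of a pipeline stage: a function from DataFrame to DataFrame
  trait Stage extends (DataFrame => DataFrame)

  // Hypothetical example: two independent Parquet reads run in parallel,
  // their results are combined and handed to a downstream stage.
  def runParallelBranches(sqlContext: SQLContext, downstream: Stage): DataFrame = {
    // Each branch is submitted from its own driver thread, so Spark can
    // schedule the two read jobs concurrently.
    val branchA = Future { sqlContext.read.parquet("/data/inputA") }
    val branchB = Future { sqlContext.read.parquet("/data/inputB") }

    val combined = for {
      a <- branchA
      b <- branchB
    } yield a.unionAll(b) // unionAll is the Spark 1.6 name (union in 2.x)

    // Wait for both branches, then run the next stage on the combined result
    downstream(Await.result(combined, 1.hour))
  }
}
```

Something along these lines would work for a hard-coded graph, but I would still have to build the general "user specifies a pipeline" layer on top of it myself, which brings me back to my question.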
What would you suggest for building workflows/dataflows in Spark?