
I have built a Scala application on Spark 1.6.0 that combines various functionalities: I have code for scanning a DataFrame for certain entries, code that performs certain computations on a DataFrame, code for creating an output, and so on.

At the moment the components are combined 'statically': in my code I call component X to do its computation, take the resulting data, and pass it to a method of component Y that takes that data as input.
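
To illustrate, the static wiring currently looks roughly like the following sketch (the component objects and column names here are made up for the example):

    import org.apache.spark.sql.DataFrame

    // Hypothetical stand-ins for the real components
    object ComponentX {
      def compute(df: DataFrame): DataFrame = df.filter(df("value") > 0)
    }
    object ComponentY {
      def process(df: DataFrame): DataFrame = df.groupBy("key").count()
    }

    // The 'static' combination: each result is passed explicitly to the next component
    def runFlow(input: DataFrame): DataFrame =
      ComponentY.process(ComponentX.compute(input))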

I would like to make this more flexible, so that a user can simply specify a pipeline (possibly one with parallel execution). I assume the workflows will be rather small and simple, as in the following picture:

[image: exemplary workflow]

However, I do not know how to best approach this problem.

  • I could build the whole pipeline logic myself, which would probably mean quite a bit of work and likely some errors too...
  • I have seen that Apache Spark comes with a Pipeline class in the ML package; however, if I understand correctly, it does not support parallel execution (in the example, the two ParquetReaders could read and process their data at the same time); see the sketch of that API after this list
  • there is apparently the Luigi project that might do exactly this (however, the page says that Luigi is for long-running workflows, whereas I only need short-running ones; Luigi might be overkill?)
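
For reference, this is roughly what the ML Pipeline API looks like in Spark 1.6 (the stages and column names are just the standard text-classification example, not my actual components); the stages run strictly one after another:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr        = new LogisticRegression().setMaxIter(10)

    // Stages execute in the order given; there is no notion of branches
    // that run in parallel.
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // fit() runs the stages over a training DataFrame and returns a
    // PipelineModel that can transform new data:
    // val model = pipeline.fit(trainingDF)        // trainingDF is a placeholder
    // val predictions = model.transform(testDF)   // testDF is a placeholder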

What would you suggest for building work/dataflows in Spark?

Jacek Laskowski
navige

1 Answer


I would suggest using Spark's MLlib pipeline functionality; what you describe sounds like it fits that case well. One nice thing about it is that it lets Spark optimize the flow for you, probably more cleverly than you could yourself.

You mention it can't read the two Parquet files in parallel, but it can read each separate file in a distributed way. So rather than having N/2 nodes process each file separately, you would have N nodes process them in series, which I'd expect to give you a similar runtime, especially if the mapping to y-c is 1-to-1. Basically, you don't have to worry about Spark underutilizing your resources (if your data is partitioned properly).
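
As a minimal sketch (Spark 1.6 DataFrame API; the paths are placeholders), reading both files and combining them in one flow already gives you a fully distributed job:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("two-parquet-inputs"))
    val sqlContext = new SQLContext(sc)

    // Each read is distributed over all partitions/nodes, so the two inputs
    // are processed one after the other by the whole cluster rather than
    // each by half of it.
    val left  = sqlContext.read.parquet("/data/input-a")
    val right = sqlContext.read.parquet("/data/input-b")

    // Combine them for the downstream steps (schemas must match).
    val combined = left.unionAll(right)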

But things may turn out even better than that, because Spark is smarter at optimising the flow than you are. An important thing to keep in mind is that Spark may not do things exactly in the way and in the separate steps that you define: when you tell it to compute y-c, it doesn't actually do that right away. It is lazy (in a good way!) and waits until you have built up the whole flow and ask it for answers. At that point it analyses the flow, applies optimisations (for example, it may figure out that it doesn't have to read and process a large chunk of one or both Parquet files, especially with partition discovery), and only then executes the final plan.
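
A small, self-contained sketch of that laziness (the path and column names are invented; assume year is a partition column):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("lazy-flow"))
    val sqlContext = new SQLContext(sc)

    // These are all lazy transformations: nothing is read or computed yet.
    val events = sqlContext.read.parquet("/data/events")
    val recent = events.filter(events("year") === 2016)
    val counts = recent.groupBy("eventType").count()

    // Only the action triggers planning and execution; with partition
    // discovery the filter on 'year' lets Spark skip whole Parquet
    // partitions instead of reading everything.
    counts.show()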

sgvd
  • It is indeed a good way to optimize your code. However, no free lunch here! I don't find Spark's pipelines to be particularly well optimized for parallelism. For a simple cross-validation pipeline with logistic regression as the estimator, the CPU utilization on 3 out of 4 EC2 slaves is under 50%. Not great! Everything is cached in RAM, though. I am currently looking into how to optimize it. – Boris Feb 07 '17 at 15:35