I am looking for a way to implement a SparkCompute (or SparkSink) plugin that consumes from multiple inputs.

Looking at the interface, both SparkCompute and SparkSink plugins appear to be limited to consuming only one input.

This is an excerpt from io.cdap.cdap.etl.api.batch.SparkCompute:


  /**
   * Transform the input and return the output to be sent to the next stage in the pipeline.
   *
   * @param context {@link SparkExecutionPluginContext} for this job
   * @param input input data to be transformed
   * @throws Exception if there is an error during this method invocation
   */
  public abstract JavaRDD<OUT> transform(SparkExecutionPluginContext context, JavaRDD<IN> input) throws Exception;

(the method signature takes only a single JavaRDD<IN> parameter)

Is there any way to access all the inputs (via SparkExecutionPluginContext context or something similar)?

egordoe
1 Answer


In a CDAP pipeline, when a stage has multiple input stages, it receives the union of all the incoming data. This is why the pipeline framework will not allow you to create a pipeline where a non-joiner stage has inputs with different schemas; joiner plugins are the only exception. So your plugin does process multiple inputs, just not in a way that lets you distinguish which input a given record came from.
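
As a rough sketch of what that means in code (the plugin name here is made up, and this assumes the standard @Plugin/@Name annotations and StructuredRecord records), a compute stage wired to several upstream stages still receives a single, already-unioned RDD:

    import io.cdap.cdap.api.annotation.Name;
    import io.cdap.cdap.api.annotation.Plugin;
    import io.cdap.cdap.api.data.format.StructuredRecord;
    import io.cdap.cdap.etl.api.batch.SparkCompute;
    import io.cdap.cdap.etl.api.batch.SparkExecutionPluginContext;
    import org.apache.spark.api.java.JavaRDD;

    // Hypothetical plugin, for illustration only.
    @Plugin(type = SparkCompute.PLUGIN_TYPE)
    @Name("UnionCompute")
    public class UnionCompute extends SparkCompute<StructuredRecord, StructuredRecord> {

      @Override
      public JavaRDD<StructuredRecord> transform(SparkExecutionPluginContext context,
                                                 JavaRDD<StructuredRecord> input) throws Exception {
        // 'input' already contains the records of every connected input stage,
        // unioned together. Nothing on the record or in this signature says which
        // stage a given record came from, so per-source logic would have to key
        // off a field that the upstream stages themselves populate.
        long total = input.count(); // counts records from all inputs combined
        return input;
      }
    }

If you really need to tell the inputs apart, that is what joiner plugins are for.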

Albert Shau
  • hello Albert, you can set `"multipleInputs": true` for any of the plugin types ([this](https://docs.cask.co/cdap/5.1.2/en/developer-manual/pipelines/developing-plugins/presentation-plugins.html#inputs) page clearly says that). If you set `"multipleInputs": true`, CDAP won't assume all of the input schemas are identical. – egordoe May 24 '19 at 16:52
  • That is documentation around the UI widgets and does not reflect what happens in the backend. I believe it was done to support joiner plugins in the UI, but it should never have been made configurable. If you try to set that and deploy a pipeline with different input schemas to a non-joiner, it will fail. – Albert Shau May 28 '19 at 17:51