
I have a simple Dataflow Java job that reads a few lines from a .csv file. Each line contains a numeric cell, which represents how many times a certain function has to be performed on that line.

I don't want to do that with a traditional for loop inside the function, in case these numbers become very large. What is the right way to do this using the parallel-friendly Dataflow methodology?

Here's the current Java code:

public class SimpleJob {

    static class MyDoFn extends DoFn<String, Integer> {

        @Override
        public void processElement(ProcessContext c) {
            String[] parts = c.element().split(",");
            String name = parts[0];
            int val = Integer.valueOf(parts[1]);
            for (int i = 0; i < val; i++) // <- what's the preferred way to do this in DF?
                System.out.println("Processing some function: " + name); // <- do something
            c.output(val);
        }

    }

    public static void main(String[] args) {

        DataflowPipelineOptions options = PipelineOptionsFactory
                .as(DataflowPipelineOptions.class);
        options.setProject(DEF.ID_PROJ);
        options.setStagingLocation(DEF.ID_STG_LOC);
        options.setRunner(DirectPipelineRunner.class);

        Pipeline pipeline = Pipeline.create(options);

        pipeline.apply(TextIO.Read.from("Source.csv"))
                .apply(ParDo.of(new MyDoFn()));

        pipeline.run();
    }
}

This is what "Source.csv" looks like (each number represents how many times I want to run a parallel function on that line):

Joe,3
Mary,4
Peter,2


1 Answer


Curiously enough, this is one of the motivating use cases for Splittable DoFn! That API is currently in heavy development.

However, until that API is available, you can basically mimic most of what it would have done for you:

class ElementAndRepeats { String element; int numRepeats; }

PCollection<String> lines = p.apply(TextIO.Read....);
PCollection<ElementAndRepeats> elementAndNumRepeats = lines.apply(
    ParDo.of(...parse number of repetitions from the line...));
PCollection<ElementAndRepeats> elementAndNumSubRepeats = elementAndNumRepeats
    .apply(ParDo.of(
        ...split large numbers of repetitions into smaller numbers...))
    .apply(...fusion break...);
elementAndNumSubRepeats.apply(ParDo.of(...execute the repetitions...));

where:

  • "split large numbers of repetitions" is a DoFn that, e.g., splits an ElementAndRepeats{"foo", 34} into {ElementAndRepeats{"foo", 10}, ElementAndRepeats{"foo", 10}, ElementAndRepeats{"foo", 10}, ElementAndRepeats{"foo", 4}}
  • fusion break - see here; it prevents the several ParDos from being fused together, which would defeat the parallelization
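The splitting step in the first bullet can be sketched as a plain helper, independent of the Dataflow SDK (the name `splitRepeats` and the chunk size are illustrative choices, not part of any API):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {

    // Splits a total repetition count into chunks of at most maxChunk,
    // e.g. splitRepeats(34, 10) -> [10, 10, 10, 4]. In the pipeline, each
    // chunk would be emitted as its own ElementAndRepeats element.
    static List<Integer> splitRepeats(int total, int maxChunk) {
        List<Integer> chunks = new ArrayList<>();
        while (total > 0) {
            int chunk = Math.min(total, maxChunk);
            chunks.add(chunk);
            total -= chunk;
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(splitRepeats(34, 10)); // [10, 10, 10, 4]
    }
}
```

Inside the splitting DoFn you would call something like this once per input element and `c.output(...)` one `ElementAndRepeats` per chunk.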
jkff
  • Tried with KV instead of a custom ElementAndRepeats as you suggested. Works well. Two more questions: (1) what is the appropriate number of mini-loops to break this down to? You are showing 10, but is there any best practice? Imagine original numbers on the order of hundreds of thousands, or possibly millions. (2) Do you have any examples of how to do a fusion break? I read your link, but it doesn't seem clear from it. I posted the updated full code in a related question here: [link](http://stackoverflow.com/questions/41091713/sharing-bigtable-connection-object-among-dataflow-dofn-sub-classes) – VS_FF Dec 15 '16 at 18:30
  • 1 - it doesn't matter much. "1" would be too low because there'd be some overhead associated with each ElementAndRepeats; 1M would be too high because you wouldn't get extra parallelization. Anything on the order of dozens to thousands is likely to give you near-identical performance. 2 - one way to do it is: https://github.com/apache/incubator-beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L307 – jkff Dec 15 '16 at 19:29
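To put rough numbers on the chunk-size trade-off from the comment above (the figures and the helper name `numElements` are illustrative, not from the SDK):

```java
public class ChunkMath {

    // How many ElementAndRepeats elements a total of `total` repetitions
    // produces when split into chunks of at most `chunk` (ceiling division).
    static int numElements(int total, int chunk) {
        return (total + chunk - 1) / chunk;
    }

    public static void main(String[] args) {
        int total = 1_000_000;
        for (int chunk : new int[] {1, 1000, 1_000_000}) {
            // chunk=1: one element per repetition -> per-element overhead dominates.
            // chunk=1_000_000: a single element -> no extra parallelism.
            // chunk=1000: plenty of parallel elements with modest overhead.
            System.out.println("chunk=" + chunk + " -> "
                + numElements(total, chunk) + " elements");
        }
    }
}
```

Anything in the dozens-to-thousands range lands comfortably between the two bad extremes, which is why the exact value matters little.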