I have a simple Dataflow Java job that reads a few lines from a .csv file. Each line contains a numeric cell that represents how many times a certain function has to be performed for that line.
I don't want to do this with a traditional for loop inside the function, in case these numbers become very large. What is the right way to do this using the parallel-friendly Dataflow methodology?
Here's the current Java code:
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class SimpleJob {

    static class MyDoFn extends DoFn<String, Integer> {
        @Override
        public void processElement(ProcessContext c) {
            String[] parts = c.element().split(",");
            String name = parts[0];
            int val = Integer.parseInt(parts[1]);
            for (int i = 0; i < val; i++) { // <- what's the preferred way to do this in DF?
                System.out.println("Processing some function: " + name); // <- do something
            }
            c.output(val);
        }
    }

    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory
            .as(DataflowPipelineOptions.class);
        options.setProject(DEF.ID_PROJ);
        options.setStagingLocation(DEF.ID_STG_LOC);
        options.setRunner(DirectPipelineRunner.class);

        Pipeline pipeline = Pipeline.create(options);
        pipeline.apply(TextIO.Read.from("Source.csv"))
                .apply(ParDo.of(new MyDoFn()));
        pipeline.run();
    }
}
This is what "Source.csv" looks like (each number represents how many times I want to run a parallel function for that line):
Joe,3
Mary,4
Peter,2
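
One idea I've been playing with (I'm not sure it's the idiomatic Dataflow approach, hence the question) is to split the work across two ParDos: a first DoFn that fans each line out into one element per step, and a second DoFn that does the actual per-step work, so the runner can parallelize the individual steps. A rough sketch of what I mean (FanOutFn and StepFn are just placeholder names I made up):

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

// Fan out: emit one (name, stepIndex) element per requested step,
// so the heavy work itself is no longer inside one big loop.
static class FanOutFn extends DoFn<String, KV<String, Integer>> {
    @Override
    public void processElement(ProcessContext c) {
        String[] parts = c.element().split(",");
        String name = parts[0];
        int val = Integer.parseInt(parts[1].trim());
        for (int i = 0; i < val; i++) {
            c.output(KV.of(name, i)); // one output element per step
        }
    }
}

// Per-step work: each element can now be processed independently.
static class StepFn extends DoFn<KV<String, Integer>, Integer> {
    @Override
    public void processElement(ProcessContext c) {
        System.out.println("Processing step " + c.element().getValue()
            + " for " + c.element().getKey()); // <- do something
        c.output(c.element().getValue());
    }
}

// In main(), the pipeline would then become:
// pipeline.apply(TextIO.Read.from("Source.csv"))
//         .apply(ParDo.of(new FanOutFn()))
//         .apply(ParDo.of(new StepFn()));

Is something like this the recommended pattern, or does Dataflow have a better primitive for this kind of "repeat N times" work?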