
I am implementing an UnboundedReader to use a custom data source (based on a company-internal, subscription-based Java API). When I execute a pipeline, I notice that multiple instances of UnboundedReader are created. How does Beam decide how many times to call the

public abstract UnboundedSource.UnboundedReader<OutputT> createReader(PipelineOptions options, CheckpointMarkT checkpointMark)

method of UnboundedSource?

My split() method is implemented as:

public List<? extends UnboundedSource<MyRecord, MyCheckpointMark>> split(int desiredNumSplits, PipelineOptions options) throws Exception {
    List<MySubscriptionSource> list = new ArrayList<>(1);
    list.add(this);
    return list;
}

Is there a way to force only a single reader to be created?

alex.tashev

1 Answer


I did some digging and read the direct runner source. It is written to randomly close the existing reader (with a probability of 5%) and force a restore from a checkpoint, which means createReader is called again with the saved checkpoint mark even when split() returns a single source: https://github.com/apache/beam/blob/a679d98cbcc49b01528c168cce8b578338a5bcdd/runners/direct-java/src/main/java/org/apache/beam/runners/direct/UnboundedReadEvaluatorFactory.java#L150

There are no comments explaining why; my guess is that it is done to simulate some rate of failure.
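To get a feel for how often a new reader would be created under that behaviour, here is a small standalone simulation. It is not Beam code and does not call any Beam API: the class name, the method name and the idea of counting "bundles" are made up for illustration, and the 0.95 reuse chance is simply the complement of the 5% mentioned above.

```java
import java.util.Random;

/** Toy model of a "keep or close the reader" decision; hypothetical, not taken from Beam. */
public class ReaderReuseSimulation {

    /**
     * Counts how many readers get created over the given number of bundles
     * when, after each bundle, the current reader is kept with probability
     * reuseChance and closed otherwise.
     */
    public static int countReaderCreations(int bundles, double reuseChance, long seed) {
        Random random = new Random(seed);
        boolean hasReader = false;
        int creations = 0;
        for (int i = 0; i < bundles; i++) {
            if (!hasReader) {
                // Stands in for a createReader(options, checkpointMark) call.
                creations++;
                hasReader = true;
            }
            // Reading the bundle itself is omitted in this sketch.
            if (random.nextDouble() >= reuseChance) {
                // Stands in for closing the reader; the next bundle must create a new one.
                hasReader = false;
            }
        }
        return creations;
    }

    public static void main(String[] args) {
        int bundles = 10_000;
        int creations = countReaderCreations(bundles, 0.95, 42L);
        System.out.println("bundles=" + bundles + ", readers created=" + creations);
    }
}
```

With a reuse chance of 1.0 the sketch creates exactly one reader, and with 0.0 it creates one per bundle. The practical takeaway is that a reader used with the direct runner should expect to be closed and re-created from its checkpoint mark at any time.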

alex.tashev