
I am seeing intermittently dropped records (only for error messages, not for success ones). We have a test case that intermittently fails/passes because of a lost record. We are using "org.apache.beam.sdk.testing.TestPipeline.java" in the test case. This is the relevant setup code, where I have traced the dropped record:

    // Split the ParDo output into a success tag and a failure tag
    PCollectionTuple processed = records
        .apply("Process RosterRecord", ParDo.of(new ProcessRosterRecordFn(factory))
            .withOutputTags(TupleTags.OUTPUT_INTEGER, TupleTagList.of(TupleTags.FAILURE))
        );
    // Append this step's failures to the list of upstream error collections
    errors = errors.and(processed.get(TupleTags.FAILURE));

    PCollection<OrderlyBeamDto<Integer>> validCounts = processed.get(TupleTags.OUTPUT_INTEGER);

    // Merge all error collections, then publish each error
    PCollection<OrderlyBeamDto<Integer>> errorCounts = errors
        .apply("Flatten Roster File Error Count", Flatten.pCollections())
        .apply("Publish Errors", ParDo.of(new ErrorPublisherFn(factory)));

The relevant code in ProcessRosterRecordFn.java is this:

        if (dto.hasValidationErrors()) {
            RosterIngestError error = new RosterIngestError(record.getRowNumber(), record.toTitleValue());
            error.getValidationErrors().addAll(dto.getValidationErrors());
            error.getOldValidationErrors().addAll(dto.getOldValidationErrors());

            log.info("Tagging record row number=" + record.getRowNumber());
            c.output(TupleTags.FAILURE, new OrderlyBeamDto<>(error));
            return;
        }

For the 2 rows that fail, I see this "Tagging record row number=" log for both, including the record that ends up lost. ErrorPublisherFn.java, however, logs immediately on its first line as it receives each message, and SOMETIMES it only receives 1 of the 2 rows. When it receives both, the test passes. The test is very flaky in this regard.

Apache Beam is really annoying in its naming of threads (they all have the same name), so I added the thread hashcode to my logback pattern to get more insight, but I don't see anything conclusive, and the ErrorPublisherFn could publish #4 on any thread anyway.

Ok, so now the big question: how can I insert more instrumentation to figure out why this record is being dropped INTERMITTENTLY?
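One idea I am considering (a minimal sketch, assuming the standard Beam metrics API; `CountingFn` and the stage names are my own) is a pass-through DoFn that bumps a `Counter`, since counters are aggregated by the runner across threads and bundles and so sidestep the identical-thread-name logging problem:

    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.transforms.DoFn;

    // Pass-through DoFn that counts every element it sees.
    public class CountingFn<T> extends DoFn<T, T> {
        private final Counter counter;

        public CountingFn(String stageName) {
            this.counter = Metrics.counter("debug", stageName);
        }

        @ProcessElement
        public void processElement(@Element T element, OutputReceiver<T> out) {
            counter.inc();
            out.output(element); // pass the element through unchanged
        }
    }

It could be wired in on each side of the suspect transform, e.g. `processed.get(TupleTags.FAILURE).apply("Count Tagged", ParDo.of(new CountingFn<OrderlyBeamDto<Integer>>("failures-before-flatten")))`.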

Do I have to debug Apache Beam itself? Can I insert other functions or make changes to figure out why this error is 'sometimes' lost on some test runs and not others?
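With a counter like the one above in place, the attempted values can be read back after the run (a sketch, assuming `pipeline` is the TestPipeline instance):

    import org.apache.beam.sdk.PipelineResult;
    import org.apache.beam.sdk.metrics.MetricNameFilter;
    import org.apache.beam.sdk.metrics.MetricQueryResults;
    import org.apache.beam.sdk.metrics.MetricResult;
    import org.apache.beam.sdk.metrics.MetricsFilter;

    PipelineResult result = pipeline.run();
    result.waitUntilFinish();

    // Query every counter registered under the "debug" namespace above
    MetricQueryResults metrics = result.metrics().queryMetrics(
        MetricsFilter.builder()
            .addNameFilter(MetricNameFilter.inNamespace("debug"))
            .build());

    for (MetricResult<Long> counter : metrics.getCounters()) {
        System.out.println(counter.getName() + " = " + counter.getAttempted());
    }

Comparing the counts on each side of the flatten should pin down exactly which transform loses the element.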

EDIT: Thankfully, this set of tests does not test errors from upstream, so the line `errors = errors.and(processed.get(TupleTags.FAILURE));` can be removed, which in turn forces me to remove `.apply("Flatten Roster File Error Count", Flatten.pCollections())`. With those 2 lines removed, the issue goes away for 10 test runs in a row (i.e. I can't say for certain it is gone, given how flaky this is). Are we doing something wrong in the join and flattening? I checked the error structure: rowNumber is part of equals and hashCode, so there should be no duplicates, and I am not sure why duplicate objects would cause intermittent failure anyway.
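One thing worth checking, since `Flatten.pCollections()` requires all of its inputs to have compatible windowing strategies: print the strategy of every collection in the list before flattening (a sketch, assuming `errors` is a `PCollectionList<OrderlyBeamDto<Integer>>` as the `.and()` call suggests):

    // A mismatch in windowing strategies among the Flatten inputs would be
    // a prime suspect for lost elements.
    for (PCollection<OrderlyBeamDto<Integer>> pc : errors.getAll()) {
        System.out.println(pc.getName() + " -> " + pc.getWindowingStrategy());
    }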

What more can be done to debug here and figure out why this join is not working in the TestPipeline?
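One approach is a pass-through tracer DoFn placed immediately before and after the flatten, logging each element together with its pane and window (a sketch; `LogFn` and the labels are my own names):

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Pass-through tracer: logs each element with its pane and window so an
    // element can be followed into and out of the flatten.
    public class LogFn<T> extends DoFn<T, T> {
        private static final Logger log = LoggerFactory.getLogger(LogFn.class);
        private final String label;

        public LogFn(String label) {
            this.label = label;
        }

        @ProcessElement
        public void processElement(ProcessContext c, BoundedWindow window) {
            log.info("{} element={} pane={} window={}", label, c.element(), c.pane(), window);
            c.output(c.element());
        }
    }

If a record is logged with the "before" label but never the "after" label, the pane and window values should show whether it landed somewhere a downstream trigger discards.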

How can I get insight into the flatten and join so I can debug why we are losing an event, and why we lose it only 'sometimes'?
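Since the loss is timing-dependent, forcing the DirectRunner onto a single thread may make it deterministic (a sketch, assuming the direct runner that TestPipeline already uses is on the classpath):

    import org.apache.beam.runners.direct.DirectOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.testing.TestPipeline;

    DirectOptions options = PipelineOptionsFactory.create().as(DirectOptions.class);
    // Run the DirectRunner on one thread: if the loss stops, it points at a
    // concurrency race; if it keeps happening, it is now deterministic and
    // much easier to step through in a debugger.
    options.setTargetParallelism(1);
    TestPipeline pipeline = TestPipeline.fromOptions(options);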

Is this a windowing issue, even though our job started with a file to read in and we want to process that file? We wanted a constant Dataflow stream available since Google kept running into limits, but perhaps this was the wrong decision?
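If windowing is the culprit, one experiment is to re-window into the global window with an element-count trigger before flattening, which should rule out window/trigger incompatibilities (a sketch; only the tagged failures are shown, and the upstream error collections would need the same treatment before being flattened together):

    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    // Put the tagged failures into the global window with a trigger that
    // fires on every element, so nothing can be dropped as late or held
    // back in an unfired pane.
    PCollection<OrderlyBeamDto<Integer>> rewindowed = processed.get(TupleTags.FAILURE)
        .apply("Rewindow Failures", Window.<OrderlyBeamDto<Integer>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes());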

  • @Davor Bonaci not sure if I can tag you but you seemed to know about bounded / unbounded sources. – Dean Hiller Oct 26 '21 at 15:00
  • You can set breakpoints in `expand`, `process` of a DoFn, or `split` to debug where the issue is happening. If the problem is with duplicate values you can try `duplicated(keep=keep)`[1] when the values are returned. [1]https://beam.apache.org/releases/pydoc/2.31.0/apache_beam.dataframe.frames.html#apache_beam.dataframe.frames.DeferredSeries.duplicated – Jose Gutierrez Paliza Oct 26 '21 at 16:04
  • Is this a batch or streaming pipeline? If streaming, what windowing/triggering settings did you use? Is it possible that some events were dropped due to being late? – chamikara Oct 26 '21 at 16:50
  • @chamikara This IS a streaming job, yes. We would however like to use streaming with 'bounded' PCollections so that windowing is not a problem. Receiving a file event but creating a bounded PCollection for each file would be best. – Dean Hiller Oct 27 '21 at 12:19
