
I have implemented a data pipeline with multiple unbounded sources and side inputs. It joins the data with a sliding window (30s, sliding every 10s) and emits the transformed output to a Kafka topic. The issue I have is that data received in the first 10 seconds of a window is emitted 3 times, i.e. it is emitted whenever a new window starts until the first window completes. How can I emit the transformed data only once and avoid duplicates?

I have used discarding fired panes and it makes no difference. Whenever I try setting the window closing behavior to FIRE_ALWAYS or FIRE_IF_NON_EMPTY, it throws the error below.

Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalArgumentException: Empty PCollection accessed as a singleton view. Consider setting withDefault to provide a default value
    at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:332)
    at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:302)
    at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:197)
    at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:64)
    at org.apache.beam.sdk.Pipeline.run(Pipeline.java:313)
    at org.apache.beam.sdk.Pipeline.run(Pipeline.java:299)
    at y.yyy.main(yyy.java:86)
Caused by: java.lang.IllegalArgumentException: Empty PCollection accessed as a singleton view. Consider setting withDefault to provide a default value
    at org.apache.beam.sdk.transforms.View$SingletonCombineFn.identity(View.java:378)
    at org.apache.beam.sdk.transforms.Combine$BinaryCombineFn.extractOutput(Combine.java:481)
    at org.apache.beam.sdk.transforms.Combine$BinaryCombineFn.extractOutput(Combine.java:429)
    at org.apache.beam.sdk.transforms.Combine$CombineFn.apply(Combine.java:387)
    at org.apache.beam.sdk.transforms.Combine$GroupedValues$1.processElement(Combine.java:2089)

data.apply("Transform", ParDo.of(
  new DoFn<String, Row>() {

    private static final long serialVersionUID = 1L;

    @ProcessElement
    public void processElement(
      ProcessContext processContext,
      final OutputReceiver<Row> emitter) {

        String record = processContext.element();
        final String[] parts = record.split(",");
        emitter.output(Row.withSchema(sch).addValues(parts).build());
    }
  })).apply(
    "window1",
    Window
      .<Row>into(
        SlidingWindows
          .of(Duration.standardSeconds(30))
          .every(Duration.standardSeconds(10)))
      .withAllowedLateness(
        Duration.ZERO,
        Window.ClosingBehavior.FIRE_IF_NON_EMPTY)
  .discardingFiredPanes());

Kindly guide me on how to trigger the window only once, i.e. I don't want to send records that have already been processed.

Update: The above error for the side input occurs frequently and is not caused by the windows; it seems to be an issue in Apache Beam (https://issues.apache.org/jira/browse/BEAM-6086)

I tried using state to identify whether a row has already been processed, but the state is not retained or never gets set, i.e. I always get null when reading the state.

public class CheckState extends DoFn<KV<String,String>, KV<String,String>> {
  private static final long serialVersionUID = 1L;

  @StateId("count")
  private final StateSpec<ValueState<String>> countState =
      StateSpecs.value(StringUtf8Coder.of());

  @ProcessElement
  public void processElement(
      ProcessContext processContext,
      @StateId("count") ValueState<String> countState) {

    KV<String,String> record = processContext.element();
    String row = record.getValue();
    System.out.println("State: " + countState.read());
    System.out.println("Setting state as " + record.getKey()
        + " for value " + row.split(",")[0]);
    processContext.output(KV.of(record.getKey(), row));
    countState.write(record.getKey());
  }
}
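One likely explanation for the null reads (my interpretation, not stated in the thread): Beam user state declared with @StateId is scoped per key *and* per window. With a 30s window sliding every 10s, each element is processed in three overlapping window instances, and each of those windows gets its own fresh, independent state cell, so a write made in one window is invisible in the others. A plain-Java sketch of that addressing scheme (not the Beam API, just an illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Illustration (not Beam API): user state lives in a cell addressed by
// (key, window), not by key alone. Under sliding windows the same key
// occupies several windows at once, each with its own independent cell.
class StateScoping {
    // Simulated state store: "key@windowStart" -> value
    static final Map<String, String> store = new HashMap<>();

    static String read(String key, long windowStart) {
        return store.get(key + "@" + windowStart); // null if never written
    }

    static void write(String key, long windowStart, String value) {
        store.put(key + "@" + windowStart, value);
    }

    public static void main(String[] args) {
        // The same key "k" seen in two overlapping windows:
        write("k", 0L, "seen");
        System.out.println(read("k", 0L));  // "seen" - same (key, window)
        System.out.println(read("k", 10L)); // null   - different window, fresh state
    }
}
```

This would mean the CheckState approach cannot deduplicate across sliding windows as written, because no window ever sees another window's state.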
Gowtham
  • Can you elaborate on where you see the duplicates? What step happens after the window: is it a GBK or a combiner? – Reza Rokni Jul 17 '19 at 09:56
  • @RezaRokni After this step, I'm using SqlTransform to join 5 unbounded data streams and trying to print after. While printing, it prints the same data 3 times (once per sliding window initiation until the first/main window is complete) – Gowtham Jul 17 '19 at 11:10
  • Did you want a sliding window or fixed window? As the sliding window will continue to output all elements which fall in its min/max boundary until the min boundary passes over the elements timestamp. – Reza Rokni Jul 17 '19 at 12:29

1 Answer


If I have understood the issue correctly, it is related to the use of sliding windows in the pipeline:

Sliding time windows overlap; there is a nice explanation in the Beam programming guide under Window Functions:

"Because multiple windows overlap, most elements in a data set will belong to more than one window. This kind of windowing is useful for taking running averages of data; ..."

Fixed windows, however, do not overlap:

"A fixed time window represents a consistent duration, non overlapping time interval in the data stream.."
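To make the overlap concrete, here is a plain-Java sketch (not the Beam API, just the window-assignment arithmetic) for a 30s window sliding every 10s: every element falls into three sliding windows, but into exactly one fixed window of the same size. This is why each element is re-emitted three times in the pipeline above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration (not Beam API): which sliding windows of size 30s,
// period 10s, contain an element with a given timestamp in seconds.
class WindowAssignment {
    static final long SIZE = 30;   // window size in seconds
    static final long PERIOD = 10; // slide period in seconds

    // Returns the start times of every sliding window covering `ts`.
    static List<Long> slidingWindowsFor(long ts) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - (ts % PERIOD); // latest window start <= ts
        for (long start = lastStart; start > ts - SIZE; start -= PERIOD) {
            starts.add(start);
        }
        return starts;
    }

    // A fixed window of the same size assigns each element to exactly one window.
    static long fixedWindowFor(long ts) {
        return ts - (ts % SIZE);
    }

    public static void main(String[] args) {
        // An element at t=25s belongs to three overlapping sliding windows,
        // but only one fixed window:
        System.out.println(slidingWindowsFor(25)); // [20, 10, 0]
        System.out.println(fixedWindowFor(25));    // 0
    }
}
```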

Reza Rokni
  • I am using a sliding window because I need the window to wait up to 30 seconds between the arrival of the first and last record (from multiple streams). Also, I need the arriving records to be processed only once per 30s window. But since the sliding windows overlap, the processed records accumulate across 3 windows (every 10s), and the same records are processed three times. I just need a way to avoid this, i.e. even though I use a sliding window, I want the data to be processed only once. Also, can you help me understand why the state is null while using the sliding window? Thanks, – Gowtham Jul 18 '19 at 04:01
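One way to get process-once semantics despite overlapping windows (my suggestion, not from the thread) is to deduplicate on a stable record id in a scope that does not reset per sliding window, e.g. after the join. Stripped of the Beam machinery, the idea reduces to a seen-set filter, sketched here in plain Java:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration (not Beam API): emit each record id at most once,
// no matter how many overlapping windows re-deliver it.
class EmitOnce {
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time an id is offered, false on re-delivery.
    boolean firstTime(String id) {
        return seen.add(id);
    }

    public static void main(String[] args) {
        EmitOnce filter = new EmitOnce();
        List<String> out = new ArrayList<>();
        // The same record delivered by three overlapping windows:
        for (String id : new String[] {"r1", "r1", "r1", "r2"}) {
            if (filter.firstTime(id)) out.add(id);
        }
        System.out.println(out); // [r1, r2]
    }
}
```

In Beam terms this corresponds to keeping the dedup state in a scope wider than a single sliding window (for example, by keying on the record id and filtering in a window that does not overlap); the exact wiring depends on the pipeline, so treat this as a sketch of the approach rather than a drop-in fix.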