I have implemented a data pipeline with multiple unbounded sources and side inputs. It joins the data using a sliding window (30 s long, starting every 10 s) and emits the transformed output to a Kafka topic. The issue is that data received in the first 10 seconds of a window is emitted 3 times, i.e. it is emitted again each time a new window starts, until the first window completes. How can I emit the transformed data only once, or otherwise avoid duplicates?
I have tried discarding fired panes, and it makes no difference. Whenever I set the window closing behavior to FIRE_ALWAYS or FIRE_IF_NON_EMPTY, it throws the error below.
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalArgumentException: Empty PCollection accessed as a singleton view. Consider setting withDefault to provide a default value
    at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:332)
    at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:302)
    at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:197)
    at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:64)
    at org.apache.beam.sdk.Pipeline.run(Pipeline.java:313)
    at org.apache.beam.sdk.Pipeline.run(Pipeline.java:299)
    at y.yyy.main(yyy.java:86)
Caused by: java.lang.IllegalArgumentException: Empty PCollection accessed as a singleton view. Consider setting withDefault to provide a default value
    at org.apache.beam.sdk.transforms.View$SingletonCombineFn.identity(View.java:378)
    at org.apache.beam.sdk.transforms.Combine$BinaryCombineFn.extractOutput(Combine.java:481)
    at org.apache.beam.sdk.transforms.Combine$BinaryCombineFn.extractOutput(Combine.java:429)
    at org.apache.beam.sdk.transforms.Combine$CombineFn.apply(Combine.java:387)
    at org.apache.beam.sdk.transforms.Combine$GroupedValues$1.processElement(Combine.java:2089)
data.apply("Transform", ParDo.of(
        new DoFn<String, Row>() {
            private static final long serialVersionUID = 1L;

            @ProcessElement
            public void processElement(
                    ProcessContext processContext,
                    final OutputReceiver<Row> emitter) {
                String record = processContext.element();
                final String[] parts = record.split(",");
                emitter.output(Row.withSchema(sch).addValues(parts).build());
            }
        }))
    .apply(
        "window1",
        Window
            .<Row>into(
                SlidingWindows
                    .of(Duration.standardSeconds(30))
                    .every(Duration.standardSeconds(10)))
            .withAllowedLateness(
                Duration.ZERO,
                Window.ClosingBehavior.FIRE_IF_NON_EMPTY)
            .discardingFiredPanes());
Please advise how to trigger the window only once, i.e. I don't want to send records that have already been processed.
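To make the "emitted 3 times" observation concrete, here is a minimal plain-Java sketch (no Beam dependency) of how a timestamp maps onto 30 s windows that start every 10 s. The class and method names (`SlidingWindowAssignment`, `windowStartsFor`) are made up for illustration and are not Beam APIs; the arithmetic is only intended to mirror how sliding-window assignment is generally described. It suggests that each element belongs to size / period = 3 windows, so seeing it once per window may be expected sliding-window behaviour rather than a trigger problem. Discarding fired panes only affects repeated firings of the same window.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration (no Beam dependency) of sliding-window assignment.
class SlidingWindowAssignment {

    // Returns the start times (ms) of every sliding window [start, start + size)
    // that contains the given timestamp, latest window first.
    static List<Long> windowStartsFor(long timestampMs, long sizeMs, long periodMs) {
        List<Long> starts = new ArrayList<>();
        // Start of the latest window that begins at or before the timestamp.
        long lastStart = timestampMs - Math.floorMod(timestampMs, periodMs);
        for (long start = lastStart; start > timestampMs - sizeMs; start -= periodMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long size = 30_000L;   // 30 s window, as in the pipeline above
        long period = 10_000L; // a new window every 10 s

        // An element at t = 5 s falls into the windows starting at 0 s, -10 s and -20 s.
        System.out.println(windowStartsFor(5_000L, size, period));

        // When period == size (a fixed window), the same element is in exactly one window.
        System.out.println(windowStartsFor(5_000L, 10_000L, 10_000L));
    }
}
```

If each record must appear in exactly one output, a window whose period equals its size (i.e. a fixed window) is one option worth considering, at the cost of losing the overlapping 30 s view.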
Update: The above side-input error occurs frequently and is not caused by the windows; it seems to be an issue in Apache Beam (https://issues.apache.org/jira/browse/BEAM-6086).
I tried using state to identify whether a row has already been processed, but the state is not retained or is never set, i.e. I always get null when reading the state.
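import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

public class CheckState extends DoFn<KV<String, String>, KV<Integer, String>> {
    private static final long serialVersionUID = 1L;

    // Holds the last key written by this DoFn.
    @StateId("count")
    private final StateSpec<ValueState<String>> countState =
        StateSpecs.value(StringUtf8Coder.of());

    @ProcessElement
    public void processElement(
            ProcessContext processContext,
            @StateId("count") ValueState<String> countState) {
        KV<String, String> record = processContext.element();
        String row = record.getValue();
        // Read happens before the write below, so the first read is null.
        System.out.println("State: " + countState.read());
        System.out.println("Setting state as " + record.getKey() + " for value " + row.split(",")[0]);
        // NOTE: `current` is not declared in this snippet; it is assumed to be an Integer defined elsewhere.
        processContext.output(KV.of(current, row));
        countState.write(record.getKey());
    }
}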
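One possible explanation for the null reads, assuming CheckState runs downstream of the sliding window: Beam scopes state to a key and a window, so each of the three overlapping windows has its own, initially empty, cell. The plain-Java simulation below (no Beam dependency) contrasts a per-key-and-window cell with a per-key cell using the same read-then-write order as CheckState. `StateScopeSimulation` and `readThenWrite` are hypothetical names used only for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java simulation (no Beam dependency) of how a per-key-and-window
// state cell behaves compared with a per-key cell.
class StateScopeSimulation {

    private final Map<String, String> cells = new HashMap<>();
    private final boolean scopedPerWindow;

    StateScopeSimulation(boolean scopedPerWindow) {
        this.scopedPerWindow = scopedPerWindow;
    }

    // Mimics the read-then-write order in CheckState: returns the value read
    // before the write (null when the cell has never been written).
    String readThenWrite(String key, long windowStartMs, String newValue) {
        String cellId = scopedPerWindow ? key + "@" + windowStartMs : key;
        String previous = cells.get(cellId);
        cells.put(cellId, newValue);
        return previous;
    }

    public static void main(String[] args) {
        // The three overlapping windows that contain an element at t = 5 s.
        long[] windowStarts = {0L, -10_000L, -20_000L};

        StateScopeSimulation perWindow = new StateScopeSimulation(true);
        for (long start : windowStarts) {
            // Each window has a fresh cell, so every read comes back null.
            System.out.println("per window: " + perWindow.readThenWrite("k1", start, "k1"));
        }

        StateScopeSimulation perKey = new StateScopeSimulation(false);
        for (long start : windowStarts) {
            // One shared cell: null on the first read, then the stored key.
            System.out.println("per key:    " + perKey.readThenWrite("k1", start, "k1"));
        }
    }
}
```

If this is the cause, re-windowing into the global window before the stateful DoFn might let the state survive across the overlapping windows, although the state would then grow without bound unless it is cleared explicitly.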