I'm trying to use Flink in both a streaming and a batch way to write a lot of data into Accumulo (a few million records a minute). I want to batch up records before sending them to Accumulo. I ingest data either from a directory or via Kafka, convert the data using a flatMap, and then pass it to a RichSinkFunction, which adds the data to a collection.

With the streaming data, batching seems OK: I add the records to a fixed-size collection, which gets sent to Accumulo once the batch threshold is reached. But for the batch data, which is finite, I'm struggling to find a good approach, because it would require a flush timeout in case no further data arrives within a specified time. There doesn't seem to be an Accumulo connector, unlike for Elasticsearch or other sinks.

I thought about using a ProcessFunction with a trigger for batch size and time interval, but this requires a keyed window. I didn't want to go down the keyed route because the data looks very skewed: some keys would have a ton of records and some would have very few. If I don't use a windowed approach, then as I understand it the operator won't be parallel. I was hoping to batch lazily, so that each sink instance only cares about the record count or a time interval.

Has anybody got any pointers on how best to address this?

zargarf
  • I don't know anything about Flink, but a lot of frameworks support MapReduce OutputFormat sinks, and Accumulo has some OutputFormats. Maybe that will work for you? – Christopher Sep 07 '18 at 00:37

1 Answer

You can access timers in a sink by implementing ProcessingTimeCallback. For an example, look at the BucketingSink -- its open and onProcessingTime methods should get you started.
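
To make the idea concrete, here is a minimal sketch of the count-or-timeout buffering logic in plain Java, with the Flink wiring deliberately left out. The class name `TimedBatchBuffer`, its parameters, and the `flusher` callback are all hypothetical names for illustration, not part of Flink or Accumulo. The assumption is that a sink would call `add` from `invoke`, call `onProcessingTime` from its `ProcessingTimeCallback` (re-registering the timer each time, as BucketingSink does), and call `flush` from `close` so the tail of a finite input is written out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Buffers records and hands them to a flusher when either the batch is full
 * or the oldest buffered record has waited longer than maxWaitMillis.
 * Hypothetical helper: the names here are illustrative, not a Flink API.
 */
final class TimedBatchBuffer<T> {
    private final int maxBatchSize;
    private final long maxWaitMillis;
    private final Consumer<List<T>> flusher; // assumed to wrap the actual Accumulo write
    private final List<T> buffer = new ArrayList<>();
    private long oldestRecordTime = -1L;     // arrival time of the first buffered record

    TimedBatchBuffer(int maxBatchSize, long maxWaitMillis, Consumer<List<T>> flusher) {
        this.maxBatchSize = maxBatchSize;
        this.maxWaitMillis = maxWaitMillis;
        this.flusher = flusher;
    }

    /** Intended to be called per record; flushes as soon as the batch is full. */
    synchronized void add(T record, long nowMillis) {
        if (buffer.isEmpty()) {
            oldestRecordTime = nowMillis;
        }
        buffer.add(record);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    /** Intended to be called from a timer; flushes a partial batch that has waited too long. */
    synchronized void onProcessingTime(long nowMillis) {
        if (!buffer.isEmpty() && nowMillis - oldestRecordTime >= maxWaitMillis) {
            flush();
        }
    }

    /** Sends whatever is buffered; a no-op when the buffer is empty. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flusher.accept(new ArrayList<>(buffer)); // copy, so the flusher owns its list
        buffer.clear();
        oldestRecordTime = -1L;
    }

    synchronized int pending() {
        return buffer.size();
    }
}
```

Because each parallel sink instance would hold its own buffer, this needs no keying: every instance flushes on its own count or its own timeout, which sidesteps the skew problem described in the question.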

David Anderson
  • You saved me a lot of time! I searched for a lot of information but found nothing. In addition, `BucketingSink` is now deprecated; you can refer to [StreamingFileSink](https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.java) instead. Keywords: flink, sink, timer, batch, cache – ysjiang Oct 18 '19 at 06:39