I'm trying to use Flume to ship my access logs to a Spark cluster, but there are a number of limitations that force me to write a custom application (or a custom Flume source) to read the log files.

What I'm trying to do is get the Flume client to signal this source when it can't write the data to the sink. We constantly have long network outages, and there isn't enough disk space to queue the failed logs on disk until the network is back up. Instead, I would like to "tell" the source to stop reading the logs until the network is up, and then "tell" it to start again. So far, though, I haven't seen any kind of error callbacks in the documentation.

Is there any way I can achieve this without reinventing the wheel?

Emam
  • Which source type are you using? Are you copying the log files to HDFS or are you streaming the output to SparkStreaming with AvroSink? Your scenario is actually something Flume was designed to handle, but you should provide a bit more insight into your configuration for us to give you pointers. Maybe a copy of your flume.conf would do. – Erik Schmiegelow Sep 08 '15 at 15:47
  • Ah, sorry, forgot to mention that I'm using SparkStreaming. So I'm pushing the logs to an Avro sink on Spark. I just need to guarantee that the stream will continue automatically after network outages, without duplicating the logs on disk during the outage (normal access log files + Flume disk buffering). – Emam Sep 08 '15 at 15:57
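
For reference, the receiving side of that push-based setup is usually a Spark Streaming Flume receiver along these lines. This is only an illustrative sketch; the application name, hostname, port and batch interval are placeholders rather than details from the question:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.flume.FlumeUtils;
    import org.apache.spark.streaming.flume.SparkFlumeEvent;

    public class AccessLogStream {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("AccessLogStream");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            // the Flume AvroSink must point at this hostname/port
            JavaReceiverInputDStream<SparkFlumeEvent> events =
                    FlumeUtils.createStream(jssc, "spark-receiver-host", 4545);

            // just count events per batch to confirm the stream is flowing
            events.count().print();

            jssc.start();
            jssc.awaitTermination();
        }
    }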

1 Answer

OK, so now that we've clarified a few questions, here's what actually happens:

Flume Source - SpoolDir or similar -> Channel -> AvroSink (SparkStreaming)

Flume parses a file and converts its lines to FlumeEvents, which get spooled to the Channel. This happens as quickly as possible, at least until the channel is full. If the Channel is full, the source will back off until the channel accepts records again. You can control the capacity of a Channel by specifying how much memory and how many records it can hold.
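
In flume.conf terms, a minimal sketch of that pipeline could look like the following. The agent name a1, the spool directory, the receiver host/port and the capacity figures are placeholder values for illustration, not settings from the question:

    # assumed agent and component names
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # source: spool completed access log files from a directory
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /var/log/access-spool
    a1.sources.r1.channels = c1

    # memory channel: capacity bounds the number of buffered events
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 100000
    # events per put/take transaction
    a1.channels.c1.transactionCapacity = 1000
    # optional cap on the total event body bytes held in memory
    a1.channels.c1.byteCapacity = 536870912

    # sink: push events to the Spark Streaming Avro receiver
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = spark-receiver-host
    a1.sinks.k1.port = 4545
    a1.sinks.k1.channel = c1

Once capacity (or byteCapacity) is reached, puts from the source start failing and the source backs off until the sink drains the channel again.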

The channel is read by the AvroSink. If the AvroSink cannot submit the events because of a network outage, it stops consuming from the channel, which eventually leads to a full channel.

At that moment you will see messages in Flume's log file indicating that the sinks cannot keep up with the sources. This is expected behaviour, as your channel acts as a back buffer for your (unreliable) sink. You will not experience duplicate processing of events; however, you might lose some events to outages if you choose a non-durable channel type such as MemoryChannel.
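
If some disk can be spared for the backlog, a durable file channel is the usual way to avoid that loss. The paths and capacity below are illustrative only, and given the limited disk space mentioned in the question this is a trade-off rather than a recommendation:

    # durable channel: events survive restarts and long outages,
    # at the cost of disk space for checkpoints and queued data
    a1.channels.c1.type = file
    a1.channels.c1.checkpointDir = /var/lib/flume/checkpoint
    a1.channels.c1.dataDirs = /var/lib/flume/data
    a1.channels.c1.capacity = 1000000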

Erik Schmiegelow
  • I'm not worried about duplicate processing of my events by Flume. Just to clarify a bit more, my question is how I can signal a custom source to back off, or stop generating any more events. As far as I've seen, custom sources and sinks are completely decoupled, and it's Flume's task to queue the unprocessed events somewhere until the sink is back online. – Emam Sep 09 '15 at 08:14
  • As I wrote, the back-off mechanism is triggered by the channel capacity: if the channel is full, the sources will back off. There's no functionality to do that explicitly, but that back-off mechanism is actually quite reliable. – Erik Schmiegelow Sep 09 '15 at 09:26
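
To make that concrete for a custom source: a pollable source sees a full channel as a ChannelException when it tries to put an event, and it signals back-off by returning Status.BACKOFF from process(). The sketch below is illustrative only; the class name and the readNextLogLine() helper are invented, and it assumes the Flume 1.6-era PollableSource interface:

    import java.nio.charset.StandardCharsets;

    import org.apache.flume.ChannelException;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.PollableSource;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.source.AbstractSource;

    public class BackingOffLogSource extends AbstractSource
            implements PollableSource, Configurable {

        // line that was read but not yet accepted by the channel
        private String pending;

        @Override
        public void configure(Context context) {
            // read the log directory / file pattern from the agent configuration
        }

        @Override
        public Status process() throws EventDeliveryException {
            if (pending == null) {
                pending = readNextLogLine();   // hypothetical tailing helper
            }
            if (pending == null) {
                return Status.BACKOFF;         // nothing new to read yet
            }
            Event event = EventBuilder.withBody(pending.getBytes(StandardCharsets.UTF_8));
            try {
                getChannelProcessor().processEvent(event);
                pending = null;                // only advance after a successful put
                return Status.READY;
            } catch (ChannelException channelFull) {
                // the channel is full (e.g. the AvroSink cannot reach Spark);
                // keep the line and let Flume retry process() after a back-off sleep
                return Status.BACKOFF;
            }
        }

        private String readNextLogLine() {
            // placeholder for the actual file-reading logic
            return null;
        }
    }

Because the read position only advances after a successful put, a long outage simply pauses reading instead of requiring the backlog to be duplicated on disk.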