
If we want Flume to ingest data from a spoolDir that contains gzip files, what should I change in the source? Is a customized EventDeserializer enough, or do we also need a new source type (e.g., a customized GzipSpoolDirectorySource instead of the default spooldir)?

f_puras
user3502577
  • Related topic --> http://stackoverflow.com/questions/18376831/compressed-file-ingestion-using-flume – frb Jun 12 '15 at 06:41
  • Do you want to unpack the file and spool individual events within or do you want to process the entire file? – Erik Schmiegelow Jun 12 '15 at 13:58
  • @ErikSchmiegelow Hi, I want to process the entire gzip file. I looked at the default LineDeserializer: its constructor only accepts a ResettableInputStream, so in our case we would need a way to decode gzipped data from a ResettableInputStream. Otherwise, it seems we would have to customize the spoolDir source type (and also the spoolFileEventReader and Deserializer related to spoolSourceDirectory). – user3502577 Jun 12 '15 at 16:31

1 Answer


OK, so if you don't want to unpack your GZIP files at the Flume level, that's actually quite easy. You can configure your Spool Dir source to use a BlobDeserializer:

https://flume.apache.org/FlumeUserGuide.html#event-deserializers

This will read the entire file as one event and spool that. If you want to store it in HDFS, for instance, make sure that you activate the fileHeader property on your spool dir source. You can then use the %{file} variable in your path, which effectively allows you to use Flume as a one-to-one file copy mechanism.
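
A minimal agent configuration along these lines might look like the sketch below. The agent and component names (a1, src1, ch1, sink1), the directory paths, and the size limits are all placeholders I made up for illustration, so adjust them to your setup. It also assumes the BlobDeserializer class from the morphline Solr sink module is on the agent's classpath.

```properties
# Hypothetical agent "a1": one spooldir source, one memory channel, one HDFS sink.
a1.sources = src1
a1.channels = ch1
a1.sinks = sink1

# Spooling directory source; /var/spool/flume-gz is a placeholder path.
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /var/spool/flume-gz
a1.sources.src1.channels = ch1
# Store the absolute path of each source file in the "file" event header.
a1.sources.src1.fileHeader = true
# Read each file as a single event instead of splitting it into lines.
a1.sources.src1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
# Upper bound in bytes on a single blob; raise it if your gzip files are larger.
a1.sources.src1.deserializer.maxBlobLength = 100000000

# Each event holds a whole file in memory, so keep the channel small.
a1.channels.ch1.type = memory
a1.channels.ch1.capacity = 100
a1.channels.ch1.transactionCapacity = 10

# HDFS sink; /flume/ingest is a placeholder target directory.
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.channel = ch1
# %{file} expands to the value of the "file" header set by the source.
a1.sinks.sink1.hdfs.path = /flume/ingest/%{file}
# Write the raw event bytes rather than a SequenceFile.
a1.sinks.sink1.hdfs.fileType = DataStream
# Roll after every event so each source file ends up in its own HDFS file.
a1.sinks.sink1.hdfs.rollCount = 1
a1.sinks.sink1.hdfs.rollInterval = 0
a1.sinks.sink1.hdfs.rollSize = 0
```

Keep in mind that the whole file is buffered as one event, so this only suits files that fit comfortably in the agent's heap. If you would rather have just the file name than the full path, the spooldir source also has a basenameHeader property that fills a %{basename} header.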

Erik Schmiegelow