I have data coming into a spooling directory, and I am picking it up with Flume and forwarding it on for further processing.

Some of the files are not required, so I am using the ignorePattern property in Flume to keep them from being picked up.

The problem is that I receive roughly equal numbers of required and unrequired files, and I have no control over the incoming data, so I have to accept whatever lands in the spooling directory.

Since I get quite a lot of these unrequired files, I don't have the disk space to store them for long. So I was wondering whether there is a way for Flume to delete these files automatically too, just as it does for all .COMPLETED files (yes, I am deleting the files that get picked up by Flume).

1 Answer

The Flume Spooling Directory Source has no ability to delete ignored files. Its deletePolicy setting (never or immediate) applies only to files that it has processed.

There are three ways to solve this problem.

First, you can fix the problem outside Flume, with a shell script or any other small program that finds the files matching the ignored pattern and deletes them. In my opinion this is not a good way to do it.
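The first option could look something like the minimal sketch below. It is only an illustration under stated assumptions: the class name IgnoredFileCleaner, the method deleteIgnored, and the minimum-age safeguard are all hypothetical and not part of Flume. It matches the regex against the whole file name, which I believe mirrors how the spooling directory source applies ignorePattern, but check this against your Flume version. The age check is there so that files still being written are left alone.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Flume: removes ignored files from a spooling directory.
public class IgnoredFileCleaner {

    /**
     * Deletes regular files directly inside spoolDir whose names fully match
     * ignorePattern and whose last modification is older than minAge.
     * Returns the number of files deleted.
     */
    public static int deleteIgnored(Path spoolDir, Pattern ignorePattern, Duration minAge)
            throws IOException {
        Instant cutoff = Instant.now().minus(minAge);
        int deleted = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(spoolDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                // Assumption: whole-name match, the same regex you give to ignorePattern.
                boolean ignored = ignorePattern.matcher(name).matches();
                if (ignored
                        && Files.isRegularFile(entry)
                        && Files.getLastModifiedTime(entry).toInstant().isBefore(cutoff)) {
                    Files.delete(entry);
                    deleted++;
                }
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.err.println("usage: IgnoredFileCleaner <spoolDir> <ignoreRegex> <minAgeSeconds>");
            return;
        }
        int deleted = deleteIgnored(
                Paths.get(args[0]),
                Pattern.compile(args[1]),
                Duration.ofSeconds(Long.parseLong(args[2])));
        System.out.println("Deleted " + deleted + " ignored file(s)");
    }
}
```

Something like this could be run periodically (for example from cron) by a user that has write permission on the spooling directory.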

Second, you can write your own custom spooling directory source by implementing the Flume Source interface. That is a lot of effort for such a small problem.

Third, as a somewhat abusive workaround, you can use the Morphline Interceptor. It is described in the Flume User Guide, and you may also want to take a look at the Morphlines Reference Guide.

As you know, interceptors receive each event from the source, do some processing on it, and then forward it to the channel.

If you choose the third solution, you have to use kite-sdk. Add Cloudera's Kite Morphlines Core dependency to your FLUME_CLASSPATH in flume-env.sh, or simply put the jar in $APACHE_FLUME_HOME/lib.

With this solution, an example Flume configuration would be:

a1.channels = ch-1
a1.sources = src-1
a1.sinks = k1
a1.sources.src-1.interceptors = morph

a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /spool/dir
a1.sources.src-1.fileHeader = true
a1.sources.src-1.ignorePattern = whatever

a1.sources.src-1.interceptors.morph.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
a1.sources.src-1.interceptors.morph.morphlineFile = /etc/flume-ng/conf/morphline.conf
a1.sources.src-1.interceptors.morph.morphlineId = morphline1

a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = ch-1
a1.sinks.k1.sink.directory = /roll/dir

Then you can create the morphline configuration file, for example as $APACHE_FLUME_HOME/conf/morphline.conf (the morphlineFile property above must point to wherever you put it).

In this file you can do whatever processing you want; just make sure the record object is passed on to the child command with return child.process(record).

This is not a great solution either, but it lets you run your own Java code inside Flume's transactions. On each event you can check the directory and delete any file you do not need. (Make sure the user running the Flume Java process has write permission on that directory.)

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      {
        readJson { }
      }
      {
        java {
          imports : """
            import java.io.File;
            import java.io.IOException;
          """
          code : """
            try {
                // This code is from my Flume agent; you may want to use it, but it is not necessary
                // JsonNode rootNode = (JsonNode) record.getFirstValue(Fields.ATTACHMENT_BODY);

                // Here you can traverse the relevant directory,
                // find the files matching the ignored pattern manually,
                // and delete them with Java code

                // Second part of my code
                // String rootNodeStr = rootNode.toString();
                // record.put("rootNodeStr", rootNodeStr.getBytes(StandardCharsets.UTF_8));
            } catch (Exception e) {
              // Exception rather than IOException: a checked exception that the
              // try body never throws would not compile
              logger.error("So sad", e);
            }
            return child.process(record);
          """
        }
      }
      {
        setValues {
          _attachment_body : "@{rootNodeStr}"
        }
      }
    ]
  }
]
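Because the interceptor runs for every event, scanning the directory each time could get expensive. One way to limit that is to throttle the cleanup so it runs at most once per interval. The sketch below is only an illustration: ThrottledTask and runIfDue are hypothetical names, not part of Flume or Kite, and it assumes you package the class in a jar on Flume's classpath so that the java command body can call it.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of Flume or Kite: runs a task at most once per interval.
public class ThrottledTask {
    private final long intervalMillis;
    // Long.MIN_VALUE marks "never run yet".
    private final AtomicLong lastRun = new AtomicLong(Long.MIN_VALUE);

    public ThrottledTask(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /**
     * Runs the task if it has never run, or if at least intervalMillis have passed
     * since the last run. Returns true if the task was run by this call.
     */
    public boolean runIfDue(long nowMillis, Runnable task) {
        long previous = lastRun.get();
        boolean due = previous == Long.MIN_VALUE || nowMillis - previous >= intervalMillis;
        // compareAndSet ensures only one thread wins when several arrive together.
        if (due && lastRun.compareAndSet(previous, nowMillis)) {
            task.run();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ThrottledTask cleanup = new ThrottledTask(60_000L); // at most once a minute
        boolean ran = cleanup.runIfDue(System.currentTimeMillis(),
                () -> System.out.println("cleaning spool directory (placeholder)"));
        System.out.println("ran on first call: " + ran);
    }
}
```

Inside the java command you would keep a single shared instance and pass it the code that deletes the unwanted files as the task.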

I hope this helps.