
In Apache NiFi, I am using FetchS3Object to read from an S3 bucket. I see that it can read all the objects already in the bucket as well as those added later. Is it possible:

  1. To configure the processor to read only objects added from now onwards, not the ones already present?
  2. How can I make it read a particular folder in the bucket?

NiFi seems great; it is just missing examples in its documentation for at least the popular processors.

Bless
Sammy

3 Answers


A combination of ListS3 and FetchS3Object processors will do this:

  1. ListS3 - to enumerate your S3 bucket and generate flowfiles referencing each object. You can configure the Prefix property to specify a particular folder in the bucket to enumerate only a subset. ListS3 keeps track of what it has read using NiFi's state feature, so it will generate new flowfiles as new objects are added to the bucket.
  2. FetchS3Object - to read S3 objects into flowfile content. You can use the output of ListS3 by configuring FetchS3Object's Bucket property to ${s3.bucket} and leaving its Object Key property at the default, ${filename} (ListS3 sets the filename attribute of each flowfile to the object's key).
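To make the behavior in step 1 concrete, here is a plain-Python sketch of the prefix filtering and state tracking that ListS3 performs. This is a conceptual illustration only; the function and field names are invented for the example and are not NiFi APIs.

```python
# Conceptual sketch of ListS3's Prefix + state behavior (not a NiFi API).
# Each object is represented as a dict with a key and a last-modified
# timestamp in epoch milliseconds.

def list_new_objects(objects, prefix, state):
    """Return objects under `prefix` modified after state['last_seen'],
    then advance the stored timestamp, mimicking NiFi's state feature."""
    last_seen = state.get("last_seen", 0)
    new = [o for o in objects
           if o["key"].startswith(prefix) and o["last_modified"] > last_seen]
    if new:
        state["last_seen"] = max(o["last_modified"] for o in new)
    return new

state = {}
bucket = [
    {"key": "logs/a.txt", "last_modified": 100},
    {"key": "data/b.txt", "last_modified": 200},
]
first = list_new_objects(bucket, "logs/", state)   # picks up logs/a.txt only
second = list_new_objects(bucket, "logs/", state)  # nothing new since last run
```

Because the state persists between runs, a second invocation returns nothing until a newer object appears under the prefix, which is why ListS3 emits each object only once.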


James
  • Thanks James. I am already doing that successfully. My questions are specific to a use case where I want to read only new files as they are added, not the old ones already in the bucket – Sammy Jan 21 '17 at 19:56
  • ListS3 will identify new objects. You can let it run to read up to 'now' and discard the output for the existing files. – James Jan 21 '17 at 21:09
  • 1
    I ended up using ListS3 + FetchS3Object along with RouteOnAttribute where I added condition ${s3.lastModified:ge(1485189600000)} to route only recently added documents. – Sammy Jan 23 '17 at 22:36
  • James - Is there any way to mention more than one folder path from the same bucket in ListS3 processor? – zniv Feb 21 '18 at 17:57
  • @zniv ListS3 accepts only one Prefix. But you can use multiple ListS3 processors to achieve the same effect. – James Feb 22 '18 at 04:14
  • We can use multiple ListS3 processors, but I have more than 20 such folders, James. And I have to clear the state of all the ListS3 processors before executing again when I schedule the flow. Is there any other way to list files from various, but not all, folders in an S3 bucket without the ListS3 processor? – zniv Feb 27 '18 at 11:57
  • I recommend listing all of the items in the bucket, and then filtering the output flowfiles with a RouteOnAttribute processor to filter down to only the paths you are interested in. – James Feb 27 '18 at 14:35
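A side note on the comment thread above: the cutoff in ${s3.lastModified:ge(1485189600000)} is an epoch timestamp in milliseconds. A value like it can be computed in Python; the date chosen below is only an example.

```python
from datetime import datetime, timezone

# Epoch-milliseconds cutoff for a RouteOnAttribute expression such as
# ${s3.lastModified:ge(...)}; the date below is just an example.
cutoff = datetime(2017, 1, 23, tzinfo=timezone.utc)
cutoff_millis = int(cutoff.timestamp() * 1000)
print(cutoff_millis)  # 1485129600000
```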

Another approach would be to configure your S3 bucket to send SNS notifications and subscribe an SQS queue to the topic. NiFi then reads from the SQS queue to receive the notifications, filters the objects of interest, and processes them.

See Monitoring An S3 Bucket in Apache NiFi for more on this approach.
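When NiFi consumes these notifications (for example with GetSQS followed by EvaluateJsonPath), each message body is an S3 event JSON document. A minimal sketch of pulling the bucket and key out of such a message in Python follows; the sample payload is abbreviated from the S3 event format, not a complete message.

```python
import json

# Abbreviated S3 event notification payload (real messages carry many
# more fields, e.g. eventName and eventTime).
message_body = json.dumps({
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"},
                "object": {"key": "incoming/report.csv"}}}
    ]
})

event = json.loads(message_body)
record = event["Records"][0]
bucket = record["s3"]["bucket"]["name"]   # "my-bucket"
key = record["s3"]["object"]["key"]       # "incoming/report.csv"
```

In EvaluateJsonPath the equivalent expressions would be $.Records[0].s3.bucket.name and $.Records[0].s3.object.key, which can then feed FetchS3Object.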

James

Use the GetSQS and FetchS3Object processors, and configure your GetSQS processor to listen for notifications about newly added files. It's an event-driven approach: whenever a new file arrives, the SQS queue sends a notification to NiFi. See the following link for full clarification: AWS-NIFI integration
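One detail worth knowing with the SQS approach: object keys in S3 event notifications are URL-encoded (spaces arrive as +). If you process the key before handing it to FetchS3Object, decode it first; in Python:

```python
from urllib.parse import unquote_plus

# Keys in S3 event notifications are URL-encoded; decode before use.
raw_key = "folder/my+report+%281%29.csv"
key = unquote_plus(raw_key)  # "folder/my report (1).csv"
```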

khushbu kanojia