I am using the Logstash S3 Input plugin to process S3 access logs.
The access logs are all stored in a single bucket, and there are thousands of them. I have configured the plugin to only include S3 objects with a certain prefix (based on date, e.g. 2016-06), roughly as sketched below.
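For reference, this is more or less how the input is set up (the bucket name and region here are placeholders, not my real values):

    input {
      s3 {
        bucket   => "my-access-logs-bucket"   # placeholder bucket name
        prefix   => "2016-06"                 # only include keys starting with this date prefix
        region   => "eu-west-1"               # placeholder region
        interval => 60                        # poll the bucket every 60 seconds (the default)
        type     => "s3-access-log"
      }
    }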
However, I can see that Logstash is re-polling every object in the bucket rather than skipping objects it has previously analysed, for example:
{:timestamp=>"2016-06-21T08:50:51.311000+0000", :message=>"S3 input: Found key", :key=>"2016-06-01-15-21-10-178896183CF6CEBB", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"111", :method=>"list_new_files"}
In other words, every minute (or whatever interval you have set), Logstash starts at the beginning of the bucket and makes an AWS API call for every object it finds, apparently to determine each object's last-modified time so that it can decide which files to include for analysis. This obviously slows everything down and means I don't get anything close to real-time analysis of the access logs.
Other than constantly updating the prefix to match only recent files, is there some way to make Logstash skip reading older S3 objects entirely?
The plugin does have a sincedb_path parameter, but that only seems to control where the record of the last analysed file is written, as in the snippet below.
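For completeness, this is the sort of thing I mean (the path is just an example I chose):

    input {
      s3 {
        bucket       => "my-access-logs-bucket"          # placeholder
        prefix       => "2016-06"
        sincedb_path => "/var/lib/logstash/sincedb_s3"   # example path; stores a marker for the last processed object
      }
    }

As far as I can tell this only records where the plugin got to; it doesn't stop it listing and checking every key in the bucket on each run.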