
I've written a script to continuously pull all my S3 bucket logfiles down to my Logstash server, so it can be parsed using the patterns in this pull request. Alas, given the script recreates the logfile from scratch instead of just appending to it, Logstash's file input isn't seeing any new changes. Any ideas?

Script below:

#!/usr/bin/ruby

require 'rubygems'
require 'aws/s3'

# for non-us buckets, we need to change the endpoint
AWS.config(:s3_endpoint => "s3-eu-west-1.amazonaws.com")

# connect to S3
s3 = AWS::S3.new(:access_key_id => S3_ACCESS_KEY, :secret_access_key => S3_SECRET_KEY)

# grab the bucket where the logs are stored
bucket = s3.buckets[BUCKET_NAME]

File.open("/var/log/s3_bucket.log", 'w') do |file|

  # grab all the objects in the bucket, can also use a prefix here and limit what S3 returns
  bucket.objects.with_prefix('staticassets-logs/').each do |log|
    # S3Object#read yields the object body in chunks, not lines
    log.read do |chunk|
      file.write(chunk)
    end
  end
end
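(To be clear about what I mean by recreating vs. appending: an append-only version would just need a different open mode, roughly as sketched below, but every run would still re-download and re-append every object, so that doesn't seem like the right fix either.)

# append-mode sketch: open with 'a' instead of 'w'; everything else unchanged
File.open("/var/log/s3_bucket.log", 'a') do |file|
  bucket.objects.with_prefix('staticassets-logs/').each do |log|
    log.read { |chunk| file.write(chunk) }
  end
end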

Any help? Thanks!

aendra

2 Answers


I ended up changing my script to the following:

#!/bin/bash 
export PATH=$PATH:/bin:/usr/bin
cd /var/log/s3/$S3_BUCKET/
export s3url=s3://$S3_BUCKET/$S3_PREFIX
s3cmd -c /home/logstash/.s3cfg sync --skip-existing $s3url .
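I run the sync from cron; an /etc/cron.d entry along these lines does it (the script path, user and five-minute interval here are just placeholders):

# /etc/cron.d/s3-log-sync (sketch): pull new log objects every 5 minutes
*/5 * * * * logstash /usr/local/bin/s3_log_sync.sh >> /var/log/s3_log_sync.log 2>&1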

I also changed my Logstash config from evaluating a single logfile to globbing the entire /var/log/s3/my_bucket directory:

input {
  file {
    type => "s3-access-log"
    path => "/var/log/s3/$S3_BUCKET/$S3_BUCKET/*"
    sincedb_path => "/dev/null"   # don't persist read positions; files are re-read from the start on restart
    start_position => "beginning"
  }
}
filter {
    if [type] == "s3-access-log" {
        grok {
            patterns_dir => ["/etc/logstash/conf.d/patterns"]
            match => { "message" => "%{S3_ACCESS_LOG}" }
            remove_field => ["message"]
        }
        date {
            match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
            remove_field => ["timestamp"]
        }
    }
}
output {
  elasticsearch { host => "localhost" }
  stdout { codec => rubydebug }
}

Works brilliantly now.

aendra

We use CloudTrail for auditing and use an s3 input with a cloudtrail codec, which automatically pulls the actual log entries out of the top-level CloudTrail object.

For your use case you should also be able to use an s3 input to get the actual log content, and then run your S3 grok filters on anything with that type.

EDIT: Be sure to use the "backup to bucket" option with the s3 input, because it operates on everything in the bucket every time, even though it only pushes the most recent logs through Logstash.
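Untested, but an s3 input along these lines is roughly what I mean (option names vary a bit between plugin versions, and the bucket names and credentials below are placeholders):

input {
  s3 {
    bucket            => "YOUR_BUCKET"
    prefix            => "staticassets-logs/"
    region            => "eu-west-1"
    access_key_id     => "S3_ACCESS_KEY"
    secret_access_key => "S3_SECRET_KEY"
    backup_to_bucket  => "YOUR_BUCKET-processed"
    type              => "s3-access-log"
  }
}

With type set like that, your existing grok and date filters should pick the events up unchanged.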

clly
  • I have another script that grabs the CloudTrail logs and it works great, because it appends to the end of the file instead of recreating it on every cron run. This tends to work better because it seems sincedb gets confused if the file is recreated instead of just modified (unless I'm missing something). Alas, this doesn't really work with S3 because of the different logfile prefix structure... – aendra Oct 30 '14 at 15:43
  • We haven't run into that problem, but maybe it's just a matter of time. We also might not run into it with CloudTrail logs because it's just a new file every 15 minutes. Although it looks like you're just pulling down everything from the staticassets-logs/ prefix (that could just be an example), I would say that multiple s3 inputs would allow you to do this and increase throughput (though it could be a pain). – clly Oct 30 '14 at 16:47