0

My use case is to periodically process S3 access logs (which have those 18 fields) and push them to a table in RDS. I'm using AWS Data Pipeline for this task, running it every day to process the previous day's logs.

I decided to split the task into two activities:

1. Shell Command Activity: process the S3 access logs and create a CSV file.
2. Hive Activity: read data from the CSV file and insert it into the RDS table.
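
For the first activity, the processing I have in mind is roughly the sketch below (minimal Python, with placeholder file names; it assumes the standard 18-field server access log layout, where the timestamp is bracketed and the request URI, referrer, and user agent are quoted):

    import csv
    import re

    # A token is either a [bracketed timestamp], a "quoted string", or a bare field;
    # a standard S3 server access log line then splits into 18 such tokens
    # (bucket owner, bucket, time, remote IP, ..., referrer, user agent, version id).
    TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

    def log_line_to_fields(line):
        return [tok.strip('"') for tok in TOKEN.findall(line)]

    # Placeholder file names: read raw log lines, write one CSV row per line.
    with open("access_logs.txt") as logs, open("previous_day.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for line in logs:
            fields = log_line_to_fields(line)
            if len(fields) >= 18:              # skip malformed lines
                writer.writerow(fields[:18])   # keep the classic 18 fields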

My input S3 bucket has a lot of log files, so the first activity fails with an out-of-memory error while staging. However, I don't want to stage all the logs; staging only the previous day's logs is enough for me. I searched around the internet but didn't find a solution. How do I achieve this? Is my approach a reasonable one, or is there a better solution? Any suggestions would be helpful.
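
For reference, this is the kind of per-day filtering I would like to happen instead of staging everything. A minimal boto3 sketch (the bucket name and log prefix are placeholders; it assumes the default access-log key layout of TargetPrefixYYYY-mm-DD-HH-MM-SS-UniqueString):

    import boto3
    from datetime import datetime, timedelta, timezone

    s3 = boto3.client("s3")

    # Access-log keys start with the target prefix followed by a timestamp,
    # so listing with a date prefix returns only that day's log files.
    yesterday = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y-%m-%d")
    prefix = "logs/" + yesterday               # placeholder target prefix, e.g. logs/2015-07-14

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="yourbucket", Prefix=prefix):
        for obj in page.get("Contents", []):
            print(obj["Key"])                  # only the previous day's log files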

Thanks in advance.

ramya

3 Answers

0

You can define your S3 data node using timestamps. For example, you can set the directory path to

s3://yourbucket/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}

Your log files should have a timestamp in their names (or they could be organized into timestamped directories).

This will only stage the files matching that pattern.
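
For example, a data node defined that way might look like the sketch below (boto3, with placeholder pipeline and object ids; only the S3DataNode object is shown, and minusDays is another of the documented date/time functions if you want the directory for the day before the scheduled run):

    import boto3

    dp = boto3.client("datapipeline")

    dp.put_pipeline_definition(
        pipelineId="df-XXXXXXXXXXXX",          # placeholder pipeline id
        pipelineObjects=[
            {
                "id": "S3LogInput",            # placeholder object id/name
                "name": "S3LogInput",
                "fields": [
                    {"key": "type", "stringValue": "S3DataNode"},
                    # Resolves to the previous day's directory for each scheduled run.
                    {"key": "directoryPath",
                     "stringValue": "s3://yourbucket/"
                                    "#{format(minusDays(@scheduledStartTime, 1), 'YYYY-MM-dd')}"},
                    {"key": "schedule", "refValue": "DailySchedule"},
                ],
            },
        ],
    )

The rest of the pipeline (schedule, activities, resources) would be defined alongside this object.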

  • Here's a list of expressions you can use: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-pipeline-reference-functions-datetime.html – Vinayak Thapliyal Jul 07 '15 at 19:18
  • Thanks for your answer. When I use the expression, it tries to match the exact name, but my log files have a prefix appended before the timestamp. I tried regular expressions and several other approaches, but nothing worked, and I couldn't find anything in the AWS docs. Do you know a way to achieve what I want? – ramya Jul 15 '15 at 07:12
  • I found out that AWS Data Pipeline doesn't support regex processing. When enabling logging for a bucket, is it possible to create daily folders in S3 and have that day's access logs go into them? If that's possible, how do I achieve it? – ramya Jul 27 '15 at 07:26
0

You may be recreating a solution that already exists in Logstash (or, more precisely, the ELK stack).

http://logstash.net/docs/1.4.2/inputs/s3

Logstash can consume S3 files.

Here is a thread on reading access logs from S3:

https://groups.google.com/forum/#!topic/logstash-users/HqHWklNfB9A

We use Splunk (not free), which has the same capabilities through its AWS plugin.

user1452132
0

May I ask why you are pushing the access logs to RDS? ELK might be a great solution for you. You can build it on your own or use ELK as a service from Logz.io (I work for Logz.io).

It lets you easily point at an S3 bucket, have all your logs read from it regularly and ingested into ELK, and view them in preconfigured dashboards.

Tomer Levy
  • I'm pushing to RDS for two reasons: 1) I want to run queries on the stored data and perform some analysis whenever needed, and 2) I want persistent storage. I actually have no idea about ELK; I'll take a look at it. Thanks – ramya Jul 15 '15 at 05:50