
I am setting up an ad tracking system where I need to store and analyse access logs. I am using an image pixel for this purpose. The parameters to be tracked will be sent as HTTP GET parameters, so every call to the pixel carries the fields I need to store and analyse, such as IP, user ID and timestamp.

Which of these workflows would be better?

1. Use Apache logging. Set up a process to gather the logs in a common place (HDFS?) and analyse them there.
2. Store each log entry in a data store (Cassandra?) and analyse it from there.
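With option 1, the tracked parameters end up embedded in the request URL of each access-log line, so a later analysis job has to parse them back out. A minimal sketch in Python, assuming the combined log format and hypothetical parameter names (`userid`, `ts`):

```python
from urllib.parse import urlparse, parse_qs

def parse_pixel_hit(log_line):
    """Extract tracking parameters from one Apache access-log line.

    Assumes the pixel request looks like:
      GET /pixel.gif?userid=42&ts=1379392800 HTTP/1.1
    """
    parts = log_line.split()
    client_ip = parts[0]        # first field is the client IP
    request_path = parts[6]     # URL part of the quoted request
    params = parse_qs(urlparse(request_path).query)
    return {
        "ip": client_ip,
        "userid": params.get("userid", [None])[0],
        "timestamp": params.get("ts", [None])[0],
    }

line = ('203.0.113.9 - - [17/Sep/2013:04:22:01 +0000] '
        '"GET /pixel.gif?userid=42&ts=1379392800 HTTP/1.1" 200 43')
hit = parse_pixel_hit(line)
```

This per-line parsing step is what a MapReduce mapper over the collected logs in HDFS would typically do first.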

It would be good to know the pros and cons of both approaches from someone who has done this before.

Regards,

gokul

2 Answers


I think a combination of Cassandra and HDFS will do the trick. I have done a similar implementation where daily logs are sent to Cassandra, and a MapReduce job analyses those logs and moves them to the HDFS file system at the end of each day. So for a given time range you can get the most recent logs from the Cassandra cluster, and the older archived data from HDFS.

I have explained the architecture in more detail in the following article [1].

[1] - http://sparkletechthoughts.blogspot.com/2012/09/how-distributed-logging-works-in-wso2.html

In this implementation, real-time logs are read from Cassandra and long-term archived logs are read from the HDFS file system.
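The read path of that hybrid scheme can be sketched as a simple router that picks the backing store by age; the one-day cutoff and the store names are assumptions based on the daily archival job described above:

```python
from datetime import datetime, timedelta

# Nightly MapReduce job archives logs older than this to HDFS (assumed cutoff).
ARCHIVE_AFTER = timedelta(days=1)

def store_for(query_time, now):
    """Pick the backing store for a log query at query_time."""
    if now - query_time < ARCHIVE_AFTER:
        return "cassandra"   # still in the hot store
    return "hdfs"            # already moved by the nightly archival job

now = datetime(2013, 9, 17, 4, 22)
recent = store_for(now - timedelta(hours=3), now)
old = store_for(now - timedelta(days=30), now)
```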

poohdedoo
  • This should work. I will be using log analysis for sure, but I am also looking for a combined solution with something near real-time as well. – gokul Sep 17 '13 at 04:22

It depends on what your primary motivation is.

If the motivation is to return from the request as soon as possible, then your best bet is to just log the request, move on, and analyse in the background. If you have many machines behind a load balancer, you might want to set up centralised logging, as we have done and described at How do I set up PHP Logging to go to a remote server?. Once your logs are in one place, you can feed them back into your store of choice. This setup can also be extended to write logs to multiple locations, in case you want to avoid a single point of failure.
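Writing each entry to more than one destination can be as simple as attaching multiple handlers to one logger. A sketch using Python's standard logging module, where two in-memory StringIO streams stand in for two remote log sinks:

```python
import io
import logging

def make_fanout_logger(streams):
    """Build a logger that duplicates every record to all given streams."""
    logger = logging.getLogger("pixel-tracker")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()                 # start from a clean handler list
    for stream in streams:
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger

primary, backup = io.StringIO(), io.StringIO()
log = make_fanout_logger([primary, backup])
log.info("ip=203.0.113.9 userid=42 ts=1379392800")
# Both sinks now hold the same line, so losing one does not lose the entry.
```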

If the motivation is to move the data into your permanent store and process it in real time, then logging is superfluous and you can focus on either data store; @poohdedoo's approach should work fine in that case.

Shreeni