I'm currently processing about 300 GB of log files on a 10-server Hadoop cluster. The data is saved in folders named YYYYMMDD so each day can be accessed quickly.
My problem is that I just found out today that the timestamps in my log files are in DST local time (GMT-0400) instead of UTC as expected. In short, this means that logs/20110926/*.log.lzo actually contains entries from 2011-09-26 04:00 to 2011-09-27 04:00 UTC, and that pretty much ruins any map/reduce done on that data (e.g. generating statistics).
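The conversion itself is the easy part; for illustration, assuming the timestamps look like "Y-m-d H:i:s" and that GMT-0400 means US Eastern daylight time (America/New_York), fixing one in PHP is just:

<?php
// Reinterpret a log timestamp as US Eastern time and convert it to UTC.
// The 'Y-m-d H:i:s' format and the America/New_York zone are assumptions
// for the sake of the example, not necessarily my exact log layout.
$ts = '2011-09-26 00:00:00';
$dt = DateTime::createFromFormat('Y-m-d H:i:s', $ts, new DateTimeZone('America/New_York'));
$dt->setTimezone(new DateTimeZone('UTC'));
echo $dt->format('Y-m-d H:i:s'); // prints 2011-09-26 04:00:00
?>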
Is there a way to write a map/reduce job that re-splits every log file correctly? From what I can tell, there doesn't seem to be a way with streaming to send certain records to output file A and the rest to output file B.
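The closest thing I can think of in streaming is to have the mapper re-key every record by its corrected UTC date, so that records from different days at least get separated by the partitioner. A rough sketch of what that mapper could look like (parse_timestamp() below is a made-up stand-in for however the timestamp actually gets extracted in map-ppi.php):

<?php
// Sketch of a streaming mapper that re-keys each log line by its UTC date.
// parse_timestamp() is a hypothetical helper; here it just assumes the
// line starts with the timestamp, which is an assumption for this sketch.
function parse_timestamp($line) {
    return substr($line, 0, 19); // e.g. '2011-09-26 23:30:00'
}

while (($line = fgets(STDIN)) !== false) {
    $line = rtrim($line, "\n");
    $dt = DateTime::createFromFormat(
        'Y-m-d H:i:s', parse_timestamp($line), new DateTimeZone('America/New_York')
    );
    $dt->setTimezone(new DateTimeZone('UTC'));
    // Emit the corrected UTC day (YYYYMMDD) as the key, the raw line as the value.
    echo $dt->format('Ymd') . "\t" . $line . "\n";
}
?>

But even then, each day's records only end up in whatever part-* files its reducer writes, not in a separate logs/YYYYMMDD/ folder, which is exactly the part I don't know how to do from streaming.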
Here is the command I currently use:
/opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
-D mapred.reduce.tasks=15 -D mapred.output.compress=true \
-D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
-mapper map-ppi.php -reducer reduce-ppi.php \
-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
-file map-ppi.php -file reduce-ppi.php \
-input "logs/20110922/*.lzo" -output "logs-processed/20110922/"
I don't know anything about Java and/or creating custom classes. I did try the code posted at http://blog.aggregateknowledge.com/2011/08/30/custom-inputoutput-formats-in-hadoop-streaming/ (pretty much copy/pasted what was there), but I couldn't get it to work at all. No matter what I tried, I would get a "-outputformat : class not found" error.
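My guess is that the jar with the compiled classes never makes it onto the job's classpath. If that's the cause, I assume the invocation would need something along these lines, with -libjars pointing at the custom jar (the jar path and class name here are placeholders, not the blog's actual ones):

export HADOOP_CLASSPATH=/path/to/customformats.jar
/opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar \
-libjars /path/to/customformats.jar \
-D mapred.reduce.tasks=15 \
-mapper map-ppi.php -reducer reduce-ppi.php \
-file map-ppi.php -file reduce-ppi.php \
-outputformat com.example.DateSplitOutputFormat \
-input "logs/20110922/*.lzo" -output "logs-processed/20110922/"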
Thank you very much for your time and help :).