I'm writing a MapReduce job to mine web server logs. The input comes from text files; the output goes to a MySQL database. The problem is that if a single record fails to insert (for whatever reason, such as data exceeding a column's size), the whole job fails and nothing gets written to the database. Is there a way to have the good records persisted anyway? I guess one option would be to validate the data first, but that couples the client to the database schema more than I'd like.
I'm not posting the code because this isn't really a code issue.
Edit:
Reducer:
@Override
protected void reduce(SkippableLogRecord rec,
        Iterable<NullWritable> values, Context context) {
    // Shorten the path for the log messages below.
    String path = rec.getPath().toString();
    path = path.substring(0, Math.min(path.length(), 100));

    try {
        context.write(new DBRecord(rec), NullWritable.get());

        LOGGER.info("Wrote record {}.", path);
    } catch (IOException | InterruptedException e) {
        LOGGER.error("There was a problem when writing out {}.", path, e);
    }
}
Log:
15/03/01 14:35:06 WARN mapred.LocalJobRunner: job_local279539641_0001
java.lang.Exception: java.io.IOException: Data truncation: Data too long for column 'filename' at row 1
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.io.IOException: Data truncation: Data too long for column 'filename' at row 1
at org.apache.hadoop.mapreduce.lib.db.DBOutputFormat$DBRecordWriter.close(DBOutputFormat.java:103)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/03/01 14:35:06 INFO mapred.LocalJobRunner: reduce > reduce
15/03/01 14:35:07 INFO mapreduce.Job: Job job_local279539641_0001 failed with state FAILED due to: NA
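Edit 2: Judging by the stack trace, the exception is thrown from DBRecordWriter.close(), i.e. after reduce() has already returned; if DBOutputFormat only batches the statements in write() and executes the whole batch on close, that would explain why my try/catch never fires. One workaround I'm considering is clamping string fields to the declared column widths before the record ever reaches the output format. Untested sketch; the column name and size below are made up for illustration:

```java
import java.util.Map;

// Sketch: clamp string fields to known column widths before they reach
// DBOutputFormat, so one oversized value can't fail the whole batch.
// The sizes are hardcoded here for illustration; they could instead be
// loaded once from DatabaseMetaData.getColumns() at job setup, so the
// reducer itself stays decoupled from the schema.
class ColumnClamp {
    private final Map<String, Integer> columnSizes;

    ColumnClamp(Map<String, Integer> columnSizes) {
        this.columnSizes = columnSizes;
    }

    // Truncate value to the declared width of the given column, if known;
    // unknown columns and null values pass through unchanged.
    String clamp(String column, String value) {
        Integer max = columnSizes.get(column);
        if (value == null || max == null || value.length() <= max) {
            return value;
        }
        return value.substring(0, max);
    }
}
```

The reducer could then clamp the offending field (e.g. the one mapped to the 'filename' column) before building the DBRecord, assuming SkippableLogRecord exposes a setter for it.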