
I'm writing a MapReduce job to mine webserver logs. The input comes from text files; the output goes to a MySQL database. The problem is that if one record fails to insert, for whatever reason, such as data exceeding the column size, the whole job fails and nothing gets written to the database. Is there a way to make sure the good records are still persisted? I guess one way would be to validate the data, but that couples the client with the database schema too much for my taste. I'm not posting the code because this is not particularly a code issue.

Edit:

Reducer:

protected void reduce(SkippableLogRecord rec,
        Iterable<NullWritable> values, Context context) {
    // Truncate the path to 100 characters for logging (assumes a static import of Math.min).
    String path = rec.getPath().toString();
    path = path.substring(0, min(path.length(), 100));

    try {
        // DBOutputFormat only queues the record here; the actual batch insert
        // happens later, when the record writer is closed.
        context.write(new DBRecord(rec), NullWritable.get());

        LOGGER.info("Wrote record {}.", path);
    } catch (IOException | InterruptedException e) {
        LOGGER.error("There was a problem when writing out {}.", path, e);
    }
}

Log:

15/03/01 14:35:06 WARN mapred.LocalJobRunner: job_local279539641_0001
java.lang.Exception: java.io.IOException: Data truncation: Data too long for column 'filename' at row 1
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.io.IOException: Data truncation: Data too long for column 'filename' at row 1
    at org.apache.hadoop.mapreduce.lib.db.DBOutputFormat$DBRecordWriter.close(DBOutputFormat.java:103)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
15/03/01 14:35:06 INFO mapred.LocalJobRunner: reduce > reduce
15/03/01 14:35:07 INFO mapreduce.Job: Job job_local279539641_0001 failed with state FAILED due to: NA
  • Hmm. What stops you from using a try catch? – axiom Mar 01 '15 at 17:54
  • @axiom The try-catch needs to be around the code that throws the exception. And that code ain't mine, it's Hadoop. – Abhijit Sarkar Mar 01 '15 at 19:13
  • It is a code issue. The code needs to be defensive to issues like this. Change the code to have a try/catch. – Donald Miner Mar 01 '15 at 20:05
  • OK, either I'm missing something or you guys are not familiar with how this works. I've updated my post with the reducer code that does the writing to the DB and with the failure logs. As you can see, there IS a try-catch. – Abhijit Sarkar Mar 01 '15 at 20:35
  • `but that couples the client with the database schema too much for my taste.` then use a text field, so you don't run into any char limitations. – Thomas Jungblut Mar 01 '15 at 21:18
  • @ThomasJungblut As I said in my question, I'm mining web logs. Where's a text field there? – Abhijit Sarkar Mar 01 '15 at 21:32
  • In your database obviously. – Thomas Jungblut Mar 01 '15 at 21:52
  • @ThomasJungblut Ah, you meant a varchar type DB field, not front end. Yes, that's what I'm doing now to get around the issue. Still, it'd be nice to say, "hey Hadoop, I know 1 record failed to insert, don't bother about it, keep doing your thing for the other ones". – Abhijit Sarkar Mar 01 '15 at 22:19
  • It's about efficiency: if you want high throughput, you need to batch inserts into one statement. If one insert fails, it will fail the entire batch. You could adjust the outputformat/record writer accordingly. – Thomas Jungblut Mar 01 '15 at 22:24
  • I agree. Your answer is much more relevant than the blanket "use try-catch" comments made before. I'd downvote those if I could. I posted an answer myself with some more information for anyone else having the same issue. – Abhijit Sarkar Mar 01 '15 at 22:28

1 Answer


Answering my own question and looking at this SO post, I see that the database write is done in a batch, and on SQLException the transaction is rolled back. That explains my problem. I guess I'll just have to make the DB columns big enough, or validate first. I can also create a custom DBOutputFormat/DBRecordWriter (a rough sketch follows the quoted close() method below), but unless I insert one record at a time, there'll always be a risk of one bad record causing the whole batch to roll back.

public void close(TaskAttemptContext context) throws IOException {
  try {
    LOG.warn("Executing statement:" + statement);

    // The whole batch of queued records is executed and committed together.
    statement.executeBatch();
    connection.commit();
  } catch (SQLException e) {
    // A single bad record fails the batch, and everything is rolled back.
    try {
      connection.rollback();
    } catch (SQLException ex) {
      LOG.warn(StringUtils.stringifyException(ex));
    }
    throw new IOException(e.getMessage());
  } finally {
    try {
      statement.close();
      connection.close();
    } catch (SQLException ex) {
      throw new IOException(ex.getMessage());
    }
  }
}
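
For comparison with the stock close() above, here's a rough sketch of the custom record writer idea: each record gets its own INSERT, and a failed insert is logged and skipped instead of poisoning a batch. The class and variable names are hypothetical (not Hadoop's), and creating the Connection and the prepared INSERT statement, which DBOutputFormat normally does in getRecordWriter, is omitted. As Thomas pointed out in the comments, per-record inserts also give up the throughput advantage of batching.

import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only: names are hypothetical, and the Connection/PreparedStatement setup
// (normally handled by the OutputFormat) is left out.
public class SkippingDBRecordWriter<K extends DBWritable>
        extends RecordWriter<K, NullWritable> {

    private static final Logger LOGGER =
            LoggerFactory.getLogger(SkippingDBRecordWriter.class);

    private final Connection connection;
    private final PreparedStatement statement;

    public SkippingDBRecordWriter(Connection connection, PreparedStatement statement) {
        this.connection = connection;
        this.statement = statement;
    }

    @Override
    public void write(K key, NullWritable value) throws IOException {
        try {
            key.write(statement);      // bind this record's fields to the INSERT
            statement.executeUpdate(); // insert it on its own, not as part of a batch
        } catch (SQLException e) {
            // Skip the bad record instead of failing the task; the rest still get written.
            LOGGER.warn("Skipping record that failed to insert.", e);
        }
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException {
        try {
            // Assumes auto-commit was disabled when the connection was created,
            // as DBOutputFormat does.
            connection.commit();
        } catch (SQLException e) {
            throw new IOException(e.getMessage(), e);
        } finally {
            try {
                statement.close();
                connection.close();
            } catch (SQLException e) {
                throw new IOException(e.getMessage(), e);
            }
        }
    }
}

A small DBOutputFormat subclass could override getRecordWriter to build the connection and statement and return this writer instead of the stock batching one.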