
I'm writing an MR job to mine web server logs. The input comes from text files; the output goes to a MySQL database. The problem is that the job completes successfully but doesn't write anything to the DB. I haven't done MR programming in a while, so it's most likely a bug that I'm not able to find. It isn't the pattern matching (see below); I've unit tested that and it works fine. What am I missing?

Environment: Mac OS X, Oracle JDK 1.8.0_31, hadoop-2.6.0.

Note: the exceptions are logged; I've omitted the handlers for brevity.

SkippableLogRecord:

public class SkippableLogRecord implements WritableComparable<SkippableLogRecord> {
    // fields

    public SkippableLogRecord(Text line) {
        readLine(line.toString());
    }
    private void readLine(String line) {
        Matcher m = PATTERN.matcher(line);

        boolean isMatchFound = m.matches() && m.groupCount() >= 5;

        if (isMatchFound) {
            try {
                jvm = new Text(m.group("jvm"));

                Calendar cal = getInstance();
                cal.setTime(new SimpleDateFormat(DATE_FORMAT).parse(m
                        .group("date")));

                day = new IntWritable(cal.get(DAY_OF_MONTH));
                month = new IntWritable(cal.get(MONTH));
                year = new IntWritable(cal.get(YEAR));

                String p = decode(m.group("path"), UTF_8.name());

                root = new Text(p.substring(1, p.indexOf(FILE_SEPARATOR, 1)));
                filename = new Text(
                        p.substring(p.lastIndexOf(FILE_SEPARATOR) + 1));
                path = new Text(p);

                status = new IntWritable(Integer.parseInt(m.group("status")));
                size = new LongWritable(Long.parseLong(m.group("size")));
            } catch (ParseException | UnsupportedEncodingException e) {
                isMatchFound = false;
            }
        }
    }

    public boolean isSkipped() {
        return jvm == null;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        jvm.readFields(in);
        day.readFields(in);
        // more code
    }
    @Override
    public void write(DataOutput out) throws IOException {
        jvm.write(out);
        day.write(out);
        // more code
    }
    @Override
    public int compareTo(SkippableLogRecord other) {...}
    @Override
    public boolean equals(Object obj) {...}
}

Mapper:

public class LogMapper extends
    Mapper<LongWritable, Text, SkippableLogRecord, NullWritable> {    
    @Override
    protected void map(LongWritable key, Text line, Context context) {
        SkippableLogRecord rec = new SkippableLogRecord(line);

        if (!rec.isSkipped()) {
            try {
                context.write(rec, NullWritable.get());
            } catch (IOException | InterruptedException e) {...}
        }
    }
}

Reducer:

public class LogReducer extends
    Reducer<SkippableLogRecord, NullWritable, DBRecord, NullWritable> {    
    @Override
    protected void reduce(SkippableLogRecord rec,
        Iterable<NullWritable> values, Context context) {
        try {
            context.write(new DBRecord(rec), NullWritable.get());
        } catch (IOException | InterruptedException e) {...}
    }
}

DBRecord:

public class DBRecord implements Writable, DBWritable {
    // fields

    public DBRecord(SkippableLogRecord logRecord) {
        jvm = logRecord.getJvm().toString();
        day = logRecord.getDay().get();
        // more code for rest of the fields
    }

    @Override
    public void readFields(ResultSet rs) throws SQLException {
        jvm = rs.getString("jvm");
        day = rs.getInt("day");
        // more code for rest of the fields
    }

    @Override
    public void write(PreparedStatement ps) throws SQLException {
        ps.setString(1, jvm);
        ps.setInt(2, day);
        // more code for rest of the fields
    }
}

Driver:

public class Driver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", // driver
                "jdbc:mysql://localhost:3306/aac", // db url
                "***", // user name
                "***"); // password

        Job job = Job.getInstance(conf, "log-miner");

        job.setJarByClass(getClass());
        job.setMapperClass(LogMapper.class);
        job.setReducerClass(LogReducer.class);
        job.setMapOutputKeyClass(SkippableLogRecord.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(DBRecord.class);
        job.setOutputValueClass(NullWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(DBOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));

        DBOutputFormat.setOutput(job, "log", // table name
                new String[] { "jvm", "day", "month", "year", "root",
                        "filename", "path", "status", "size" } // table columns
        );

        return job.waitForCompletion(true) ? 0 : 1;
    }
    public static void main(String[] args) throws Exception {
        GenericOptionsParser parser = new GenericOptionsParser(
                new Configuration(), args);

        ToolRunner.run(new Driver(), parser.getRemainingArgs());
    }
}

Job execution log:

15/02/28 02:17:58 INFO mapreduce.Job:  map 100% reduce 100%
15/02/28 02:17:58 INFO mapreduce.Job: Job job_local166084441_0001 completed successfully
15/02/28 02:17:58 INFO mapreduce.Job: Counters: 35
    File System Counters
        FILE: Number of bytes read=37074
        FILE: Number of bytes written=805438
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=476788498
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=11
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Map-Reduce Framework
        Map input records=482230
        Map output records=0
        Map output bytes=0
        Map output materialized bytes=12
        Input split bytes=210
        Combine input records=0
        Combine output records=0
        Reduce input groups=0
        Reduce shuffle bytes=12
        Reduce input records=0
        Reduce output records=0
        Spilled Records=0
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=150
        Total committed heap usage (bytes)=1381498880
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=171283337
    File Output Format Counters 
        Bytes Written=0
Comments:
  • Have you tried without Hadoop? Only use it as last resort if you have a working process that doesn't scale. Get rid of all the `new` calls in your inner loop - also no new `Matcher`. These are very costly. And don't ignore exceptions... most likely, you just fail parsing every single line... – Has QUIT--Anony-Mousse Feb 28 '15 at 17:32
  • @Anony-Mousse As I said, the parsing works because I've a unit test for it. The exceptions are not really ignored, I've not shown those for brevity. And lastly, I want to make the program work first, then worry about scaling. A program that scales beautifully but does nothing is not worth a dime. – Abhijit Sarkar Feb 28 '15 at 17:35
  • 1
    A unit test within mapreduce, or with other data types outside? Apparently, your map yields 0 records! So it must be skipping everything. Also, design for memory right away, instead of rewriting it later again... follow best practises. `Text` for example exists because `String` is too expensive, and `IntWritable` is a _reusable_ `Integer` – Has QUIT--Anony-Mousse Feb 28 '15 at 17:48
  • @Anony-Mousse JUnit tests that send a `Text` to the `SkippableLogRecord` and verify that a match is obtained. Negative tests too. Nothing to do with MR or Hadoop in those tests, except that I use the `Text` datatype. – Abhijit Sarkar Feb 28 '15 at 17:52
  • 1
    Maybe including/not including the newline, for example. Either way, your lines do not match as far as I can tell. – Has QUIT--Anony-Mousse Feb 28 '15 at 17:55
  • You mean the tests find matches but the real program doesn't? OK, let me dig that angle. – Abhijit Sarkar Feb 28 '15 at 18:03
  • @Anony-Mousse Whitespace was the problem. Your comments pointed me in the right direction. There were several other issues that I had to solve, but the code works now. If you write an answer, I'll accept it. Sadly, there's no accepting comments. For anyone interested, here's the final program: https://github.com/abhijitsarkar/java/tree/master/log-miner-hadoop – Abhijit Sarkar Mar 01 '15 at 03:33
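For illustration, here is a minimal sketch of the Matcher reuse suggested in the comments above; the regex, class name and group names are placeholders, not taken from the actual code:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LineParser {
        // Compile once; a Pattern is immutable and safe to share.
        private static final Pattern PATTERN =
                Pattern.compile("(?<status>\\d{3}) (?<size>\\d+)"); // placeholder regex

        // Reuse one Matcher per parser instance instead of allocating a new one per line.
        private final Matcher matcher = PATTERN.matcher("");

        public boolean matches(String line) {
            return matcher.reset(line).matches();
        }
    }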

1 Answer


To answer my own question: the issue was leading whitespace, which caused the matcher to fail. The unit tests didn't cover leading whitespace, but the actual logs had it for some reason.

Another issue with the code posted above is that all the fields of the class were initialized in the readLine method. As @Anony-Mousse mentioned, that's expensive because Hadoop datatypes are designed to be reused. It also caused a bigger problem with serialization and deserialization: when Hadoop tried to reconstruct the class by calling readFields, it threw an NPE because all the fields were null.

I also made other minor improvements using some Java 8 classes and syntax. In the end, even though I got it working, I rewrote the code using Spring Boot, Spring Data JPA and Spring's support for asynchronous processing with @Async.
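For reference, a minimal sketch of what the fix looks like — not the exact final code (that's in the GitHub repo linked in the comments). The field names match the question, PATTERN is assumed to be the same named-group regex as before (shown here with a placeholder), and only a couple of fields are spelled out:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    public class SkippableLogRecord implements WritableComparable<SkippableLogRecord> {
        // Placeholder; assumed to be the same named-group regex as in the original code.
        private static final Pattern PATTERN = Pattern.compile(".*?(?<jvm>\\S+).*");

        // Fields are created once and refilled per record, so readFields()
        // never dereferences null.
        private final Text jvm = new Text();
        private final IntWritable day = new IntWritable();
        // ... remaining fields, all pre-initialized

        private boolean skipped = true;

        // No-arg constructor required by Hadoop for deserialization.
        public SkippableLogRecord() {
        }

        public void parse(Text line) {
            // trim() removes the leading whitespace that made the matcher fail.
            Matcher m = PATTERN.matcher(line.toString().trim());
            skipped = true;

            if (m.matches()) {
                jvm.set(m.group("jvm"));
                // ... populate date, path, status and size the same way as before,
                // calling set() on the existing writables instead of creating new ones
                skipped = false;
            }
        }

        public boolean isSkipped() {
            return skipped;
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            jvm.readFields(in); // safe: the field always exists
            day.readFields(in);
            // ... remaining fields
        }

        @Override
        public void write(DataOutput out) throws IOException {
            jvm.write(out);
            day.write(out);
            // ... remaining fields
        }

        @Override
        public int compareTo(SkippableLogRecord other) {
            return jvm.compareTo(other.jvm); // simplified for the sketch
        }
        // equals()/hashCode() as before
    }

The mapper can then keep a single record instance and call parse(line) on it for each input line, instead of constructing a new record per call.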
