1

I have implemented WritableComparable and want to use it as the mapper value.

public class SenderRecieverPair implements WritableComparable<BinaryComparable> {

    Set<InternetAddress> pair = new TreeSet<InternetAddress>(new Comparator<InternetAddress>() {

        @Override
        public int compare(InternetAddress add1, InternetAddress add2) {
            return add1.getAddress().compareToIgnoreCase(add2.getAddress());
        }

    });

    public SenderRecieverPair() {
        super();
    }

    public SenderRecieverPair(InternetAddress add1, InternetAddress add2) {
        super();
        pair.add(add1);
        pair.add(add1);
    }


    public Set<InternetAddress> getPair() {
        return pair;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        for (Iterator<InternetAddress> iterator = pair.iterator(); iterator.hasNext();) {
            InternetAddress email = (InternetAddress) iterator.next();
            String mailAddress = email.getAddress();
            if(mailAddress == null) {
                mailAddress = "";
            }
            byte[] address = mailAddress.getBytes("UTF-8");
            WritableUtils.writeVInt(out, address.length);
            out.write(address, 0, address.length);
            String displayName = email.getPersonal();
            if(displayName == null) {
                displayName = "";
            }
            byte[] display = displayName.getBytes("UTF-8");
            WritableUtils.writeVInt(out, display.length);
            out.write(display, 0, display.length);
        }

    }

    @Override
    public void readFields(DataInput in) throws IOException {
        for (int i = 0; i < 2; i++) {
            int length = WritableUtils.readVInt(in);
            byte[] container = new byte[length];
            in.readFully(container, 0, length);
            String mailAddress = new String(container, "UTF-8");
            length = WritableUtils.readVInt(in);
            container = new byte[length];
            in.readFully(container, 0, length);
            String displayName = new String(container, "UTF-8");
            InternetAddress address = new InternetAddress(mailAddress, displayName);
            pair.add(address);
        }

    }

    @Override
    public int compareTo(BinaryComparable o) {
        // TODO Auto-generated method stub
        return 0;
    }

}

However, I am getting the error below. Please help me understand and correct it.

2013-07-29 06:49:26,753 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
    2013-07-29 06:49:26,891 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
    2013-07-29 06:49:27,004 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
    2013-07-29 06:49:27,095 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
    2013-07-29 06:49:27,095 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
    2013-07-29 06:49:27,965 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
    2013-07-29 06:49:27,988 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
    2013-07-29 06:49:27,991 WARN org.apache.hadoop.mapred.Child: Error running child
    java.lang.RuntimeException: java.io.EOFException
        at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:128)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:967)
        at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:30)
        at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:83)
        at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1253)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:581)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:648)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.mapred.Child.main(Child.java:262)
    Caused by: java.io.EOFException
        at java.io.DataInputStream.readByte(DataInputStream.java:250)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:299)
        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:320)
        at com.edureka.sumit.enron.datatype.SenderRecieverPair.readFields(SenderRecieverPair.java:68)
        at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:122)
        ... 14 more
    2013-07-29 06:49:27,993 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

Thanks

S Kr

3 Answers

2

Is this on purpose?

public SenderRecieverPair(InternetAddress add1, InternetAddress add2) {
    super();
    pair.add(add1);
    pair.add(add1);
}

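You are adding add1 twice, so the set holds only one element. The write loop therefore emits one address instead of two, while readFields always tries to read two. That would explain the EOFException in your log.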
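A minimal sketch of the effect, using plain strings in place of InternetAddress so that it runs without JavaMail. The class name `DuplicateInSetDemo`, the `withBug` flag, and the sample addresses are made up for illustration:

```java
import java.util.Set;
import java.util.TreeSet;

class DuplicateInSetDemo {

    // Mirrors the constructor in the question: the flag picks between the
    // copy/paste bug (first element added twice) and the intended behaviour.
    static Set<String> buildPair(String add1, String add2, boolean withBug) {
        // Case-insensitive ordering, like the comparator in the question
        Set<String> pair = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        pair.add(add1);
        pair.add(withBug ? add1 : add2);
        return pair;
    }

    public static void main(String[] args) {
        // With the bug the set collapses to one element
        System.out.println(buildPair("a@example.com", "b@example.com", true).size());
        // With the fix it holds two, which is what readFields() expects
        System.out.println(buildPair("a@example.com", "b@example.com", false).size());
    }
}
```

Note that even with the constructor fixed, a mail whose sender and receiver share an address would still collapse to a single element, because a set never stores duplicates.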

1

Couple of observations:

  • If you know that SenderRecieverPair always holds a pair, I wouldn't use a Set; store the two objects explicitly as instance variables. A set lets you inadvertently add extra values, and your write method will write out 0, 1, 2 or more entries depending on the set size, whereas your readFields method explicitly expects 2 in its for loop.
  • Secondly, if you do stick with a set, you should know that Hadoop re-uses the object instance between calls to your map / reduce task. The object reference is the same for each invocation of your map / reduce method; only the underlying contents change via a call to readFields. In your case you don't call pair.clear() at the start of your readFields method, so the set will continue to grow between calls.
  • Finally, use Text objects in your InternetAddress class to store the email address and display name. Serialization is then much simpler, as you can delegate to the object, which in turn delegates to the Text objects:

For example:

public class InternetAddress implements WritableComparable<InternetAddress> {
    protected Text emailAddress = new Text();
    protected Text displayName = new Text();

    // getter and setters for the above two fields
    // ..

    // compareTo method
    // ..

    @Override
    public void write(DataOutput out) throws IOException {
        emailAddress.write(out);
        displayName.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        emailAddress.readFields(in);
        displayName.readFields(in);
    }
}

public class SenderRecieverPair implements WritableComparable<SenderRecieverPair> {
    protected Set<InternetAddress> pair = new TreeSet<InternetAddress>();

    // other methods omitted (including compareTo(SenderRecieverPair))
    // ..

    @Override
    public void write(DataOutput out) throws IOException {
        // readFields always reads exactly two addresses, so refuse to
        // serialize a set of any other size
        if (pair.size() != 2) {
            throw new IOException("Expected two items in pair, found " + pair.size());
        }
        for (InternetAddress address : pair) {
            address.write(out);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        pair.clear();

        // Note: a more efficient method would be to re-use the objects already
        // in the set (which is even easier to do if you don't use a set and
        // just store the two objects as instance variables)

        InternetAddress a1 = new InternetAddress();
        a1.readFields(in);
        pair.add(a1);

        InternetAddress a2 = new InternetAddress();
        a2.readFields(in);
        pair.add(a2);
    }
}
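The first suggestion (two instance variables rather than a set) could be sketched as follows. This is an illustration under assumptions, not the asker's class: `AddressPair` and `copyThroughBytes` are hypothetical names, plain strings stand in for the addresses, and only `java.io` is used so that it runs without Hadoop on the classpath. In a real job the class would implement WritableComparable and could hold Text fields as shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for SenderRecieverPair with two explicit fields
class AddressPair {
    private String sender = "";
    private String receiver = "";

    void set(String sender, String receiver) {
        this.sender = sender;
        this.receiver = receiver;
    }

    String getSender() { return sender; }
    String getReceiver() { return receiver; }

    // Always writes exactly two strings, whatever their values
    public void write(DataOutput out) throws IOException {
        out.writeUTF(sender);
        out.writeUTF(receiver);
    }

    // Overwrites both fields, so a re-used instance never keeps stale state
    public void readFields(DataInput in) throws IOException {
        sender = in.readUTF();
        receiver = in.readUTF();
    }

    // Demo helper: serialize `source` and read the bytes back into `target`
    static void copyThroughBytes(AddressPair source, AddressPair target) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            source.write(new DataOutputStream(buffer));
            target.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        AddressPair first = new AddressPair();
        first.set("a@example.com", "a@example.com"); // same address twice still writes two fields
        AddressPair second = new AddressPair();
        second.set("b@example.com", "c@example.com");

        // Re-use one instance for both records, as Hadoop does
        AddressPair reused = new AddressPair();
        copyThroughBytes(first, reused);
        copyThroughBytes(second, reused);
        System.out.println(reused.getSender() + " -> " + reused.getReceiver());
    }
}
```

Because write and readFields are symmetric by construction, the number of values read can never differ from the number written, and there is nothing to clear between calls.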

Oh, and I don't see hashCode methods. You should definitely override these if you're using the HashPartitioner (the default) and are passing these objects between mappers and reducers.
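A rough sketch of why hashCode matters here, assuming a hypothetical key class `PairKey` with two string fields. The `partitionFor` helper is written to mirror the arithmetic HashPartitioner applies to the key's hash code; it is an illustration, not Hadoop's own code:

```java
import java.util.Objects;

// Hypothetical key class with value-based equals/hashCode
class PairKey {
    private final String sender;
    private final String receiver;

    PairKey(String sender, String receiver) {
        this.sender = sender;
        this.receiver = receiver;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof PairKey)) return false;
        PairKey that = (PairKey) other;
        return sender.equals(that.sender) && receiver.equals(that.receiver);
    }

    // Derived only from the field contents, so equal keys hash alike in every JVM
    @Override
    public int hashCode() {
        return Objects.hash(sender, receiver);
    }

    // Masks off the sign bit, then takes the remainder by the reducer count
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        PairKey fromMapperA = new PairKey("a@example.com", "b@example.com");
        PairKey fromMapperB = new PairKey("a@example.com", "b@example.com");
        // Equal keys built in different places land in the same partition
        System.out.println(partitionFor(fromMapperA, 4) == partitionFor(fromMapperB, 4));
    }
}
```

Without the override, the default Object.hashCode is not based on the field contents, so two equal keys emitted by different map tasks could be routed to different reducers.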

Chris White
  • Hi Chris, it was a great explanation, but I am still getting the error. I am wondering what causes the call to HadoopInternetAddress.readFields() in the mapper: Caused by: java.io.EOFException at java.io.DataInputStream.readByte(DataInputStream.java:250) WritableUtils.readVInt(WritableUtils.java:320) at Text.readFields(Text.java:263) at HadoopInternetAddress.readFields(HadoopInternetAddress.java:41) at com.edureka.sumit.enron.datatype.HadoopSenderRecieverPair.readFields(HadoopSenderRecieverPair.java:57) at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:122) ... 13 more – S Kr Jul 29 '13 at 12:13
  • Ok, can you post a link to a pastebin / git gist etc listing of your complete class, or update your question with the full implementation of your WritableComparable class (and what's your value class, is that custom too?) – Chris White Jul 29 '13 at 23:57
  • Please find the link: http://pastebin.com/N7Cw5ULD Basically there is a set of mails, and I want to count the sender-receiver pairs in that mail dump – S Kr Jul 30 '13 at 05:33
  • And the source for `EnronMailMapper.class`? – Chris White Jul 30 '13 at 10:30
  • And see Amir Pauker's answer, looks like you have a copy / paste error – Chris White Jul 30 '13 at 10:35
  • I am sharing the whole code at http://www.filedropper.com/enron. It's very unfair on my part to expect the solution, but please have a look. Amir's answer helped me correct the logic, but I am facing the same issue. I will change to using 2 variables instead of the set, as you suggested – S Kr Jul 30 '13 at 19:49

0

A java.io.EOFException is thrown when you attempt to read past the end of the input. So I think that because you are looping in the readFields method, you may be reading more than was written, and that may be the reason behind your problem.
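A small sketch of that failure mode, using only `java.io` so that it runs without Hadoop. The class name `EofOnShortInputDemo` and the helper `hitsEof` are made up for illustration; the helper writes a given number of strings and then tries to read back a possibly larger number:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

class EofOnShortInputDemo {

    // Writes `written` strings, then tries to read `expected` strings back.
    // Returns true if the reads ran past the end of the data.
    static boolean hitsEof(int written, int expected) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            for (int i = 0; i < written; i++) {
                out.writeUTF("user" + i + "@example.com");
            }
            out.flush();

            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()));
            try {
                for (int i = 0; i < expected; i++) {
                    in.readUTF();
                }
                return false;
            } catch (EOFException e) {
                // Fewer records were written than the reader asked for
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hitsEof(2, 2)); // reader and writer agree
        System.out.println(hitsEof(1, 2)); // reader expects one record too many
    }
}
```

The loop in readFields is only a problem when its fixed count disagrees with what write actually produced.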

Binary01
  • I am not sure that I am doing that. I am writing 2 internet addresses as byte arrays and reading them back 2 at a time to build my pair. I have double-checked that my code either enters both or enters none. Or is there a better approach to what I am doing? – S Kr Jul 29 '13 at 10:57