5

I was working with ArrayWritable, and at some point I needed to check how Hadoop serializes it. This is what I got after setting job.setNumReduceTasks(0):

0    IntArrayWritable@10f11b8
3    IntArrayWritable@544ec1
6    IntArrayWritable@fe748f
8    IntArrayWritable@1968e23
11    IntArrayWritable@14da8f4
14    IntArrayWritable@18f6235

and this is the test mapper that I was using:

public static class MyMapper extends Mapper<LongWritable, Text, LongWritable, IntArrayWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        int red = Integer.parseInt(value.toString());
        IntWritable[] a = new IntWritable[100];

        for (int i = 0; i < a.length; i++) {
            a[i] = new IntWritable(red + i);
        }

        IntArrayWritable aw = new IntArrayWritable();
        aw.set(a);
        context.write(key, aw);
    }
}

IntArrayWritable is taken from the example given in the ArrayWritable javadoc:

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;

public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable() {
        super(IntWritable.class);
    }
}

I actually checked the Hadoop source code and this makes no sense to me. ArrayWritable should not serialize the class name, and there is no way an array of 100 IntWritables can be serialized into 6-7 hexadecimal digits. The application actually seems to work just fine and the reducer deserializes the right values... What is happening? What am I missing?

– igon
    `IntArrayWritable@10f11b8` seems like aw.toString() to me. can you please post the code where you get `IntArrayWritable@10f11b8`. I guess the problem is you are not getting the "serialized" data but the object's toString method. – frail Oct 27 '11 at 17:12
  • I added the IntArrayWritable class. It should inherit from ArrayWritable the serialization methods, specifically public void write(DataOutput out). I agree that the output seems a toString() but I don't know why this is happening. – igon Oct 28 '11 at 09:57
  • I'm also facing the same issue when working with IntArrayWritable. What's the exact solution for this. – akhil.cs Feb 23 '15 at 08:35

4 Answers

7

You have to override the default toString() method.

It's called by TextOutputFormat to create a human-readable representation of each value.

Try out the following code and see the result:

public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable() {
        super(IntWritable.class);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (String s : super.toStrings())
        {
            sb.append(s).append(" ");
        }
        return sb.toString();
    }
}
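
With this override in place, the output lines should show the array contents instead of the default `ClassName@hashcode` string; for the mapper above, a key of 0 would produce something like `0    0 1 2 ... 99` (the 100 ints separated by spaces).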
– Le Duc Duy
4

The problem is that the output you are getting from your MapReduce job is not the serialized version of that data; it is the data translated into a pretty-printed string.

When you set the number of reducers to zero, your mappers' output gets passed through an output format, which formats your data, typically converting it to a readable string. It does not dump it out serialized as if it were going to be picked up by a reducer.
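
As a rough illustration of the difference (a minimal sketch of my own, assuming the standard ArrayWritable/IntWritable write() implementations), you can serialize the value yourself and compare it with its toString():

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class SerializationVsToString {
    public static void main(String[] args) throws Exception {
        IntWritable[] a = new IntWritable[100];
        for (int i = 0; i < a.length; i++) {
            a[i] = new IntWritable(i);
        }
        IntArrayWritable aw = new IntArrayWritable();
        aw.set(a);

        // What TextOutputFormat prints: the default Object.toString(),
        // e.g. IntArrayWritable@10f11b8
        System.out.println(aw.toString());

        // What a reducer would actually receive: the bytes produced by
        // Writable.write(DataOutput) -- roughly 4 bytes for the array
        // length plus 4 bytes per IntWritable, i.e. around 404 bytes here.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        aw.write(new DataOutputStream(bytes));
        System.out.println("serialized size: " + bytes.size() + " bytes");
    }
}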

– Donald Miner
    I tried to run the computation with no reducer or with the identity reducer and the result is the same. Moreover I'm pretty sure that setting job.setNumReduceTasks(0) should produce the serialization of intermediate values as output. – igon Oct 28 '11 at 10:03
  • "Moreover I'm pretty sure that setting job.setNumReduceTasks(0) should produce the serialization of intermediate values as output." -- I'm very sure it doesn't. It formats it. – Donald Miner Oct 28 '11 at 11:35
  • Yes it does. The identity reducer is going to do the same thing. It is going to format the output as it writes it out. Intermediary steps are going to have it serialized, but you never see that unless you explicitly serialize it and write that out. There is a huge difference between converting something to a string and serializing an object. – Donald Miner Oct 28 '11 at 19:01
  • ok after enormous pain I understood what you were saying.. Is it possible to specify an Input/OutputFormat that actually uses the serialization of the object? I have to perform multiple steps of map reduce so the standard FileOutputFormat is quite inefficient for my needs. – igon Nov 04 '11 at 16:44
3

Have you looked into SequenceFileInputFormat and SequenceFileOutputFormat? You can set those up with:

job.setInputFormatClass(SequenceFileInputFormat.class); 

and

job.setOutputFormatClass(SequenceFileOutputFormat.class);
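
For example, a sketch of a two-job chain (the job names, paths, and the second job's classes are placeholders of mine, and this assumes the Hadoop 2.x Job API):

// Job 1 writes its LongWritable/IntArrayWritable pairs in serialized form
// to a SequenceFile instead of a text file.
Job first = Job.getInstance(conf, "first-pass");
first.setMapperClass(MyMapper.class);
first.setOutputKeyClass(LongWritable.class);
first.setOutputValueClass(IntArrayWritable.class);
first.setOutputFormatClass(SequenceFileOutputFormat.class);
FileInputFormat.addInputPath(first, new Path("input"));
FileOutputFormat.setOutputPath(first, new Path("intermediate"));
first.waitForCompletion(true);

// Job 2 reads the serialized key/value pairs back without any text parsing.
Job second = Job.getInstance(conf, "second-pass");
second.setInputFormatClass(SequenceFileInputFormat.class);
// ... set the second job's mapper/reducer and key/value classes as needed.
FileInputFormat.addInputPath(second, new Path("intermediate"));
FileOutputFormat.setOutputPath(second, new Path("output"));
second.waitForCompletion(true);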
– jmp
0

It's very simple. Hadoop uses the method write(DataOutput out) to write the object in serialized form (see the Hadoop ArrayWritable documentation for more information). When you extend ArrayWritable with IntArrayWritable, your own class inherits these methods from the parent class.
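
A small round-trip sketch of my own (not from the original answer) showing the inherited write()/readFields() pair preserving the values:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        IntArrayWritable original = new IntArrayWritable();
        original.set(new IntWritable[] { new IntWritable(1), new IntWritable(2), new IntWritable(3) });

        // Serialize with the write(DataOutput) inherited from ArrayWritable.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        original.write(new DataOutputStream(out));

        // Deserialize into a fresh instance with the inherited readFields().
        IntArrayWritable copy = new IntArrayWritable();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(out.toByteArray())));

        // Prints the three original values, confirming the round trip.
        for (Writable w : copy.get()) {
            System.out.println(w);
        }
    }
}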

– xetqL