
What are the advantages of using NullWritable for null keys/values over using null Text objects (i.e. new Text(null))? The book «Hadoop: The Definitive Guide» says the following:

NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream. It is used as a placeholder; for example, in MapReduce, a key or a value can be declared as a NullWritable when you don’t need to use that position—it effectively stores a constant empty value. NullWritable can also be useful as a key in SequenceFile when you want to store a list of values, as opposed to key-value pairs. It is an immutable singleton: the instance can be retrieved by calling NullWritable.get()

I do not clearly understand how the output is written when NullWritable is used. Will there be a single constant value at the beginning of the output file indicating that the keys or values of this file are null, so that the MapReduce framework can skip reading the null keys/values (whichever is null)? Also, how are null texts actually serialized?

Thanks,

Venkat

Venk K

3 Answers


The key/value types must be given at runtime, so anything writing or reading NullWritables will know ahead of time that it will be dealing with that type; there is no marker or anything in the file. And technically the NullWritables are "read", it's just that "reading" a NullWritable is actually a no-op. You can see for yourself that there's nothing at all written or read:

NullWritable nw = NullWritable.get();
ByteArrayOutputStream out = new ByteArrayOutputStream();
nw.write(new DataOutputStream(out));
System.out.println(Arrays.toString(out.toByteArray())); // prints "[]"

ByteArrayInputStream in = new ByteArrayInputStream(new byte[0]);
nw.readFields(new DataInputStream(in)); // works just fine

And as for your question about new Text(null), again, you can try it out:

Text text = new Text((String)null); // throws NullPointerException (the constructor encodes the string immediately)
ByteArrayOutputStream out = new ByteArrayOutputStream();
text.write(new DataOutputStream(out)); // not reached
System.out.println(Arrays.toString(out.toByteArray()));

Text will not work at all with a null String.
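
To make the zero-length point concrete without needing Hadoop on the classpath, here is a minimal stdlib-only sketch. Everything in it is a hypothetical stand-in: `MiniWritable` only mirrors the shape of Hadoop's Writable, `NullKey` plays the role of NullWritable, and `ShortText` is a simplified Text that writes a single length byte followed by the UTF-8 bytes (so it only handles strings under 128 bytes). It compares the bytes written for 1000 records with a zero-length key against 1000 records with an empty-string key.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ZeroLengthKeySketch {

    /** Hypothetical stand-in that mirrors the shape of Hadoop's Writable. */
    public interface MiniWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Stand-in for NullWritable: writes no bytes and reads no bytes. */
    public static final class NullKey implements MiniWritable {
        @Override public void write(DataOutput out) { /* nothing to write */ }
        @Override public void readFields(DataInput in) { /* nothing to read */ }
    }

    /** Simplified stand-in for Text: one length byte, then the UTF-8 bytes. */
    public static final class ShortText implements MiniWritable {
        private String value;

        public ShortText(String value) { this.value = value; }

        @Override public void write(DataOutput out) throws IOException {
            byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
            if (utf8.length > 127) {
                throw new IllegalArgumentException("sketch only supports short strings");
            }
            out.writeByte(utf8.length);   // the length prefix is written even for ""
            out.write(utf8);
        }

        @Override public void readFields(DataInput in) throws IOException {
            byte[] utf8 = new byte[in.readByte()];
            in.readFully(utf8);
            value = new String(utf8, StandardCharsets.UTF_8);
        }
    }

    /** Writes the same (key, value) pair `records` times and returns the total byte count. */
    public static int serializedSize(MiniWritable key, MiniWritable value, int records) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < records; i++) {
                key.write(out);
                value.write(out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        MiniWritable value = new ShortText("abc");   // 1 length byte + 3 data bytes
        System.out.println(serializedSize(new NullKey(), value, 1000));      // prints 4000
        System.out.println(serializedSize(new ShortText(""), value, 1000));  // prints 5000
    }
}
```

The reader side needs no marker for the same reason the writer does not: both sides already know the key type, so the reader simply calls `readFields` on a key that consumes nothing.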

Joe K
  • Thanks Joe for your time and reply. Now, I understand how NullWritable works. With regards to null text, I am sorry, I wanted to talk about having keys/values as Text and then doing a context.write(null, value) (assume that the key is text). – Venk K Apr 24 '13 at 18:55
  • That should also throw a NullPointerException. null keys and values do not work. If you really need a null key or value, you should consider some other representation for that, such as an empty string, or -1. – Joe K Apr 24 '13 at 19:14
  • `context.write(null, value)` will actually work for some output formats (TextOutputFormat for example will just output the value without the key and the configured delimiter) – Chris White Apr 24 '13 at 19:55
  • Thanks once again Joe for your time and reply. In the case I described, it actually does not throw a NPE. We have a reducer that derives from Reducer and in the reduce function we have code that is context.write(null, value) where value is some non null Text. When we look at the output files produced from the reducer, we see only the values and no keys. We run another MR after this where in we read the key-value pair as . Perhaps, it works because we read the keys as a LongWritable and the line number gets passed as the key. – Venk K Apr 24 '13 at 20:26
  • @ChrisWhite: I dont see any delimiter Chris. I see only the values. – Venk K Apr 24 '13 at 20:29
  • My mistake, I was thinking the `context.write(null, value)` we were talking about was happening in the mapper. However, I also did not know that it does sometimes work for reducer output. If you're using TextInputFormat for the next job, this will work regardless of the input... the LongWritable key is simply the byte offset, and the Text is the actual text line. – Joe K Apr 24 '13 at 20:31
  • @JoeK: If the context.write(null, value) happens in the mapper, we need to set the num of reduce tasks to 0 right, because there are no keys to reduce on? Also, in reduce, there seems to be no advantages of using NullWritable as compared to using context.write(null, value)? – Venk K Apr 24 '13 at 20:38
  • my comment reads badly, but i meant if you pass a null key, you'll get no key and no delimiter in the text output (similarly if you pass a key and null value, you'll get no delimiter and no value in the output) – Chris White Apr 24 '13 at 21:00
  • @ChrisWhite: Thanks for the clarification. Sorry, I also misread your earlier comment. I can imagine a case where we use NullWritable as the map output key and still have a reducer: the mapper emits a null (NullWritable) key with a non-null value, and a single reducer reads all the values produced by the mapper. I do not think this can be achieved with Text as the map output key unless you put some text like "" or some predefined value into it. In both cases the keys will occupy space, even for "", since the serialized form of an empty Text is not zero length (a length prefix is still written). – Venk K Apr 24 '13 at 21:04
  • You are correct. If all values need to go to one single reducer, standard practice is to use NullWritable since it does not use up any space. In fact, this is probably the most common use of NullWritable. – Joe K Apr 24 '13 at 21:52
  • @JoeK: Thanks for the clarification Joe. And the byte offset you referred, can you please tell me what is that? – Venk K Apr 24 '13 at 23:19
  • When using TextInputFormat, the key you are given is the position in the file (in bytes) of the line that is the value. – Joe K Apr 25 '13 at 18:34
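
The behaviour discussed in these comments can be sketched without Hadoop. The class below is a hypothetical, stdlib-only stand-in, not Hadoop's implementation: `formatLine` mimics the rule that a null key (or null value) drops both that field and the separator from the text output, and `readWithOffsets` mimics a line-oriented input format handing the next job each line as the value, keyed by the line's byte offset in the file. All names here are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class TextRoundTripSketch {

    /** Builds one output line; a null key or value omits that field and the separator. */
    public static String formatLine(String key, String value, String separator) {
        if (key == null && value == null) {
            return "";                          // nothing is written for a fully null record
        }
        StringBuilder line = new StringBuilder();
        if (key != null) {
            line.append(key);
        }
        if (key != null && value != null) {
            line.append(separator);             // separator only when both fields are present
        }
        if (value != null) {
            line.append(value);
        }
        return line.append('\n').toString();
    }

    /** Splits the file into lines and maps each line's starting byte offset to its text. */
    public static Map<Long, String> readWithOffsets(byte[] file) {
        Map<Long, String> records = new LinkedHashMap<>();
        int start = 0;
        for (int i = 0; i < file.length; i++) {
            if (file[i] == '\n') {
                records.put((long) start,
                        new String(file, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        if (start < file.length) {              // last line without a trailing newline
            records.put((long) start,
                    new String(file, start, file.length - start, StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) {
        // First "job": the reducer writes (null, value) pairs, so only the values reach the file.
        String file = formatLine(null, "alpha", "\t")
                    + formatLine(null, "beta", "\t")
                    + formatLine("k", "gamma", "\t");

        // Second "job": each line comes back as the value, keyed by byte offsets 0, 6 and 11.
        readWithOffsets(file.getBytes(StandardCharsets.UTF_8))
                .forEach((offset, line) -> System.out.println(offset + " -> " + line));
    }
}
```

As far as I know, TextOutputFormat's record writer treats a NullWritable key the same way as a null key, which would explain why the two look identical in reducer output written as text.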

I changed the run method, and it succeeded:

@Override
public int run(String[] strings) throws Exception {
    Configuration config = HBaseConfiguration.create();  
    //set job name
    Job job = new Job(config, "Import from file ");
    job.setJarByClass(LogRun.class);
    //set map class
    job.setMapperClass(LogMapper.class);

    //set output format and output table name
    //job.setOutputFormatClass(TableOutputFormat.class);
    //job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "crm_data");
    //job.setOutputKeyClass(ImmutableBytesWritable.class);
    //job.setOutputValueClass(Put.class);

    TableMapReduceUtil.initTableReducerJob("crm_data", null, job);
    job.setNumReduceTasks(0);
    TableMapReduceUtil.addDependencyJars(job);

    FileInputFormat.addInputPath(job, new Path(strings[0]));

    int ret = job.waitForCompletion(true) ? 0 : 1;
    return ret;
}
zwj0571

You can always wrap your string in your own Writable class and write a boolean flag indicating whether the string is present (non-blank):

@Override
public void readFields(DataInput in) throws IOException { 
    ...
    boolean hasWord = in.readBoolean();
    if( hasWord ) {
        word = in.readUTF();
    } else {
        word = null; // clear any value left over if this instance is reused
    }
    ...
}

and

@Override
public void write(DataOutput out) throws IOException {
    ...
    boolean hasWord = StringUtils.isNotBlank(word);
    out.writeBoolean(hasWord);
    if(hasWord) {
        out.writeUTF(word);
    }
    ...
}
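
For a quick check of the flag-based approach, here is a self-contained round trip that uses only the JDK. The class name `OptionalWordSketch` and its helper methods are hypothetical, it does not implement Hadoop's Writable interface (only the two methods have the same shape), and the blank check is a plain-JDK substitute for `StringUtils.isNotBlank`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class OptionalWordSketch {
    private String word;   // may be null or blank

    public OptionalWordSketch(String word) { this.word = word; }

    public String getWord() { return word; }

    /** Writes a presence flag, then the word only if it is non-blank. */
    public void write(DataOutput out) throws IOException {
        boolean hasWord = word != null && !word.trim().isEmpty();
        out.writeBoolean(hasWord);
        if (hasWord) {
            out.writeUTF(word);
        }
    }

    /** Reads the flag first; resets the field when no word follows. */
    public void readFields(DataInput in) throws IOException {
        boolean hasWord = in.readBoolean();
        word = hasWord ? in.readUTF() : null;
    }

    /** Helper for the demo: serializes one instance to a byte array. */
    public static byte[] serialize(OptionalWordSketch w) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Helper for the demo: reads one instance back from a byte array. */
    public static OptionalWordSketch deserialize(byte[] data) {
        OptionalWordSketch w = new OptionalWordSketch("stale");   // simulates a reused instance
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            w.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(serialize(new OptionalWordSketch(null)).length);   // prints 1 (flag only)
        System.out.println(serialize(new OptionalWordSketch("hi")).length);   // prints 5 (flag + 2-byte length + 2 bytes)
        System.out.println(deserialize(serialize(new OptionalWordSketch("hi"))).getWord());  // prints hi
    }
}
```

Note the cost: even an absent word takes one byte for the flag, so this is not a zero-length encoding the way NullWritable is.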
Arthur B