
I have a small Hadoop cluster running version 1.1.2. While running some basic word counts on text files written in German, I noticed that HDFS does not handle special characters like ü, ö, ä, etc. well.

Is there a way to change the character set used by HDFS?

Here are some examples of what I get where an "ö" is expected:

angeh�ren, angeh�rige, angeh�rigen, angeh�riger

eruh

1 Answer


Since you mentioned the word count example, I guessed you were using Text. Text assumes the charset of the underlying content is UTF-8. If your charset is not UTF-8, you need to get the byte[] from Text and convert it yourself.

I'm not sure if you are using the following code (from the Hadoop wiki):

// map() from the Hadoop wiki word count example; `word` (a Text field) and
// `one` (an IntWritable field) are declared in the enclosing Mapper class.
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
}

In this case, you only need to change String line = value.toString(); to String line = new String(value.getBytes(), 0, value.getLength(), "change_to_your_charset");
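For example, here is a minimal sketch of the adjusted map() assuming the German input files are ISO-8859-1 (Latin-1) encoded; replace that charset name with whatever encoding your files actually use:

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Decode the raw line bytes with the file's actual charset instead of
    // relying on Text.toString(), which always assumes UTF-8.
    String line = new String(value.getBytes(), 0, value.getLength(), "ISO-8859-1");
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
}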

By the way, HDFS is irrelevant to the charset: it only stores binary data. A "charset" is a convention for how to interpret the binary data in a text file.
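To illustrate with a standalone snippet (not part of the Hadoop job): the same bytes decode differently depending on the charset you apply, which is exactly where the � characters in your output come from.

import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        // "angehören" encoded as ISO-8859-1 stores 'ö' as the single byte 0xF6.
        byte[] latin1Bytes = "angehören".getBytes(StandardCharsets.ISO_8859_1);

        // Reading those bytes back as UTF-8 yields the replacement character,
        // reproducing the "angeh�ren" output from the question.
        System.out.println(new String(latin1Bytes, StandardCharsets.UTF_8));
        // Reading them with the correct charset restores the umlaut.
        System.out.println(new String(latin1Bytes, StandardCharsets.ISO_8859_1));
    }
}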

zsxwing