
The setting of my problem is as follows:

In a client/server architecture that communicates over web services, the server receives a CSV file from the client. The API hands it to me as an org.apache.commons.fileupload.FileItem.

Allowed codepages for those files are codepage 850 and codepage 1252.

Everything works except for the euro sign (€). With codepage 1252, my code does not handle the euro sign correctly: when I print the text to the console in Eclipse, I see the currency sign ¤ (U+00A4) instead.
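A quick way to narrow this down is to look at which byte actually sits at the euro position in the file and decode it under each candidate charset. The sketch below is only illustrative: the class and method names are made up, and which byte the real file contains is unknown here. It relies on the fact that windows-1252 maps byte 0x80 to € (U+20AC) and byte 0xA4 to ¤ (U+00A4), while codepage 850 maps byte 0xCF to ¤.

```java
import java.nio.charset.Charset;

// Hypothetical helper for inspecting how a single byte decodes under a given charset.
final class EuroByteProbe {

    // Decodes one byte (given as 0..255) with the named charset.
    static String decode(int unsignedByte, String charsetName) {
        byte[] one = { (byte) unsignedByte };
        return new String(one, Charset.forName(charsetName));
    }

    // Reports the decoded character as U+XXXX, so the result does not depend on
    // the console's font or encoding.
    static String describe(int unsignedByte, String charsetName) {
        String s = decode(unsignedByte, charsetName);
        return String.format("0x%02X in %s -> U+%04X", unsignedByte, charsetName, (int) s.charAt(0));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x80, "windows-1252")); // euro sign in cp1252
        System.out.println(describe(0xA4, "windows-1252")); // currency sign in cp1252
        System.out.println(describe(0xCF, "IBM850"));       // currency sign in cp850
    }
}
```

If the byte in the file turns out not to be 0x80, the file is probably not plain codepage 1252 at that position, and the decoding code is not the place to look.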

Currently I use the following code. It is spread over several classes; I've extracted the relevant lines.

byte[] inputData = call.getImportDatei().get();

// the following method works correctly
// it returns Charset.forName("CP850") or Charset.forName("CP1252")
final Charset charset = retrieveCharset(inputData);

char[] stringContents;
final StringBuffer sb = new StringBuffer();

final String s = new String(inputData, charset.name());

// here I see the problem with the euro sign already
// the following code shouldn't be the problem

// here some special characters are converted, but this doesn't affect the problem, so I removed those lines
stringContents = s.toCharArray();
for(final char c : stringContents){
  sb.append(c);
}
final Reader stringReader = new StringReader(sb.toString());


// org.supercsv.io.CsvListReader
CsvListReader reader = new CsvListReader(stringReader, CsvPreference.EXCEL_NORTH_EUROPE_PREFERENCE);
// now this reader is used to read the CSV content...

I tried several approaches:

FileItem.getInputStream()

I used FileItem.getInputStream() to get the byte[] but the result was the same.

FileItem.getString()

When I use FileItem.getString() it works perfectly with codepage 1252: The euro sign is read correctly. I see it when I print it to the console in Eclipse. But with code page 850 many special characters are wrong.

FileItem.getString(String encoding)

So my idea was to use FileItem.getString(String encoding). But every encoding name I passed to select codepage 1252 produced no exception and still gave wrong results.

For example, getString(Charset.forName("CP1252").name()) yields a question mark instead of the euro sign.
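One thing that can be ruled out quickly is the encoding name itself. The sketch below (class and method names are illustrative) shows what the aliases resolve to in the JDK: "CP1252" and "CP850" are aliases for the canonical names windows-1252 and IBM850, so a string obtained from Charset.name() is a valid encoding name.

```java
import java.nio.charset.Charset;

// Hypothetical helper that resolves a charset alias to its canonical JDK name.
final class CharsetNames {

    static String canonicalName(String alias) {
        // Charset.forName throws UnsupportedCharsetException for unknown names,
        // so a normal return means the alias is recognized.
        return Charset.forName(alias).name();
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("CP1252")); // windows-1252
        System.out.println(canonicalName("CP850"));  // IBM850
    }
}
```

Since the name is valid, a question mark in the output more likely comes from a later encoding step (for example, printing to a console that cannot encode the character) than from the call that decodes the upload.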

How do I specify the encoding when I use org.apache.commons.fileupload.FileItem?

Or is this the wrong way?

Thanks for your help in advance!

Steffzilla
  • Uhm, `stringContent.toString()` will definitely not do what you think, since `stringContent` is a `char[]`... – fge Jul 24 '13 at 14:15
  • Also, you tell you see `¤`, but where? Console? Resulting text file or whatever? – fge Jul 24 '13 at 14:19
  • Finally (sorry for the number of questions, but this can't make it into an answer), what is `retrieveCharset()`? – fge Jul 24 '13 at 14:20
  • You are right. I have to correct my post... – Steffzilla Jul 24 '13 at 14:22
  • retrieveCharset() uses the header of the CSV file. The header contains a German character that is encoded differently in codepage 850 and 1252. With this trick I'm able to distinguish 850 and 1252 reliably. It returns java.nio.charset.Charset – Steffzilla Jul 24 '13 at 14:35
  • Did you know that `Charset` has a `.decode()` method? – fge Jul 24 '13 at 14:40
  • Can you add the code of the `retrieveCharset` method? – fge Jul 24 '13 at 14:53
  • Thank you for pointing to `decode()`. I tried this: `final CharBuffer cb = charset.decode(ByteBuffer.wrap(rawContents)); final String s = cb.toString();` But the result was a question mark in the console. – Steffzilla Jul 24 '13 at 15:23
  • OK, so the output is a console then. Is that console "euro-capable"? That is, do you use a font which can actually display the euro sign? – fge Jul 24 '13 at 15:26

1 Answer


I see it when I print it to the console in Eclipse. But with code page 850 many special characters are wrong.

You're being misled by focusing too much on what the Eclipse console shows. The underlying data is correct, but Eclipse presents it wrongly. On Windows, the console is configured by default to use cp1252 to render the characters printed by System.out.println(), so characters that were originally decoded with a different charset will not necessarily be displayed correctly.
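To take the console out of the picture while debugging, you can print the code points of the decoded string instead of the characters themselves. This is only a minimal sketch with made-up names; the output is pure ASCII, so it looks the same whatever encoding or font the console uses.

```java
// Hypothetical debugging helper: renders a string as a list of Unicode code points.
final class CodePointDump {

    static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u20AC is the euro sign; written as an escape so the source file encoding does not matter.
        System.out.println(dump("\u20ACA")); // U+20AC U+0041
    }
}
```

If the dump shows U+20AC where the euro belongs, the string is fine and only the display is off; if it shows U+00A4 or U+003F, the problem is in the data or in the decoding.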

You'd better reconfigure the Eclipse console to use UTF-8 to present those characters. UTF-8 can encode every Unicode character. You can do that by setting the Window > Preferences > General > Workspace > Text File Encoding property to UTF-8.

Then, given that you're apparently using FileItem from Apache Commons FileUpload, you could obtain the FileItem content as a properly decoded Reader in a much simpler way, as follows:

byte[] content = fileItem.get();
Charset charset = retrieveCharset(content); // No idea what you're doing there, but kudos that it's returning the right charset.
Reader reader = new InputStreamReader(new ByteArrayInputStream(content), charset);
// ...
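For trying this out without an upload, here is a self-contained sketch of the same idea. The byte array stands in for fileItem.get(), the charset is passed in directly instead of coming from retrieveCharset(), and the class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decodes raw bytes through an InputStreamReader with an explicit charset.
final class DecodedLines {

    static List<String> readLines(byte[] content, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(content), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // "a;<0x80>\r\nb", where 0x80 is the euro sign in windows-1252.
        byte[] content = { 'a', ';', (byte) 0x80, '\r', '\n', 'b' };
        List<String> lines = readLines(content, Charset.forName("windows-1252"));
        System.out.println(lines.size() + " lines, first line has " + lines.get(0).length() + " chars");
    }
}
```

The same Reader can be handed to a CSV parser in place of the StringReader built from an intermediate String.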

Note that if you intend to write this CSV afterwards to a character-based output other than System.out.println(), such as a FileWriter, don't forget to explicitly set the charset to UTF-8 there as well! You can do that with OutputStreamWriter. Otherwise, the platform default encoding will still be used, which is cp1252 on Windows.
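A minimal sketch of writing with an explicit charset follows. It writes into a ByteArrayOutputStream so the resulting bytes can be inspected; in real code that would typically be a FileOutputStream. The class and method names are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: encodes text through an OutputStreamWriter with an explicit UTF-8 charset.
final class Utf8Output {

    static byte[] encodeUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The euro sign (U+20AC) takes three bytes in UTF-8: E2 82 AC.
        for (byte b : encodeUtf8("\u20AC")) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```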


BalusC
  • BalusC, thank you for your answer! It's a very good point that the Eclipse console is using cp1252 by default. That shouldn't be a problem when the input comes as cp1252 too, but in the case of cp850, that's definitely a problem. Our complete code is saved using cp1252. It would cause too many changes to set the Text File Encoding property to UTF-8. I tried your code, but the result is unfortunately the same: The euro is shown as ? on console. In the UI it isn't shown at all. I don't understand why the euro causes problems, but the Yen sign, which is very close in the codepage, works perfectly – Steffzilla Jul 29 '13 at 13:59
  • In other words, you *still* didn't set the Eclipse text file encoding? Well, then you'll have to live with the problem. Just bite the bullet and take the lessons learnt to not make the same mistake anymore in future projects. – BalusC Jul 29 '13 at 14:24
  • I had a similar problem, where my CSV had foreign characters. They existed fine in the CSV (as observable in a plain text editor, but not in something like Excel, which itself can mar the special chars). But they were marred upon import using InputStreamReader with no charset specified in its optional second constructor parameter. Constructing it instead as BalusC suggested (Reader reader = new InputStreamReader(new ByteArrayInputStream(content), charset);) solved my problem :) – cellepo Jan 22 '14 at 23:11