Read content including the euro sign from a FileItem with codepage 1252

Question

The setting of my problem is as follows:

In a client/server architecture that communicates via web services, the server receives a CSV file from the client. The API gives me an org.apache.commons.fileupload.FileItem.

Allowed codepages for those files are codepage 850 and codepage 1252.

Everything works properly except for the euro sign (€). With codepage 1252 my code doesn't handle the euro sign correctly: instead of it I see the character U+00A4 (¤) when I print it to the console in Eclipse.

Currently I use the following code. It is spread over some classes. I've extracted the lines that are relevant.

byte[] inputData = call.getImportDatei().get();

// the following method works correctly
// it returns Charset.forName("CP850") or Charset.forName("CP1252")
final Charset charset = retrieveCharset(inputData);

char[] stringContents;
final StringBuffer sb = new StringBuffer();

final String s = new String(inputData, charset.name());

// here I see the problem with the euro sign already
// the following code shouldn't be the problem

// here some special characters are converted, but this doesn't affect the problem, so I removed those lines
stringContents = s.toCharArray();
for(final char c : stringContents){
  sb.append(c);
}
final Reader stringReader = new StringReader(sb.toString());


// org.supercsv.io.CsvListReader
CsvListReader reader = new CsvListReader(stringReader, CsvPreference.EXCEL_NORTH_EUROPE_PREFERENCE);
// now this reader is used to read the CSV content...

I tried several approaches:

FileItem.getInputStream()

I used FileItem.getInputStream() to get the byte[] but the result was the same.

FileItem.getString()

When I use FileItem.getString() it works perfectly with codepage 1252: The euro sign is read correctly. I see it when I print it to the console in Eclipse. But with code page 850 many special characters are wrong.

FileItem.getString(String encoding)

So my idea was to use FileItem.getString(String encoding). But every encoding name I passed to make it use codepage 1252 produced no exception, only wrong results.

For example, getString(Charset.forName("CP1252").name()) yields a question mark instead of the euro sign.

How do I specify the encoding when I use org.apache.commons.fileupload.FileItem?

Or is this the wrong way?

Thanks for your help in advance!

Uhm, `stringContent.toString()` will definitely not do what you think, since `stringContent` is a `char[]`... — fge, Jul 24 '13 at 14:15
Also, you say you see `¤`, but where? Console? Resulting text file or whatever? — fge, Jul 24 '13 at 14:19
Finally (sorry for the number of questions, but this can't make it into an answer), what is `retrieveCharset()`? — fge, Jul 24 '13 at 14:20
retrieveCharset() uses the header of the CSV file. The header contains a German character that is encoded differently in codepage 850 and 1252. With this trick I'm able to distinguish 850 and 1252 reliably. It returns java.nio.charset.Charset — Steffzilla, Jul 24 '13 at 14:35
Thank you for pointing to `decode()`. I tried this: `final CharBuffer cb = charset.decode(ByteBuffer.wrap(rawContents)); final String s = cb.toString();` But the result was a question mark in the console. — Steffzilla, Jul 24 '13 at 15:23
OK, so the output is a console then. Is that console "euro-capable"? That is, do you use a font which can actually display the euro sign? — fge, Jul 24 '13 at 15:26

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

I see it when I print it to the console in Eclipse. But with code page 850 many special characters are wrong.

You're being misled by focusing too much on the results presented by the Eclipse console. The underlying data is correct, but Eclipse presents it wrongly. On Windows, the console is by default configured to use cp1252 to present the characters printed by System.out.println(). As a result, characters which were originally decoded with a different charset would not be presented correctly.
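One way to check the decoded data without trusting the console is to print the code points instead of the characters. The following is only a minimal sketch: the class name EuroDecodeCheck and the helper codePoints() are made up for illustration, and the single byte 0x80 stands in for your real file content (0x80 is the euro sign in windows-1252).

```java
import java.nio.charset.Charset;

class EuroDecodeCheck {

    // Hypothetical helper: renders each code point as "U+XXXX", so the
    // result does not depend on what the console is able to display.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample input: in windows-1252 the euro sign is the single byte 0x80.
        byte[] raw = {(byte) 0x80};
        String decoded = new String(raw, Charset.forName("windows-1252"));

        // Prints only ASCII, which any console encoding can show.
        System.out.println(codePoints(decoded)); // U+20AC
    }
}
```

If this prints U+20AC for your data, the decoding step is fine and only the presentation in the console is off.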

You'd better reconfigure the Eclipse console to use UTF-8 to present those characters. UTF-8 can encode every Unicode character. You can do that by setting the Window > Preferences > General > Workspace > Text File Encoding property to UTF-8.

Then, given that you're apparently using FileItem from Apache Commons FileUpload, you could obtain the FileItem content as properly encoded Reader in a much simpler way as follows:

byte[] content = fileItem.get();
Charset charset = retrieveCharset(content); // No idea what you're doing there, but kudos that it's returning the right charset.
Reader reader = new InputStreamReader(new ByteArrayInputStream(content), charset);
// ...
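
To see the approach above work end to end without a real upload, here is a small self-contained sketch. The class FileItemReaderSketch and the method firstLine() are hypothetical names, and the sample bytes are produced in the test code itself rather than coming from a FileItem; in your code the byte[] would come from fileItem.get() and the charset from your own retrieveCharset().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class FileItemReaderSketch {

    // Hypothetical helper: decodes the raw bytes with the given charset
    // and returns the first line of the content.
    static String firstLine(byte[] content, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(content), charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Charset cp1252 = Charset.forName("windows-1252");

        // Sample CSV content containing the euro sign, encoded as cp1252.
        byte[] content = "Preis;12,50 \u20AC\nfoo;bar".getBytes(cp1252);

        String line = firstLine(content, cp1252);

        // Compare against the expected characters instead of eyeballing the console.
        System.out.println(line.equals("Preis;12,50 \u20AC")); // true
    }
}
```

The same Reader can be handed to CsvListReader in place of the StringReader from the question.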

Note that, when you intend to write this CSV afterwards to a character based output stream other than System.out.println(), such as FileWriter, then don't forget to explicitly set the charset to UTF-8 as well! You could do that in OutputStreamWriter. Otherwise, the platform default encoding will still be used, which is cp1252 on Windows.
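
A minimal sketch of writing with an explicit charset could look like the following. The class Utf8WriteSketch and the method encodeUtf8() are made-up names, and a ByteArrayOutputStream is used only so that the result can be inspected; in real code you would presumably wrap a FileOutputStream instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8WriteSketch {

    // Hypothetical helper: writes the text through an OutputStreamWriter
    // with an explicit UTF-8 charset and returns the resulting bytes.
    static byte[] encodeUtf8(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = encodeUtf8("\u20AC");

        // The euro sign takes three bytes in UTF-8: E2 82 AC.
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // E2 82 AC
    }
}
```

Because the charset is passed explicitly, the output is the same regardless of the platform default encoding.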
