I am trying to write a Java app that will run on a Linux server but that will process files generated on legacy Windows machines using cp-1252 as the character set. Is there any way to encode these files as UTF-8 instead of the cp-1252 they are generated as?
-
This question is not answerable as posted... it depends entirely on what is being used to generate these files (and you didn't tell us). If it's Excel 2007, then the answer is no. – theglauber Aug 20 '12 at 21:59
-
However, Java should be able to process these Windows files fine, given the correct encoding parameters. – theglauber Aug 20 '12 at 22:00
-
Thanks @theglauber (+2) - can you explain why Excel 2007 would be a dealbreaker? Also, can you give an example of correct encoding parameters? Thanks again! – IAmYourFaja Aug 20 '12 at 22:02
-
Just speaking from experience and frustration. You can't specify the encoding for a csv file in Excel 2007. In Java, you would use an InputStreamReader with the correct encoding ("Windows-1252") built on top of a FileInputStream. – theglauber Aug 20 '12 at 22:06
-
Thanks @theglauber - can you please see my comment underneath Eric Grunzke's answer. Does your recommendation above solve my problem? – IAmYourFaja Aug 20 '12 at 22:08
-
No, if your problem is with the file name itself, then see [Joni Salonen's answer](http://stackoverflow.com/a/12057138/1118101) below for a suggestion to set the locale for the Java process. – theglauber Aug 21 '12 at 19:11
2 Answers
If the file names as well as the content are a problem, the easiest way to solve the problem is setting the locale on the Linux machine to something based on ISO-8859-1 rather than UTF-8. You can use `locale -a` to list available locales. For example, if you have `en_US.iso88591` you could use:
export LANG=en_US.iso88591
This way Java will use ISO-8859-1 for file names, which is probably good enough. To run the Java program you still have to set the `file.encoding` system property:
java -Dfile.encoding=cp1252 -cp foo.jar:bar.jar blablabla
If no ISO-8859-1 locale is available you can generate one with `localedef`. Installing it requires root access though. In fact, you could generate a locale that uses CP-1252, if it is available on your system. For example:
sudo localedef -f CP1252 -i en_US en_US.cp1252
export LANG=en_US.cp1252
This way Java should use CP1252 by default for all I/O, including file names.
Expanded further here: http://jonisalonen.com/2012/java-and-file-names-with-invalid-characters/
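To check which encodings a running JVM actually picked up from the locale and from `-Dfile.encoding`, a small sketch like the following may help. The class name `ShowEncodings` is made up for illustration, and `sun.jnu.encoding` is an internal, undocumented property that (as an assumption, based on common JDKs) reflects the encoding used for file names; it may be missing on some JVMs, in which case it prints as `null`.

```java
import java.nio.charset.Charset;

class ShowEncodings
{
    // Builds a one-line summary of the encodings the JVM is using.
    static String describe()
    {
        return "default charset = " + Charset.defaultCharset()
            + ", file.encoding = " + System.getProperty("file.encoding")
            // Assumption: internal property, not part of the documented API
            + ", sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding");
    }

    public static void main(String[] args)
    {
        System.out.println(describe());
    }
}
```

Running it once with the old `LANG` and once after the `export LANG=...` line above should show whether the change reached the Java process.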

You can read and write text data in any encoding that you wish. Here's a quick code example:
import java.io.*;
import java.nio.charset.Charset;

public class ConvertEncoding
{
    public static void main(String[] args) throws Exception
    {
        // List all supported encodings
        for (String cs : Charset.availableCharsets().keySet())
            System.out.println(cs);
        File file = new File("SomeWindowsFile.txt");
        StringBuilder builder = new StringBuilder();
        // Construct a reader for a specific encoding
        Reader reader = new InputStreamReader(new FileInputStream(file), "windows-1252");
        // read() returns -1 at end of stream; the cast appends the character
        // itself rather than its numeric code
        int c;
        while ((c = reader.read()) != -1)
        {
            builder.append((char) c);
        }
        reader.close();
        String string = builder.toString();
        // Construct a writer for a specific encoding (this overwrites the same file)
        Writer writer = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
        writer.write(string);
        writer.flush();
        writer.close();
    }
}
If this still 'chokes' on read, see if you can verify that the original encoding is what you think it is. In this case I've specified windows-1252, which is the Java charset name for cp-1252.
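For larger files, or when you would rather not overwrite the original, a streaming variant is one option. This is only a sketch: the class name `Cp1252ToUtf8`, the method `convert`, and the idea of writing to a separate target file are my own assumptions, not something from the question. It uses `Files.newBufferedReader`, which throws an `IOException` on malformed or unmappable input instead of silently replacing it.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class Cp1252ToUtf8
{
    // Copies text from source (assumed to be windows-1252) to target as UTF-8,
    // one buffer at a time so the whole file is never held in memory.
    static void convert(Path source, Path target) throws IOException
    {
        Charset cp1252 = Charset.forName("windows-1252");
        try (BufferedReader in = Files.newBufferedReader(source, cp1252);
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8))
        {
            char[] buffer = new char[8192];
            int n;
            while ((n = in.read(buffer)) != -1)
            {
                out.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException
    {
        if (args.length != 2)
        {
            System.err.println("usage: java Cp1252ToUtf8 <source> <target>");
            return;
        }
        convert(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

The try-with-resources block closes both streams even if decoding fails partway through, so a bad byte leaves no open file handles behind.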

-
Thanks @Eric Grunzke (+1) - a part of the problem is that occasionally the file names themselves (i.e. `SomeWindowsFile.txt`) contain a CP-1252 character that makes the Java `Reader` choke. So the real question is: *how do you read a file whose filename makes Java choke because of an "illegal" character?* Thanks again! – IAmYourFaja Aug 20 '12 at 22:07
-
You'd better hope this is run on Windows, since CP-1252 is more than likely *not* going to be the default text file encoding in other contexts. Better to use `new InputStreamReader(new FileInputStream(file), "windows-1252")` – obataku Aug 20 '12 at 22:31
-
@4herpsand7derpsago how does it make `Reader` choke exactly? Can you demonstrate using an [SSCCE](http://sscce.org)? – obataku Aug 20 '12 at 22:33
-
I updated the code example to show how to force an encoding in the Reader. Veer's question is a good one: I am curious what you mean by "choke" and if this fixes that problem. – Eric Grunzke Aug 21 '12 at 15:19
-
I'm sorry, I misread your comment. You're having problems with unusual characters in the file *name*, not the file *data*. That is trickier. I'd suggest trying Joni's solution of setting -Dfile.encoding=windows-1252. Also, you could try new File("the/parent/dir").list() and see if Java is interpreting the filename in a different way. – Eric Grunzke Aug 21 '12 at 18:06
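A rough sketch of that directory-listing check, in case it helps: it prints each name Java sees together with its code points, plus whether Java can find the file again under that name. The class name `ListFileNames` and the helper `codePoints` are hypothetical. My assumption is that names Java could not decode may show up with U+FFFD in them and report `exists=false`, but that depends on the JVM and the locale.

```java
import java.io.File;

class ListFileNames
{
    // Renders each character of a name as U+XXXX so odd characters stand out.
    static String codePoints(String name)
    {
        StringBuilder sb = new StringBuilder();
        name.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args)
    {
        // Directory to inspect; defaults to the current directory
        File dir = new File(args.length > 0 ? args[0] : ".");
        String[] names = dir.list();
        if (names == null)
        {
            System.err.println("Not a readable directory: " + dir);
            return;
        }
        for (String name : names)
        {
            File f = new File(dir, name);
            System.out.println(name + " -> " + codePoints(name) + " exists=" + f.exists());
        }
    }
}
```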