icu4j: read and write files with different charsets

Question

I am using Java to scan a folder and read the files in it. The folder contains only text files, but they use different charsets: some are in ISO-8859-1 and some are in windows-1252.

I need to read the files and combine them into one single file, so I append the content of each one. Here is my code:

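import java.io.*;
import java.nio.charset.Charset;

// Note: PrintWriter(File) encodes with the platform default charset.
File fiout = new File("single_" + System.currentTimeMillis() + ".csv");
PrintWriter writer = new PrintWriter(fiout);
for (int x = 0; x < all_zipEntries.size(); x++) {
    File fi = (File) all_zipEntries.get(x);
    String zipfilename = fi.getName();

    // getCharset() is my own helper (not shown); it returns the charset name for the file
    String charset = getCharset(fi);
    Charset inputCharset = Charset.forName(charset);

    log.println("Read " + zipfilename + " ... (Charset " + charset + " ... " + inputCharset.toString() + ")");

    // Open the File itself; fi.getName() alone would drop the directory part of the path
    FileInputStream fis = new FileInputStream(fi);
    InputStreamReader isr = new InputStreamReader(fis, inputCharset);
    BufferedReader in = new BufferedReader(isr);
    String row;
    // readLine() returns null at end of stream; ready() only says whether a read would block
    while ((row = in.readLine()) != null) {
        writer.println(row);
    }
    in.close(); // also closes isr and fis
}
writer.close();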

This is my log:

Read 01.csv ... (Charset ISO-8859-1 ... ISO-8859-1)
Read 02.csv ... (Charset ISO-8859-1 ... ISO-8859-1)
Read 03.csv ... (Charset windows-1252 ... windows-1252)
Read 04.csv ... (Charset windows-1252 ... windows-1252)
Read 05.csv ... (Charset windows-1252 ... windows-1252)
Read 06.csv ... (Charset windows-1252 ... windows-1252)
Read 07.csv ... (Charset windows-1252 ... windows-1252)
Read 08.csv ... (Charset windows-1252 ... windows-1252)
Read 09.csv ... (Charset windows-1252 ... windows-1252)

As you can see, the first two files are ISO-8859-1 encoded and the remaining ones are windows-1252.

My default charset is ISO-8859-1. In the result file created by the code above, I have some lines with

Äpfel
Äpfel
Äpfel

and I have lines like

?pfel
?pfel

The latter come from files 3 to 9. It seems that the content was not converted from windows-1252 to ISO-8859-1 correctly, even though I set the charset when reading!
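
For context, here is a minimal sketch of how a `?` can end up in the output. It relies only on documented JDK behavior: `String.getBytes(Charset)` replaces any character the target charset cannot represent with that charset's replacement byte, which is `?` (0x3F) for ISO-8859-1. The class and method names (`UnmappableDemo`, `transcode`) are hypothetical. The byte 0x8E is chosen only as an illustration: older DOS code pages such as CP437 and CP850 store Ä there, but whether files 3 to 9 actually contain that byte is an assumption the question does not confirm.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UnmappableDemo {
    // Decode raw bytes with one charset, then re-encode the text with another.
    static byte[] transcode(byte[] raw, Charset from, Charset to) {
        String text = new String(raw, from);
        // Characters that 'to' cannot represent are replaced, not reported.
        return text.getBytes(to);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // 0xC4 is 'Ä' in both windows-1252 and ISO-8859-1, so it survives unchanged.
        byte[] kept = transcode(new byte[] {(byte) 0xC4}, cp1252, StandardCharsets.ISO_8859_1);
        System.out.printf("0xC4 -> 0x%02X%n", kept[0] & 0xFF);

        // 0x8E decodes to 'Ž' (U+017D) in windows-1252; ISO-8859-1 has no such
        // character, so the encoder substitutes '?' (0x3F).
        byte[] lost = transcode(new byte[] {(byte) 0x8E}, cp1252, StandardCharsets.ISO_8859_1);
        System.out.printf("0x8E -> 0x%02X%n", lost[0] & 0xFF);
    }
}
```

Since Ä has the same byte value in windows-1252 and ISO-8859-1, a `?` in its place suggests the decoded character was something else, so it may be worth inspecting the raw bytes of one of the affected files.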

But, you're writing to 'fiout' with a ```Writer``` that is using the *system default* character encoding. So *how* are you viewing those files? It might be useful to know what that is, so: ```System.out.println(System.getProperty("file.encoding"));``` — g00se, Oct 09 '22 at 12:20
So, what you *should* probably be doing is standardizing by writing to *one* output encoding that can encompass the various encodings that you've been reading. That output encoding should probably be UTF-8, so you need to *write* using that encoding — g00se, Oct 09 '22 at 12:39
I would like to write in ISO-8859-1; all the characters I need are available there. I am viewing the file with Notepad. In the first lines the characters are displayed correctly; in later lines there is a ? instead of an Ä — Mike, Oct 09 '22 at 12:47
Well then you must set that encoding on the output. It would be safer though to use UTF-8 — g00se, Oct 09 '22 at 12:51
With [this](https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/io/OutputStreamWriter.html#%3Cinit%3E(java.io.OutputStream,java.lang.String)) — g00se, Oct 09 '22 at 13:04
I tried this: `OutputStream outputStream = new FileOutputStream(fiout.getName()); OutputStreamWriter writerout = new OutputStreamWriter(outputStream, StandardCharsets.ISO_8859_1);` But it still writes ?pfel instead of Äpfel to the file. — Mike, Oct 09 '22 at 13:10
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/248673/discussion-between-g00se-and-mike). — g00se, Oct 09 '22 at 13:20
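
The suggestion in the comments, to decode each file with its own charset and write everything with one explicit output encoding, could be sketched roughly as follows. This is an illustration under assumptions, not a confirmed fix for the problem: the class `MergeFiles`, the method `merge`, and the `Map<Path, Charset>` parameter are hypothetical stand-ins for the asker's file list and `getCharset()` helper, and UTF-8 is used as the output encoding because the comments recommend it.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class MergeFiles {
    // Appends every input file to 'target', decoding each file with its own
    // charset and encoding the combined output as UTF-8.
    static void merge(Map<Path, Charset> inputs, Path target) throws IOException {
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(Files.newOutputStream(target), StandardCharsets.UTF_8))) {
            for (Map.Entry<Path, Charset> entry : inputs.entrySet()) {
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(Files.newInputStream(entry.getKey()), entry.getValue()))) {
                    String row;
                    // readLine() returns null once the end of the file is reached
                    while ((row = in.readLine()) != null) {
                        out.write(row);
                        out.newLine();
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data: the same word stored in two encodings.
        Path dir = Files.createTempDirectory("merge-demo");
        Path iso = dir.resolve("01.csv");
        Path win = dir.resolve("03.csv");
        Files.write(iso, "\u00C4pfel\n".getBytes(StandardCharsets.ISO_8859_1));
        Files.write(win, "\u00C4pfel\n".getBytes(Charset.forName("windows-1252")));

        // LinkedHashMap keeps the files in insertion order.
        Map<Path, Charset> inputs = new LinkedHashMap<>();
        inputs.put(iso, StandardCharsets.ISO_8859_1);
        inputs.put(win, Charset.forName("windows-1252"));

        Path target = dir.resolve("single.csv");
        merge(inputs, target);
        System.out.println(Files.readAllLines(target, StandardCharsets.UTF_8));
    }
}
```

The merged file then has to be opened as UTF-8 in the viewer. If ISO-8859-1 output is required instead, only the charset passed to `OutputStreamWriter` changes, but any decoded character outside ISO-8859-1 will still be written as `?`.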
