I'm extracting German addresses from a CSV file using the Java CsvReader class. Some of the street names contain German umlauts ("Umlaute") such as ö, ä and ü (example: Sonnige Höhe). Here is the code I use:

try {
    String addressDataCsvFilename = "Tannis_Export.csv";
    CsvReader addressDataCsvFile = new CsvReader(addressDataCsvFilename, ',', Charset.forName("UTF-8") );
/*
    String[] headers = {
            "PLZ",         // C
            "Strasse",     // E
            "Hausnummer",  // F
        };
 */

    // get headers
    addressDataCsvFile.readHeaders();
    while (addressDataCsvFile.readRecord()) {
        // workaround for issue with CSVReader not finding header in first column
        // String partNumber    = priceListCsvFile.get("PART NUMBER");
        String postleitzahl  = addressDataCsvFile.get("PLZ");
        String strassenName  = addressDataCsvFile.get("Strasse");
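        String hausNummer    = addressDataCsvFile.get("Hausnummer");
        // ... (remainder of the loop not shown)
    }
    addressDataCsvFile.close();
} catch (IOException e) {
    e.printStackTrace();
}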

Even though I specify UTF-8 as the charset, CsvReader.readRecord() doesn't read the special German characters correctly: "Sonnige Höhe" becomes "Sonnige H�he". How can I prevent that?

Abra
Robert Bethge
  • When using third-party libraries that are not part of the Java standard library, you should mention which library you use. Have you checked whether the input file is UTF-8, or maybe rather ISO 8859-1 or Windows-1252? – Slevin Jun 21 '23 at 04:57
  • Are you sure that the reader is actually reading `"Sonnige H�he"`, or is it the console that is displaying it that way? How are you testing it? – user16320675 Jun 21 '23 at 05:00
  • @Abra, I'm using https://www.csvreader.com/java_csv/docs/com/csvreader/CsvReader.html – Robert Bethge Jun 21 '23 at 07:10
  • @Slevin, it's a CSV file I got from the customer...probably generated by Excel. – Robert Bethge Jun 21 '23 at 07:11
  • @user16320675 I'm using Eclipse and I stepped into the readRecord() method of the CsvReader class and I see the strings coming out wrong. – Robert Bethge Jun 21 '23 at 07:14
  • @Slevin turns out you were right. If I change UTF-8 to ISO-8859-1, it works! Thanks all for your inputs! – Robert Bethge Jun 21 '23 at 07:21

1 Answer

Changing the charset from UTF-8 to ISO-8859-1 fixes it, which means the file was never UTF-8-encoded in the first place. Here is the modified line:

// DOESN'T WORK: CsvReader  addressDataCsvFile      =   new CsvReader(addressDataCsvFilename, ',', Charset.forName("UTF-8") );
CsvReader   addressDataCsvFile      =   new CsvReader(addressDataCsvFilename, ',', Charset.forName("ISO-8859-1") );
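
Rather than switching charsets by trial and error, the file's bytes can be checked first. The following is only a minimal sketch using the standard `java.nio.charset` API: `CharsetGuess`, `isValidUtf8` and `guessCharset` are hypothetical names of my own, not part of CsvReader, and the fallback to ISO-8859-1 is an assumption. A file exported by Excel on Windows may really be Windows-1252, which differs from ISO-8859-1 only in the byte range 0x80 to 0x9F (the euro sign, for example).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetGuess {

    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
    public static Charset guessCharset(byte[] bytes) {
        return isValidUtf8(bytes) ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        // \u00f6 is "ö"; the escape keeps the example independent of the source file encoding.
        String street = "Sonnige H\u00f6he";
        System.out.println(guessCharset(street.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(guessCharset(street.getBytes(StandardCharsets.UTF_8)));
    }
}
```

In practice the bytes would come from something like `Files.readAllBytes(Paths.get(addressDataCsvFilename))`, and the resulting charset would be passed to the `CsvReader` constructor. Note that pure-ASCII input is valid UTF-8 too, so the check only tells the encodings apart once a non-ASCII character such as "ö" appears.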
        
Robert Bethge
  • That means that your input is actually in ISO-8859-1 encoding. "UTF-8" is not magic: it's only correct to use when your input is actually in that encoding. While it's a good idea to [switch to UTF-8 wherever possible](https://utf8everywhere.org), that doesn't mean that it'll magically read things that are in different encodings correctly. – Joachim Sauer Jun 21 '23 at 07:26