I have an XML file whose encoding I want to determine programmatically. The file is on the device, and `dataFilePath` contains its path. The encoding declaration in the file says UTF-8, but when I check the encoding in Notepad++ it reports ANSI. Here's what I have tried:

    String encoding;
    FileReader reader1 = null;
    XMLStreamReader xmlStreamReader = null;
    reader1 = new FileReader(dataFilePath);
    xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader(reader1);
    encoding = xmlStreamReader.getEncoding();
    InputStream inputStream = new FileInputStream(dataFilePath);
    Reader reader =
            new InputStreamReader(inputStream, Charset.forName(encoding));

`encoding` is always `null`, and an exception is raised.

Chirag
  • Check the following solution; it might help you: https://stackoverflow.com/a/35794175/14882929 – Tech-leo Feb 08 '21 at 06:47
  • You could check the Apache Tika `org.apache.tika.detect` package. The `MagicDetector` class (https://tika.apache.org/1.16/api/index.html?org/apache/tika/detect/MagicDetector.html) could be of some use. – Ironluca Feb 08 '21 at 07:10
  • A `FileReader` does already imply a byte to char conversion using the system’s default charset. Just don’t use a Reader at all, pass the `FileInputStream` to the XML parser. Then, when you assume the XML parser to guess the encoding correctly, there is no point in using it to create another reader, as the parser will already use the determined encoding when being initialized with an `InputStream`. Besides that, don’t initialize variables with `null`, it makes the code less readable, for no benefit. – Holger Feb 08 '21 at 09:17
  • @Holger both return wrong values or null – Chirag Feb 08 '21 at 10:43
  • Don’t know what you mean with “both”. I see only one approach, asking the XMLStreamReader. It’s not clear what you expected. You say, the XML contains a declaration saying it’s UTF-8, so, of course, the XMLStreamReader will say that the file is encoded in UTF-8. That’s not wrong, that’s how it is supposed to be. The error is on the other side. – Holger Feb 08 '21 at 11:30
  • @Holger my mistake, sorry. The thing is, the XML generator is hard-coding the UTF-8 value into the XML even though the actual encoding is different, so I need to work around that. – Chirag Feb 09 '21 at 05:53
  • It would perhaps be easier to make the XML generator write actual UTF-8 to match the declaration. – Holger Feb 09 '21 at 07:26
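
The workaround discussed in the comments above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a confirmed fix: it hands the parser raw bytes instead of a `Reader` (as Holger suggests) to read the *declared* encoding, then checks whether the bytes really decode as strict UTF-8 and otherwise falls back to a caller-supplied charset. The class and method names (`XmlEncodingSketch`, `declaredEncoding`, `isValidUtf8`, `chooseCharset`) are hypothetical, and treating "ANSI" as windows-1252 is an assumption; Notepad++ typically uses that label for the system code page, which may differ from machine to machine.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
class XmlEncodingSketch {

    /** Returns the encoding named in the XML declaration, or null if none is declared. */
    static String declaredEncoding(byte[] xmlBytes) throws XMLStreamException {
        // Give the parser an InputStream, not a Reader, so that no
        // byte-to-char conversion happens before the parser sees the data.
        InputStream in = new ByteArrayInputStream(xmlBytes);
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            return xml.getCharacterEncodingScheme();
        } finally {
            xml.close();
        }
    }

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Picks UTF-8 when the bytes are valid UTF-8, otherwise the given fallback. */
    static Charset chooseCharset(byte[] bytes, Charset fallback) {
        return isValidUtf8(bytes) ? StandardCharsets.UTF_8 : fallback;
    }

    public static void main(String[] args) throws Exception {
        // ASCII-only document: shows what the file claims to be.
        byte[] header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println("declared: " + declaredEncoding(header));

        // 0xE9 is 'é' in windows-1252 but an incomplete sequence in UTF-8.
        byte[] body = {'<', 'a', '>', (byte) 0xE9, '<', '/', 'a', '>'};
        Charset ansi = Charset.forName("windows-1252"); // assumption: the real "ANSI" code page
        System.out.println("valid UTF-8: " + isValidUtf8(body));
        System.out.println("chosen: " + chooseCharset(body, ansi));
    }
}
```

With a real file, the bytes could come from `Files.readAllBytes(Paths.get(dataFilePath))`, and the chosen charset could then be passed to `new InputStreamReader(...)`. A file that contains only ASCII bytes passes the UTF-8 check, which is harmless because such bytes decode identically in both charsets.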

0 Answers