12

I am parsing XML using DocumentBuilder.

The XML has this as its first line:

<?xml version="1.0" encoding="GBK"?>

I want to read the encoding of the XML and reuse it. How can I get "GBK"?

I will be generating another XML file, and I want encoding="GBK" to be retained in it.

Currently the encoding is lost and the output falls back to the default, UTF-8.

There are many XML files with different encodings, so I need to read the encoding of each source file.

Lii
user1228785

4 Answers

9

One way to do this:

final XMLStreamReader xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader( new FileReader( testFile ) );

//running on MS Windows fileEncoding is "CP1251"
String fileEncoding = xmlStreamReader.getEncoding(); 

//the XML declares UTF-8 so encodingFromXMLDeclaration is "UTF-8"
String encodingFromXMLDeclaration = xmlStreamReader.getCharacterEncodingScheme(); 
  • On my MS Windows machine, `getEncoding()` *always* returns `null`. `getCharacterEncodingScheme()` only returns the declared encoding if the file does *not* have a UTF-8 byte order mark; otherwise it also returns `null`. – Matthias Ronge Mar 04 '16 at 09:24
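
One possible reason for the differing results is that a `FileReader` decodes the bytes with the platform default charset before the parser ever sees them. The following is a minimal sketch, assuming the StAX implementation bundled with the JDK, that hands the parser the raw bytes instead. The names `DeclaredEncodingSketch` and `declaredEncoding` are made up for illustration, and other StAX implementations may behave differently.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class DeclaredEncodingSketch {

    /** Returns the encoding named in the XML declaration, or null if none is declared. */
    public static String declaredEncoding(InputStream in) throws XMLStreamException {
        // Passing an InputStream lets the parser do its own encoding detection;
        // a Reader would already have decoded the bytes with some charset.
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            return reader.getCharacterEncodingScheme();
        } finally {
            reader.close(); // frees parser resources; does not close the stream
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        // Hypothetical sample document; ISO-8859-1 is used because every JVM supports it.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(new ByteArrayInputStream(xml)));
    }
}
```

For a file on disk, the same method can be called with a `FileInputStream` in place of the `ByteArrayInputStream`.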
2

This one works for various encodings, taking into account both the BOM and the XML declaration. It defaults to UTF-8 if neither applies:

String encoding;
FileReader reader = null;
XMLStreamReader xmlStreamReader = null;
try {
    InputSource is = new InputSource(file.toURI().toASCIIString());
    XMLInputSource xis = new XMLInputSource(is.getPublicId(), is.getSystemId(), null);
    xis.setByteStream(is.getByteStream());
    PropertyManager pm = new PropertyManager(PropertyManager.CONTEXT_READER);
    for (Field field : PropertyManager.class.getDeclaredFields()) {
        if (field.getName().equals("supportedProps")) {
            field.setAccessible(true);
            ((HashMap<String, Object>) field.get(pm)).put(
                    Constants.XERCES_PROPERTY_PREFIX + Constants.ERROR_REPORTER_PROPERTY,
                    new XMLErrorReporter());
            break;
        }
    }
    encoding = new XMLEntityManager(pm).setupCurrentEntity("[xml]".intern(), xis, false, true);
    // Compare by value; != on strings only compares references
    if (!"UTF-8".equals(encoding)) {
        return encoding;
    }

    // From @matthias-heinrich’s answer:
    reader = new FileReader(file);
    xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader(reader);
    encoding = xmlStreamReader.getCharacterEncodingScheme();

    if (encoding == null) {
        encoding = "UTF-8";
    }
} catch (RuntimeException e) {
    throw e;
} catch (Exception e) {
    throw new UndeclaredThrowableException(e);
} finally {
    if (xmlStreamReader != null) {
        try {
            xmlStreamReader.close();
        } catch (XMLStreamException e) {
            // ignored: nothing useful can be done if closing fails
        }
    }
    if (reader != null) {
        try {
            reader.close();
        } catch (IOException e) {
            // ignored: nothing useful can be done if closing fails
        }
    }
}
return encoding;

Tested on Java 6 with:

  • UTF-8 XML file with BOM, with XML declaration ✓
  • UTF-8 XML file without BOM, with XML declaration ✓
  • UTF-8 XML file with BOM, without XML declaration ✓
  • UTF-8 XML file without BOM, without XML declaration ✓
  • ISO-8859-1 XML file (no BOM), with XML declaration ✓
  • UTF-16LE XML file with BOM, without XML declaration ✓
  • UTF-16BE XML file with BOM, without XML declaration ✓
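
The BOM part of that detection can be illustrated without any parser internals. The following is a simplified sketch, not the Xerces logic used above: it only recognizes the three byte order marks from the test list, does not handle UTF-32, and the names `BomSniffer` and `encodingFromBom` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class BomSniffer {

    /** Returns the encoding implied by a leading byte order mark, or null if there is none. */
    public static String encodingFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        return null; // no BOM: fall back to the XML declaration, then to UTF-8
    }

    /** Checks whether data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        byte[] withoutBom = "<a/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(encodingFromBom(withBom));
        System.out.println(encodingFromBom(withoutBom));
    }
}
```

Only the first few bytes of the file are needed, so the caller can read a small prefix rather than the whole document.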

Standing on the shoulders of these giants:

import java.io.*;
import java.lang.reflect.*;
import java.util.*;
import javax.xml.stream.*;
import org.xml.sax.*;
import com.sun.org.apache.xerces.internal.impl.*;
import com.sun.org.apache.xerces.internal.xni.parser.*;
Matthias Ronge
1

Use javax.xml.stream.XMLStreamReader to parse your file; you can then call getEncoding() on it.
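
To tie this back to the question's goal of retaining the encoding in a new file, here is a rough sketch that reads the declared encoding with an `XMLStreamReader`, parses the document with a `DocumentBuilder`, and serializes it with the same encoding via `OutputKeys.ENCODING`. It uses `getCharacterEncodingScheme()` rather than `getEncoding()` because the former is the accessor for the declared encoding. The class and method names are invented for illustration, and the UTF-8 fallback is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

final class RetainEncodingSketch {

    /** Parses sourceXml and serializes it again using the encoding it declares. */
    public static byte[] rewriteWithSourceEncoding(byte[] sourceXml) throws Exception {
        // Step 1: ask StAX for the encoding named in the XML declaration.
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(sourceXml));
        String encoding = reader.getCharacterEncodingScheme();
        reader.close();
        if (encoding == null) {
            encoding = "UTF-8"; // assumption: fall back to UTF-8 when nothing is declared
        }

        // Step 2: parse with DocumentBuilder, as in the question.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(sourceXml));

        // Step 3: write the document back out with the same encoding.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document declared as ISO-8859-1.
        byte[] source = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root>text</root>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] result = rewriteWithSourceEncoding(source);
        System.out.println(new String(result, StandardCharsets.ISO_8859_1));
    }
}
```

The sketch reads the input twice for simplicity; with files, that means opening the source once for detection and once for parsing.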

Chandra Sekhar
Emmanuel Bourg
0

Using Apache Commons IO:

new XmlStreamReader(data).getEncoding()
Lii