
I have a Java application that parses an XML file encoded in UTF-16LE. Parsing has been failing because the file contains illegal XML characters. My solution is to read the file into a Java string, remove the illegal characters, and then parse the result. It works 99% of the time, but there are slight differences between the input and the output of this process. I think they are caused not by removing the illegal characters but by going from the UTF-16LE encoding to Java's UTF-16 string.

    BufferedReader reader = null;
    String fileText = ""; //stored as UTF-16
    try {
        reader = new BufferedReader(new InputStreamReader(in, "UTF-16LE"));
        for (String line; (line = reader.readLine()) != null; ) {
            fileText += line;
        }
    } catch (Exception ex) {
        logger.log(Level.WARNING, "Error removing illegal xml characters", ex);
    } finally {
        if (reader != null) {
            reader.close();
        }
    }

//code to remove illegal chars from string here, irrelevant to problem 

    ByteArrayInputStream inStream = new ByteArrayInputStream(fileText.getBytes("UTF-16LE"));
    Document doc = XmlUtil.openDocument(inStream, XML_ROOT_NODE_ELEM);

Do characters get changed or lost when going from UTF-16LE to UTF-16? Is there a way to do this in Java that ensures the output is exactly the same as the input?

Yep
  • What exactly are "illegal XML characters" according to you? Can you give an example? Your code looks much more complicated than necessary; why read everything into a string first and then read again from a `ByteArrayInputStream`? – Jesper Nov 16 '17 at 14:44
  • Characters not allowed in XML 1.0. I am getting a SAX parser error: character 0x1 is not allowed. – Yep Nov 16 '17 at 15:05

1 Answer

One definite problem is that `readLine` throws away the line ending.

You would need to do something like:

       fileText += line + "\r\n";

Otherwise XML attributes, DTD entities, or other tokens could get glued together where at least one whitespace character was required. You also do not want text content to be altered where it contains a line break.
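To make the gluing effect concrete, here is a minimal sketch that joins lines the same way as the loop in the question. The class and method names (`ReadLineDemo`, `joinWithoutSeparators`) and the sample XML are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class ReadLineDemo {
    // Concatenates lines without re-adding a separator, as the question's loop does.
    // readLine() returns each line with its terminator already removed.
    static String joinWithoutSeparators(String text) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            for (String line; (line = reader.readLine()) != null; ) {
                sb.append(line);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // The line break is the only whitespace between the element name and the attribute.
        String xml = "<item\r\nid=\"1\"/>";
        System.out.println(joinWithoutSeparators(xml)); // prints <itemid="1"/>
    }
}
```

The result is no longer the same element: the name and the attribute have merged into one token.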

Performance (speed and memory) can be improved by using a `StringBuilder`:

StringBuilder fileText = new StringBuilder();
... fileText.append(line).append("\n");
... fileText.toString();
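If the goal is output that matches the input exactly, one alternative (not what the snippet above does) is to skip `readLine` entirely and copy characters through a buffer, so that whatever line terminators the file used are kept as they are. The following is only a sketch: `ExactRead` and `readAll` are hypothetical names, and it assumes the input is well-formed UTF-16LE, since malformed sequences would be replaced by the decoder.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ExactRead {
    // Decodes the whole stream as UTF-16LE without touching line terminators.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_16LE)) {
            char[] buf = new char[8192];
            for (int n; (n = reader.read(buf)) != -1; ) {
                sb.append(buf, 0, n); // append only the chars actually read
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample data with mixed line endings (made up for this example).
        byte[] original = "<a>\r\nline one\nline two\r</a>".getBytes(StandardCharsets.UTF_16LE);
        String text = readAll(new ByteArrayInputStream(original));
        byte[] roundTripped = text.getBytes(StandardCharsets.UTF_16LE);
        System.out.println(Arrays.equals(original, roundTripped)); // true
    }
}
```

With this approach there is no need to guess whether the file used `\r\n` or `\n`.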

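There may also be a problem with the first character of the file: a BOM (byte order mark, U+FEFF) is sometimes redundantly present there. Java's UTF-16LE decoder does not strip it, so it ends up as the first character of the decoded text.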

line = line.replace("\uFEFF", "");
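A slightly narrower variant, sketched here under the assumption that only a BOM at the very start of the text should go, leaves any later U+FEFF (a zero-width no-break space) untouched. `BomUtil` and `stripLeadingBom` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class BomUtil {
    // Removes a single leading U+FEFF, if present; the rest of the text is unchanged.
    static String stripLeadingBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // Bytes FF FE are the little-endian BOM, followed by '<' encoded as 3C 00.
        byte[] bytes = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};
        String decoded = new String(bytes, StandardCharsets.UTF_16LE);
        System.out.println(decoded.length());                  // 2: the BOM was kept by the decoder
        System.out.println(stripLeadingBom(decoded).length()); // 1: only '<' remains
    }
}
```

Note that `getBytes("UTF-16LE")` does not write a BOM, so after stripping, the re-encoded bytes will differ from the original file by those two leading bytes.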
Joop Eggen