
My XML request contains a tag like this:

 <BATCHNAME>&#4; Any</BATCHNAME>

Its value includes the character reference '&#4;'. Without such references my code works perfectly, but some requests do contain them, and parsing then fails with the following error:

[Fatal Error] :144:28: Character reference "&#
org.xml.sax.SAXParseException; lineNumber: 144; columnNumber: 28; Character reference "&#
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at d.b(AllCommonTasks.java:277)
    at...

I need the parser to accept these characters.

This is the code I am using:

    try {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();

        URLConnection urlConnection = new URL(urlString).openConnection();
        urlConnection.addRequestProperty("Accept", "application/xml");
        urlConnection.addRequestProperty("User-Agent", "Mozilla/5.0 ( compatible ) ");
        Document doc = db.parse(urlConnection.getInputStream());
        doc.getDocumentElement().normalize();

        str = convertDocumentToString(doc);
    } catch (Exception e) {
        System.err.println("In exception 1");
        e.printStackTrace();
    }

How can I solve this?

kleopatra
TejpalBh
  • `&#34;` would have been `"`. If it is really `&#4;`, read the XML as text and patch the XML version to 1.1, which permits these control characters that are forbidden in 1.0. – Joop Eggen Mar 25 '19 at 11:49

1 Answer


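According to the Wikipedia page on XML and HTML character entity references, references of the form &#nnnn; are numeric character references: nnnn is a Unicode code point in decimal. &#4; therefore denotes U+0004 END OF TRANSMISSION, a non-printing control character.

XML 1.0 allows only tab (U+0009), line feed (U+000A), and carriage return (U+000D) from the C0 control range, and that restriction applies to character references as well as to literal characters. So the parser is right to fail in this case.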

In fact if you look at the source of com.sun.org.apache.xerces.internal.impl.XMLScanner#scanCharReferenceValue, you can see that it references com.sun.org.apache.xerces.internal.util.XMLChar#isValid here:

/**
 * Returns true if the specified character is valid. This method
 * also checks the surrogate character range from 0x10000 to 0x10FFFF.
 * <p>
 * If the program chooses to apply the mask directly to the
 * <code>CHARS</code> array, then they are responsible for checking
 * the surrogate character range.
 *
 * @param c The character to check.
 */
public static boolean isValid(int c) {
    return (c < 0x10000 && (CHARS[c] & MASK_VALID) != 0) ||
           (0x10000 <= c && c <= 0x10FFFF);
} // isValid(int):boolean
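
For illustration only, the same rule can be written as a plain range check taken from the XML 1.0 Char production. This is a rough sketch, not the Xerces implementation (which uses the CHARS lookup table shown above); the class and method names are made up for the example.

```java
/** Sketch of the XML 1.0 "Char" production as a plain range check (hypothetical helper). */
final class Xml10Chars {
    private Xml10Chars() {}

    /** Returns true if the code point is allowed in an XML 1.0 document. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD   // tab, line feed, carriage return
            || (cp >= 0x20 && cp <= 0xD7FF)          // most of the BMP, surrogates excluded
            || (cp >= 0xE000 && cp <= 0xFFFD)        // rest of the BMP, minus U+FFFE and U+FFFF
            || (cp >= 0x10000 && cp <= 0x10FFFF);    // supplementary planes
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(0x4));  // false: U+0004 is what &#4; refers to
        System.out.println(isXml10Char('A'));  // true
    }
}
```

A check like this could be used to inspect a value before it is written into a request, so that the problem is caught where the XML is produced and not in the parser.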
paltaie
  • Yes, these characters are not valid XML. But I need them in the XML in some cases; is there any way to allow them? – TejpalBh Mar 25 '19 at 10:58
  • One way is to move to XML 1.1, which allows U+0001 through U+001F when they are written as character references. I'm not sure whether you have control over the incoming XML document, though. See https://en.wikipedia.org/wiki/Valid_characters_in_XML#XML_1.1 – paltaie Mar 25 '19 at 11:06
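
A minimal sketch of the XML 1.1 route suggested in the comments, assuming the response has already been read into a String with the correct charset and that the JDK's built-in parser is in use. The class and method names are hypothetical. It replaces whatever XML declaration is present with a version="1.1" declaration and then parses the text.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical helper: re-labels a document as XML 1.1 before parsing it. */
final class Xml11Workaround {
    // Matches a leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
    private static final Pattern XML_DECL = Pattern.compile("^\\s*<\\?xml\\s[^>]*\\?>");

    private Xml11Workaround() {}

    /** Drops any existing declaration and prepends an XML 1.1 declaration. */
    static String forceVersion11(String xml) {
        if (xml.startsWith("\uFEFF")) {
            xml = xml.substring(1); // strip a byte order mark left over from decoding
        }
        String body = XML_DECL.matcher(xml).replaceFirst("");
        // The encoding attribute is not needed because we parse from characters, not bytes.
        return "<?xml version=\"1.1\"?>" + body;
    }

    /** Parses the text as XML 1.1, so references such as &#4; are accepted. */
    static Document parseAsXml11(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(forceVersion11(xml))));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseAsXml11("<?xml version=\"1.0\"?><BATCHNAME>&#4; Any</BATCHNAME>");
        String text = doc.getDocumentElement().getTextContent();
        System.out.println((int) text.charAt(0)); // code point of the first character
    }
}
```

Note that the resulting document is XML 1.1, so anything that later serializes or consumes it has to accept XML 1.1 as well. If the sender can be changed, having it emit the version="1.1" declaration itself would be cleaner than patching the text on the receiving side.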