
I am trying to parse an XML file that contains the numeric character references &#8212; and &#8217;. On parsing, they come out as "?". It is not only these two: any HTML/XML numeric character reference in the XML causes this issue. Only the predefined entities are accepted by the SAX parser.

I use a SAX parser with a DefaultHandler. A System.out call in the characters method shows me a question mark for the numeric character references.

I did a lot of googling, and everywhere I look says that using numeric character references should not cause any issue.

Any help?

1 Answer


System.out in character method shows me a question mark for the numeric character references.

That sounds like a character encoding problem with your output/console. The following works with Java SE 7:

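    import java.io.FileReader;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;

    public class SaxTest {
        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

            XMLReader reader = parser.getXMLReader();
            // DefaultHandler supplies no-op versions of the other callbacks
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length)
                        throws SAXException {
                    // Printed using the encoding System.out was set up with
                    System.out.println(new String(ch, start, length));
                }
            });

            // try-with-resources closes the reader even if parsing fails
            try (FileReader fReader = new FileReader("/tmp/HelloWorld.xml")) {
                reader.parse(new InputSource(fReader));
            }
        }
    }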

With XML File:

<?xml version="1.0" encoding="UTF-8"?>
<Test>
Hello World&#8217;
</Test>

Output: Hello World’

Have you tried looking at the incoming character array using a debugger?
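If attaching a debugger is awkward, a rough alternative is to dump the parsed characters as hex code points, which prints only ASCII and so does not depend on the console encoding. This is a minimal sketch, not the asker's code: the class name `CodePointDump` and the helpers `toHex` and `parseText` are made up for illustration, and it assumes Java 8 or later for `codePoints()`.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
public class CodePointDump {

    // Formats every code point of the text as U+XXXX, separated by spaces.
    static String toHex(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Parses the XML string and returns all character data concatenated.
    static String parseText(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }
                });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String text = parseText("<Test>a&#8212;b&#8217;</Test>");
        // Prints: U+0061 U+2014 U+0062 U+2019
        System.out.println(toHex(text));
    }
}
```

If the dump shows U+2014 and U+2019, the parser resolved the references correctly and the "?" is most likely introduced later, when the characters are written out.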

andih
  • `char[] ch` holds two values: one is ? and the other is ^@ – user1400021 May 17 '12 at 04:55
  • It still looks like an encoding problem. What are the "real" values of the character array? You can print the hex values of the characters using something like `for (int idx = 0; idx < length; ++idx) System.out.println(String.format("%h %c",ch[start + idx],ch[start+idx]));` – andih May 17 '12 at 05:13
  • My print statement is `System.out.println("ch====="+String.format("%h %c",ch[start + i],ch[start+i]));` and the output is: ch=====2014 ? ch=====0 ^@ – user1400021 May 17 '12 at 05:27
  • 8212 = 0x2014 is a "—"; 8217 = 0x2019 is a "’"; 0 is a non-printable character and usually marks the end of a string. The 0 and its representation as ^@ look very strange. Can you post your code and your XML file? – andih May 17 '12 at 05:44
  • The same code works fine in my local workspace, but not on our Linux batch server. – user1400021 May 17 '12 at 06:56
  • I am unable to post my code. It is very similar to the one you have shown above. – user1400021 May 17 '12 at 07:01
  • Without any knowledge of your code, your batch server, its environment (charsets, JVM), your XML file, the encoding of your XML file, and so on, it's nearly impossible to provide any help. – andih May 17 '12 at 07:08
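The thread never confirms the cause, but since the hex dump shows 2014 (so the reference itself was parsed correctly) and the code behaves differently only on the Linux server, one thing worth checking is the charset the JVM uses for output there. The sketch below assumes the server's default charset cannot represent these characters (for instance an ASCII-only locale); the class name `Utf8Console` and the helper `encodeVia` are hypothetical. It shows how an unmappable character becomes "?" on output, and how wrapping `System.out` in a `PrintStream` with an explicit encoding avoids relying on the default.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class Utf8Console {

    // Returns the bytes a PrintStream with the given charset writes for the text.
    static byte[] encodeVia(String text, Charset charset) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(buffer, true, charset.name())) {
            out.print(text);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // The charset the JVM picked up from its environment.
        System.out.println("default charset: " + Charset.defaultCharset());

        // With an ASCII-only charset, U+2014 cannot be encoded and is
        // replaced by a single '?' byte (0x3F).
        byte[] ascii = encodeVia("\u2014", StandardCharsets.US_ASCII);
        System.out.println("ASCII bytes: " + ascii.length
                + ", first byte: 0x" + Integer.toHexString(ascii[0] & 0xFF));

        // Writing through an explicit UTF-8 stream instead of the default;
        // the terminal or log viewer still has to interpret it as UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("Hello World\u2019");
    }
}
```

If the default charset printed on the batch server differs from the one in the local workspace, that difference would be a plausible explanation for the "?" output, though it does not account for the stray 0 character seen in the dump.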