While parsing XML with a SAX parser in Java, I cannot get the text exactly as it appears in the XML. The problem occurs when a node's text contains Unicode characters.

node.getTextContent() splits the content at each Unicode character and drops the whitespace between two adjacent Unicode characters.

For example, suppose the node contains oro-maxilo-facială și implantologie. Note the space between facială and și.

The method node.getTextContent() returns the string as oro-maxilo-facialăși implantologie (no whitespace).

Below is the code I tried.

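// Concatenates the text content of every child node of the given element.
private String getNodeContent(Element nodeToSerialize) {
    // StringBuilder is enough here; the buffer never leaves this method.
    StringBuilder sb = new StringBuilder();
    NodeList nodeList = nodeToSerialize.getChildNodes();
    // An element without children yields an empty list, so no extra check is needed.
    for (int x = 0; x < nodeList.getLength(); x++) {
        Node node = nodeList.item(x);
        sb.append(node.getTextContent());
    }
    return sb.toString();
}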

The XML content is:

<record>
    <isbn>1234-5689</isbn>
    <titles>
        <title>Revista de chirurgie oro-maxilo-facial&#x103; &#x219;i implantologie</title>
    </titles>
    <number>16</number>
</record>
James Jithin
RKrishna
  • Please post some code so we can see what you tried so far. –  Feb 02 '12 at 08:37
  • I have modified the post to include code. Please have a look – RKrishna Feb 02 '12 at 08:41
  • I'm sorry. Posted XML content too – RKrishna Feb 02 '12 at 09:05
  • Seems like other people have this problem, too: http://stackoverflow.com/questions/5527195/java-dom-gettextcontent-issue –  Feb 02 '12 at 09:07
  • I am using Apache Digester to parse. Digester splits the node text into four strings at each Unicode character, then trims each string and appends it to the previous one before returning. In our case the title is split into four strings: 1 - Revista de chirurgie oro-maxilo-facial, 2 - ă, 3 - ș, 4 - i implantologie. When the third string is trimmed, the whitespace is lost. Is there any way to prevent this and treat everything as one string? – RKrishna Feb 02 '12 at 11:02
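
One way to check whether the whitespace is lost in the XML parser or in a layer on top of it is to parse the same title with the JDK's DOM parser alone. The sketch below assumes only the standard javax.xml.parsers API; the class name DomTitleCheck and the method titleText are hypothetical names used for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: reads the text of the first <title> element
// using only the JDK's DOM parser (no Digester involved).
class DomTitleCheck {

    static String titleText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // getTextContent() returns the concatenated text of the element,
            // with character references already resolved.
            return doc.getElementsByTagName("title").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<record><titles><title>Revista de chirurgie "
                + "oro-maxilo-facial&#x103; &#x219;i implantologie</title></titles></record>";
        String title = titleText(xml);
        // True when the space between the two character references survives.
        System.out.println(title.contains("\u0103 \u0219"));
    }
}
```

If the space survives here but not in the application, the loss is most likely happening in whatever processes the parser's output afterwards.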

1 Answer

The problem is in Commons Digester 1.8. Use commons-digester1.8.1.jar instead of commons-digester1.8.jar; that resolves the whitespace-swallowing issue.
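
For anyone handling SAX events directly, the same pitfall can appear in hand-written handlers: a SAX parser is allowed to deliver one element's text in several characters() calls, so trimming each chunk separately can drop a space that sits at a chunk boundary. The following is a minimal sketch of that general pattern, not Digester's actual code; the names TitleTextDemo, TitleHandler, readTitle and the trimPerChunk flag are hypothetical. With the flag off, chunks are accumulated untouched and trimmed once at the end of the element.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: collects the text of <title> from SAX events.
class TitleTextDemo {

    static class TitleHandler extends DefaultHandler {
        private final boolean trimPerChunk; // true imitates the problematic pattern
        private final StringBuilder text = new StringBuilder();
        private boolean inTitle;
        String result;

        TitleHandler(boolean trimPerChunk) {
            this.trimPerChunk = trimPerChunk;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("title".equals(qName)) {
                inTitle = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (!inTitle) {
                return;
            }
            String chunk = new String(ch, start, length);
            // Trimming here may discard whitespace at a chunk boundary;
            // appending the raw chunk keeps it.
            text.append(trimPerChunk ? chunk.trim() : chunk);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                // Trim once, after all chunks have been joined.
                result = text.toString().trim();
                inTitle = false;
            }
        }
    }

    static String readTitle(String xml, boolean trimPerChunk) {
        TitleHandler handler = new TitleHandler(trimPerChunk);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return handler.result;
    }

    public static void main(String[] args) {
        String xml = "<record><titles><title>Revista de chirurgie "
                + "oro-maxilo-facial&#x103; &#x219;i implantologie</title></titles></record>";
        System.out.println(readTitle(xml, false));
        // Output of the per-chunk variant depends on how the parser splits the text.
        System.out.println(readTitle(xml, true));
    }
}
```

How the text is split into chunks is up to the parser, so the per-chunk variant may or may not lose the space on a given setup; accumulating first and trimming once does not depend on the chunking.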

RKrishna