How to get text from an XML node without trimming whitespace between two Unicode characters

Question

While parsing XML with a SAX parser in Java, I am not able to get the text exactly as it appears in the XML. The problem occurs when a node contains text with some Unicode characters.

node.getTextContent() splits the content at the Unicode characters and trims the whitespace between two Unicode characters.

Suppose the node contains the text oro-maxilo-facială și implantologie. Note the space between ă and ș.

node.getTextContent() returns the string oro-maxilo-facialăși implantologie (no whitespace).

Below is the code I tried.

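// Requires: org.w3c.dom.Element, org.w3c.dom.Node, org.w3c.dom.NodeList
private String getNodeContent(Element nodeToSerialize) {
    // StringBuilder is sufficient here; no synchronization is needed.
    StringBuilder sb = new StringBuilder();
    if (nodeToSerialize.hasChildNodes()) {
        NodeList nodeList = nodeToSerialize.getChildNodes();
        // Concatenate the text content of every child node, untrimmed.
        for (int x = 0; x < nodeList.getLength(); x++) {
            Node node = nodeList.item(x);
            sb.append(node.getTextContent());
        }
    }
    return sb.toString();
}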

The XML content is:

<record>
    <isbn>1234-5689</isbn>
    <titles>
        <title>Revista de chirurgie oro-maxilo-facial&#x103; &#x219;i implantologie</title>
    </titles>
    <number>16</number>
</record>

I have modified the post to include code. Please have a look — RKrishna, Feb 02 '12 at 08:41
Seems like other people have this problem, too: http://stackoverflow.com/questions/5527195/java-dom-gettextcontent-issue — , Feb 02 '12 at 09:07
I am using Apache Digester to parse. Digester splits the node data into separate strings at each Unicode character, then trims each string and appends it to the previous one before returning the result. In our case the title is split into four strings: 1 - Revista de chirurgie oro-maxilo-facial, 2 - ă, 3 - ș, 4 - i implantologie. When the third string is trimmed, the whitespace is lost. Is there any way to prevent this and treat it all as one string? — RKrishna, Feb 02 '12 at 11:02

score 0 · Accepted Answer · answered Feb 06 '12 at 08:44

The problem is in Commons Digester 1.8. Use commons-digester1.8.1.jar instead of commons-digester1.8.jar; that resolves the whitespace-swallowing issue.
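
For anyone who wants to see the mechanism in isolation, here is a minimal sketch using only the JDK's SAX parser. It is not Digester's actual code; the class and method names (TitleTextDemo, joinThenTrim, trimThenJoin) are made up for illustration. A SAX parser is allowed to report the text of one element in several characters() calls, and character references such as &#x103; are a common place for it to split. Trimming each chunk before appending can therefore drop a space that sits at a chunk boundary, while appending first and trimming once at the end keeps it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TitleTextDemo {

    /** Records every characters() chunk reported inside a title element. */
    static class TitleHandler extends DefaultHandler {
        final List<String> chunks = new ArrayList<>();
        private boolean inTitle;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("title".equals(qName)) {
                inTitle = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                inTitle = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for a single text node.
            if (inTitle) {
                chunks.add(new String(ch, start, length));
            }
        }
    }

    /** Parses the XML and returns the raw chunks; chunk boundaries are parser-dependent. */
    static List<String> parseChunks(String xml) {
        try {
            TitleHandler handler = new TitleHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.chunks;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Safe: append all chunks, then trim only the outer edges once. */
    static String joinThenTrim(List<String> chunks) {
        StringBuilder sb = new StringBuilder();
        for (String chunk : chunks) {
            sb.append(chunk);
        }
        return sb.toString().trim();
    }

    /** Lossy: trims each chunk, so a space at a chunk boundary disappears. */
    static String trimThenJoin(List<String> chunks) {
        StringBuilder sb = new StringBuilder();
        for (String chunk : chunks) {
            sb.append(chunk.trim());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String xml = "<record><titles><title>Revista de chirurgie oro-maxilo-facial&#x103; "
                + "&#x219;i implantologie</title></titles></record>";
        List<String> chunks = parseChunks(xml);
        System.out.println(chunks.size() + " chunk(s) reported");
        System.out.println(joinThenTrim(chunks));
        System.out.println(trimThenJoin(chunks));
    }
}
```

How many chunks you get depends on the parser, so the second printed line may or may not lose the space on your setup. The first one should always match the text in the XML, which is why accumulating in a StringBuilder and trimming only in endElement is the usual pattern for hand-written handlers.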

RKrishna
