Extracting Text Nodes From XML File Using SAX Parser in JAVA

Question

I am currently using SAX to extract some information from a number of XML documents. So far, extracting attribute values has been easy. However, I have no clue how to extract the actual value of a text node.

For example, in the given XML document:

  <w:rStyle w:val="Highlight" />
  </w:rPr>
  </w:pPr>
  <w:r>
  <w:t>Text to Extract</w:t>
  </w:r>
  </w:p>
  <w:p w:rsidR="00B41602" w:rsidRDefault="00B41602" w:rsidP="007C3A42">
  <w:pPr>
  <w:pStyle w:val="Copy" />

I can extract "Highlight" no problem by getting the value from val. But I have no idea how to get into that text node and get out "Text to Extract".

Here is my Java code thus far to pull out the attribute values...

private static final class SaxHandler extends DefaultHandler 
    {
        // invoked when document-parsing is started:
        public void startDocument() throws SAXException 
        {
            System.out.println("Document processing starting:");
        }

        // notifies about finish of parsing:
        public void endDocument() throws SAXException 
        {
            System.out.println("Document processing finished. \n");
        }

        // we enter to element 'qName':
        public void startElement(String uri, String localName, 
                String qName, Attributes attrs) throws SAXException 
        {
            if(qName.equalsIgnoreCase("Relationships"))
            {
                // do nothing
            }
            else if(qName.equalsIgnoreCase("Relationship"))
            {
                // goes into the element and if the attribute is equal to "Target"...
                String val = attrs.getValue("Target");
                // ...and the value is not null
                if(val != null)
                {
                    // ...and if the value contains "image" in it...
                    if (val.contains("image"))
                    {
                        // ...then get the id value
                        String id = attrs.getValue("Id");
                        // ...and use the substring method to isolate and print out only the image & number
                        int begIndex = val.lastIndexOf("/");
                        int endIndex = val.lastIndexOf(".");
                        System.out.println("Id: " + id + " & Target: " + val.substring(begIndex+1, endIndex));
                    }
                }
            }
            else 
            {
                throw new IllegalArgumentException("Element '" + 
                        qName + "' is not allowed here");
            }
        }

        // we leave element 'qName' without any actions:
        public void endElement(String uri, String localName, String qName) throws SAXException 
        {
            // do nothing;
        }
     }

But I have no clue where to start to get into that text node and pull out the values inside. Anyone have some ideas?

Have you considered using XPath? It is a lot easier... – vtd-xml-author May 24 '16 at 00:08
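
For comparison, here is a minimal sketch of what the XPath route suggested in the comment above might look like. It is not from the original thread. The class name XPathTextDemo, the method extractText, and the namespace URI bound to the w prefix are assumptions made for illustration (the fragment in the question does not show its namespace declaration). The expression matches on local-name() so no namespace context has to be registered.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class XPathTextDemo {

    // Returns the text of every element whose local name is "t" (for example <w:t>).
    static List<String> extractText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // so that local-name() yields "t" for <w:t>
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[local-name()='t']", doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Checked parser/XPath exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Assumed namespace URI for the "w" prefix; the question does not show it.
        String xml = "<w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">"
                + "<w:r><w:t>Text to Extract</w:t></w:r></w:p>";
        System.out.println(extractText(xml));
    }
}
```

The trade-off is that this builds the whole document in memory as a DOM tree, whereas SAX streams through it, so for very large files the SAX approach may still be preferable.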

score 5 · Accepted Answer · answered Jun 29 '11 at 21:47

Here's some pseudo-code:

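// These members go inside your DefaultHandler subclass.
private boolean insideElementContainingTextNode;
private StringBuilder textBuilder;

@Override
public void startElement(String uri, String localName, String qName, Attributes attrs) {
    // With a non-namespace-aware parser the name arrives in qName ("w:t").
    // A namespace-aware parser also reports localName ("t").
    if ("w:t".equals(qName)) {
        insideElementContainingTextNode = true;
        textBuilder = new StringBuilder();
    }
}

@Override
public void characters(char[] ch, int start, int length) {
    // May be called several times for one text node, so append rather than assign.
    if (insideElementContainingTextNode) {
        textBuilder.append(ch, start, length);
    }
}

@Override
public void endElement(String uri, String localName, String qName) {
    if ("w:t".equals(qName)) {
        insideElementContainingTextNode = false;
        // The element is closed, so the builder now holds its full text.
        String theCompleteText = this.textBuilder.toString();
        this.textBuilder = null;
        // Use theCompleteText here, or store it in an instance variable.
    }
}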
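
On the "qName or localName?" question: which argument carries the element name depends on whether the parser is namespace-aware. The sketch below is one way to check this on your own setup. The class ElementNameDemo, the method reportNames, and the namespace URI are assumptions made for illustration and are not part of the answer above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class ElementNameDemo {

    // Returns "localName|qName" for every start tag the parser reports.
    static List<String> reportNames(String xml, boolean namespaceAware) {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                names.add(localName + "|" + qName);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Checked parser exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumed namespace URI for the "w" prefix; the question does not show it.
        String xml = "<w:r xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">"
                + "<w:t>Text to Extract</w:t></w:r>";
        System.out.println("not namespace-aware: " + reportNames(xml, false));
        System.out.println("namespace-aware:     " + reportNames(xml, true));
    }
}
```

With the parser bundled in the JDK, the non-namespace-aware run should show an empty localName and "w:t" as qName, while the namespace-aware run should show "t" as localName. A factory obtained from SAXParserFactory.newInstance() is not namespace-aware unless you ask for it, which is why comparing against qName works in the answer's code.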


JB Nizet


Hmm, I tried that, but it didn't extract any text. Can you explain what that code is supposed to do? – This 0ne Pr0grammer Jun 29 '11 at 22:07
In startElement, you check whether the parser has started reading the element containing the text node you want to extract. If so, you set a boolean variable to true. This way, the characters method knows that it's inside the appropriate element, and it stores the text it reads in a StringBuilder. The endElement method is called when the end of the element is reached. You can then get the contents of the StringBuilder and store them wherever you want. I only stored them in a local variable (theCompleteText), but you may store them in an instance variable if you need to. – JB Nizet Jun 30 '11 at 07:09
You can get rid of that boolean and test `if (textBuilder != null)` in the characters method instead. – daiscog Nov 13 '14 at 11:51
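
Putting the comments above together, a complete handler without the boolean might look like the following sketch. The class name TextRunCollector, the method collect, and the choice to gather the results in a list are assumptions made for illustration. It uses the default, non-namespace-aware parser, so it compares against qName.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only.
class TextRunCollector extends DefaultHandler {

    private final List<String> runs = new ArrayList<>();
    private StringBuilder textBuilder; // non-null only while inside a <w:t> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("w:t".equals(qName)) {
            textBuilder = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The null check replaces the boolean flag.
        if (textBuilder != null) {
            textBuilder.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("w:t".equals(qName)) {
            runs.add(textBuilder.toString());
            textBuilder = null;
        }
    }

    // Parses the given XML and returns the text of each <w:t> element in document order.
    static List<String> collect(String xml) {
        try {
            TextRunCollector handler = new TextRunCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.runs;
        } catch (Exception e) {
            // Checked parser exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:p><w:r><w:t>Text to Extract</w:t></w:r></w:p>";
        System.out.println(collect(xml));
    }
}
```

If this logic is merged into the handler from the question, note that its final else branch throws IllegalArgumentException for every element other than Relationships and Relationship, so that branch would have to be removed or relaxed before any w:t element could be processed.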
