I need to capture the text within the <page> tags of my XML file: the whole text, including other tags, their attributes, etc. I could do this using, for example, regular expressions, but I need this to be safe, so I would like to use SAXParser.

But I'm afraid that the information a ContentHandler receives from SAXParser isn't enough to do this (the cursor position at the start of each XML tag found, for example, would help a lot).

So, is there any other, safe way?

Instead of the text within <page>, it could be, for example, a DOM tree, but I would prefer the first way, for performance.

Krzysztof Stanisławek

1 Answer

Okay, what I would do first is write a custom DefaultHandler, something like the following:

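import java.util.ArrayList;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PrintXMLwithSAX extends DefaultHandler {

  // Nesting depth of <page> elements; -1 means we are outside any <page>.
  private int embedded = -1;
  private StringBuilder sb = new StringBuilder();
  private final ArrayList<String> pages = new ArrayList<String>();

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
      if (qName.equals("page")) {
          embedded++;
      }
      // Attributes are not written out in this particular example.
      if (embedded >= 0) sb.append("<" + qName + ">");
  }

  @Override
  public void characters(char[] ch, int start, int length) throws SAXException {
      // Only the slice [start, start + length) of ch belongs to this event.
      if (embedded >= 0) sb.append(ch, start, length);
  }

  @Override
  public void endElement(String uri, String localName, String qName) throws SAXException {
      if (embedded >= 0) sb.append("</" + qName + ">");
      if (qName.equals("page")) {
          embedded--;
          // Store the page only when the outermost <page> has just closed.
          // Checking here, rather than on every end tag, avoids adding
          // empty strings for elements that close outside any <page>.
          if (embedded == -1) {
              pages.add(sb.toString());
              sb = new StringBuilder();
          }
      }
  }

  public ArrayList<String> getPages() {
      return pages;
  }

}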

While parsing, the SAX parser walks through the document and calls the handler's startElement(), characters(), endElement() and a few other methods. The code above checks in startElement() whether the element is a <page> element and, if so, increments embedded by 1. Each method then checks whether embedded is >= 0. If it is, we are inside a <page>, so the element tags (excluding attributes in this particular example) and the characters inside each element are appended to the StringBuilder. endElement() decrements embedded when it reaches a closing </page> tag. When embedded falls back down to -1, the outermost <page> has closed, so we add the contents of the StringBuilder to the ArrayList pages and start a fresh StringBuilder to await the next <page> element.

Then you'll need to run the handler and retrieve the ArrayList of strings containing your <page> elements, like so:

    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();
    PrintXMLwithSAX handler = new PrintXMLwithSAX();
    InputStream input = new FileInputStream("C:\\Users\\me\\Desktop\\xml.xml");
    saxParser.parse(input, handler);
    ArrayList<String> myPageElements = handler.getPages();

Now myPageElements is an ArrayList containing all page elements and their contents as strings.
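
The handler above leaves attributes out, while the question asks for them. Here is a minimal sketch of a variant that also writes attributes and re-escapes text, since the parser hands text over already unescaped (`&amp;` arrives as `&`). The class name `PageCapturingHandler`, its `escape` helper and the static `capture` method are my own assumptions for illustration, not part of any library. It is not a byte-for-byte copy of the input: comments are dropped, and CDATA sections and self-closing tags come out in normalised form.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical variant of the handler above: writes attributes and re-escapes text.
class PageCapturingHandler extends DefaultHandler {

    private int depth = 0;                         // number of currently open <page> elements
    private StringBuilder sb = new StringBuilder();
    private final List<String> pages = new ArrayList<>();

    // Re-escape characters that the parser has already unescaped.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("page")) depth++;
        if (depth > 0) {
            sb.append('<').append(qName);
            // Write each attribute as name="value".
            for (int i = 0; i < atts.getLength(); i++) {
                sb.append(' ').append(atts.getQName(i))
                  .append("=\"").append(escape(atts.getValue(i))).append('"');
            }
            sb.append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) sb.append(escape(new String(ch, start, length)));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) sb.append("</").append(qName).append('>');
        // Store the page only when the outermost <page> closes.
        if (qName.equals("page") && --depth == 0) {
            pages.add(sb.toString());
            sb = new StringBuilder();
        }
    }

    List<String> getPages() {
        return pages;
    }

    // Convenience helper (hypothetical): parse an XML string and return the captured pages.
    static List<String> capture(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        PageCapturingHandler handler = new PageCapturingHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getPages();
    }
}
```

Calling `PageCapturingHandler.capture(xml)` on a document with two top-level <page> elements should give a list of two strings, each starting with its own `<page ...>` tag. Nested <page> elements stay inside the string of their outermost <page>.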

I hope this helps.

Rudi Kershaw
  • I was sure that `characters()` method can't help me (because it delivers only plain no-xml text). I will check if this works and then accept your answer, thanks! – Krzysztof Stanisławek Jun 06 '14 at 15:56
  • @KrzysztofStanisławek - I spotted a mistake and have updated the answer. I've updated the `characters()` method to return the correct text instead of *all* the text from the XML. : ) – Rudi Kershaw Jun 06 '14 at 16:05
  • I already accepted, but I see a problem with `wholePage = wholePage.substring(0, wholePage.indexOf("</page>"));`. What if `</page>` is part of the text? Can we be sure that every occurrence of this string is a closing tag? But when char[] ch is an array of characters from the whole document, the problem is pretty easy to solve: I can use endElement(). I really should have read the documentation for `characters(...)` earlier. Thanks! – Krzysztof Stanisławek Jun 06 '14 at 16:18
  • I updated this answer because I realised that the old answer suffered from the same issues as using regex would (which is bad). The new code will correctly identify page elements even if those page elements have page elements themselves :S – Rudi Kershaw Jun 16 '14 at 18:02