Okay, what I would do first is create a custom DefaultHandler, something like the following:
    import java.util.ArrayList;

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class PrintXMLwithSAX extends DefaultHandler {

        // Nesting depth of <page> elements; -1 means we are outside any <page>.
        private int embedded = -1;
        private StringBuilder sb = new StringBuilder();
        private final ArrayList<String> pages = new ArrayList<String>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
            if (qName.equals("page")) {
                embedded++;
            }
            if (embedded >= 0) sb.append("<" + qName + ">");
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (embedded >= 0) sb.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (embedded >= 0) sb.append("</" + qName + ">");
            if (qName.equals("page")) {
                embedded--;
                // Store the page only when the outermost <page> closes. Doing this
                // check outside the if would also add an empty string for every
                // element that ends outside a page (the root element, for example).
                if (embedded == -1) {
                    pages.add(sb.toString());
                    sb = new StringBuilder();
                }
            }
        }

        public ArrayList<String> getPages() {
            return pages;
        }
    }
While the document is being parsed, the SAX parser walks through each element and calls the handler's startElement(), characters(), endElement() and a few other methods. The code above checks in startElement() whether the element is a <page> element and, if so, increments embedded by 1. After that, each method checks whether embedded is >= 0. If it is, it appends the characters inside each element, as well as their tags (excluding attributes in this particular example), to the StringBuilder object. endElement() decrements embedded when it reaches a closing </page> tag. If embedded falls back down to -1, we know that we are no longer inside a series of page elements, so we add the contents of the StringBuilder to the ArrayList pages and start a fresh StringBuilder to await the next <page> element.
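The handler above drops attributes. If you need them, something along these lines might work. Treat it as a rough sketch: PageHandlerWithAttributes, escape() and extract() are names I made up for illustration, the escaping is deliberately minimal, and it assumes the default (non namespace-aware) parser so that qName is populated.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical variant of the handler above that also copies attributes.
class PageHandlerWithAttributes extends DefaultHandler {
    private int embedded = -1;               // -1 means "outside any <page>"
    private StringBuilder sb = new StringBuilder();
    private final List<String> pages = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equals("page")) embedded++;
        if (embedded < 0) return;
        sb.append('<').append(qName);
        // Attributes are read by index: getQName(i) is the name, getValue(i) the value.
        for (int i = 0; i < attributes.getLength(); i++) {
            sb.append(' ').append(attributes.getQName(i))
              .append("=\"").append(escape(attributes.getValue(i))).append('"');
        }
        sb.append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (embedded >= 0) sb.append(escape(new String(ch, start, length)));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (embedded >= 0) sb.append("</").append(qName).append('>');
        // Store the page once the outermost <page> has been closed.
        if (qName.equals("page") && --embedded == -1) {
            pages.add(sb.toString());
            sb = new StringBuilder();
        }
    }

    List<String> getPages() {
        return pages;
    }

    // The parser hands over decoded text, so re-escape the markup characters.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    // Convenience helper (an assumption, not part of SAX) for in-memory documents.
    static List<String> extract(String xml) {
        try {
            PageHandlerWithAttributes handler = new PageHandlerWithAttributes();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getPages();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

The only real change is the loop in startElement(); escape() is there because the parser gives you decoded text, so a literal & or < would otherwise end up unescaped in the output string.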
Then you'll need to run the handler and retrieve your ArrayList of strings containing your <page> elements, like so:
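    // Needs javax.xml.parsers.SAXParser, javax.xml.parsers.SAXParserFactory,
    // java.io.FileInputStream, java.io.InputStream and java.util.ArrayList.
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();
    PrintXMLwithSAX handler = new PrintXMLwithSAX();
    // try-with-resources closes the stream even if parsing fails
    try (InputStream input = new FileInputStream("C:\\Users\\me\\Desktop\\xml.xml")) {
        saxParser.parse(input, handler);
    }
    ArrayList<String> myPageElements = handler.getPages();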
Now myPageElements is an ArrayList containing all page elements and their contents as strings.
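If you want to try this without a file on disk, a small self-contained test along the following lines could help. It is only a sketch: PageExtractionDemo, PageHandler and extractPages() are hypothetical names, the nested handler is a compact copy of the logic above, and the sample XML is invented to show that elements outside a page are skipped and a nested <page> stays inside its parent's string.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original answer's code.
class PageExtractionDemo {

    // Compact copy of the handler logic described above.
    static class PageHandler extends DefaultHandler {
        private int embedded = -1;
        private StringBuilder sb = new StringBuilder();
        final List<String> pages = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("page")) embedded++;
            if (embedded >= 0) sb.append('<').append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (embedded >= 0) sb.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (embedded >= 0) sb.append("</").append(qName).append('>');
            if (qName.equals("page") && --embedded == -1) {
                pages.add(sb.toString());
                sb = new StringBuilder();
            }
        }
    }

    // Parses an in-memory XML string and returns one string per top-level <page>.
    static List<String> extractPages(String xml) {
        try (InputStream input = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            PageHandler handler = new PageHandler();
            saxParser.parse(input, handler);
            return handler.pages;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<wiki><page><title>A</title></page>"
                   + "<meta>skipped</meta>"
                   + "<page><title>B</title><page><title>B.1</title></page></page></wiki>";
        // Prints one line per top-level <page>; <meta> is outside any page and is ignored.
        for (String page : extractPages(xml)) {
            System.out.println(page);
        }
    }
}
```

With that sample input you get two strings back: the first page on its own, and the second page with its nested page still inside it, because embedded only returns to -1 when the outermost </page> is reached.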
I hope this helps.