So I am doing some data cleaning, on a series of XML documents using StAX. I want to essentially read in the document and spit out the exact same document with a few of tags missing. The problem I'm having is that I'm not outputting valid XML.
You can see my output on the left, and original doc on the right [here] (https://i.stack.imgur.com/aOptO.jpg). The image on the bottom is also the output from xmllint -valid. As you can see it says, that there is no DTD found, and that there's extra content at the end of the document.
My code to implement the writer is this
public XMLEventWriter setUpWriter(File blah) throws FileNotFoundException, XMLStreamException {
newFileName = thef.getName().substring(0, thef.getName().indexOf("_") + 1);
try {
writer = outputFactory
.createXMLEventWriter(new FileOutputStream(newFileName + "mush.xml"), "UTF-8");
} catch (XMLStreamException ex) {
ex.printStackTrace();
System.out.println("There was an XML Stream Exception, whatever that means for writer");
}
//outputFactory.setProperty("escapeCharacters", false);
eventFactory = XMLEventFactory.newInstance();
StartDocument startDocument = eventFactory.createStartDocument();
writer.add(startDocument);
//writer.add("<!DOCTYPE DjVuXML>");
return writer;
}
This is my code that handles the actual writing.
if (event.isStartElement()) { //first it looks for start elements
StartElement se = event.asStartElement();
if ("OBJECT".equals(se.getName().getLocalPart())) {
writer.add(se);
} else if ("MAP".equals(se.getName().getLocalPart())) {
writer.add(se);
} else if ("PARAM".equals(se.getName().getLocalPart())) {
writer.add(se);
} else if ("LINE".equals(se.getName().getLocalPart())) {
writer.add(se);
} else if ("DjVuXML".equals(se.getName().getLocalPart())) {
writer.add(se);
}else if ("WORD".equals(se.getName().getLocalPart())) {
word.text = reader.getElementText();
EndElement wordEnd = eventFactory.createEndElement("", "", "WORD");
writer.add(se);
Characters characters = eventFactory.createCharacters(word.text);
writer.add(characters);
writer.add(wordEnd);
}
}
} else if (event.isEndElement()) {
EndElement ee = event.asEndElement();
if ("MAP".equals(ee.getName().getLocalPart())) {
writer.add(ee);
} else if ("DjVuXML".equals(ee.getName().getLocalPart())) {
writer.add(ee);
} else if ("LINE".equals(ee.getName().getLocalPart())) {
writer.add(ee);
}
else if ("BODY".equals(ee.getName().getLocalPart())) {
writer.add(ee);
}
}
}
writer.flush();
writer.close();
Now that we've got that out of the way my question is twofold:
1) Is my output not valid because it lacks the DTD?
1a) if Yes how do I include the DTD? Even if No tell me, this has been bothering me
2)If its not the DTD then how the heck do I get this thing valid.
Thanks for your help!!