VTD-XML seems to be spoiling escaped string in XML document

Question

I am working on an XML data set (the DrugBank database available here) where some fields contain escaped XML characters like "&amp;", etc.

To make the problem more concrete, here is an example scenario:

<drugs>
    <drug>
        <drugbank-id>DB00001</drugbank-id>
        <general-references>
            # Askari AT, Lincoff AM: Antithrombotic Drug Therapy in Cardiovascular Disease. 2009 Oct; pp. 440&#x2013;. ISBN 9781603272346. "Google books":http://books.google.com/books?id=iadLoXoQkWEC&amp;pg=PA440.
        </general-references>
        .
    </drug>
    <drug>
    ...
    </drug>
    ...
</drugs>

Since the entire document is huge, I am parsing it as follows:

VTDGen gen = new VTDGen();
try {
    gen.setDoc(Files.readAllBytes(DRUGBANK_XML));
    gen.parse(true);
} catch (IOException | ParseException e) {
    SystemHelper.exitWithMessage(e, "Unable to process Drugbank XML data. Aborting.");
}
VTDNav nav = gen.getNav();
AutoPilot pilot = new AutoPilot(nav);
pilot.selectXPath("//drugs/drug");
while (pilot.evalXPath() != -1) {
    long fragment = nav.getContentFragment();
    String drugXML = nav.toString((int) fragment, (int) (fragment >> 32));
    System.out.println(drugXML);
    finerParse(drugXML); // another method handling a more detailed data analysis
}

When I tested the finerParse method with sample XML (snippets copy-pasted from the same data), it worked fine. But when called from the above code, it failed with the error message Errors in Entity: Illegal entity char. Upon printing the input to finerParse (i.e., the drugXML string), I noticed that the string "&amp;pg=PA440" in the original XML had been changed to "&pg=PA440".

Why is this happening? All I am doing is parsing it with a very well-known parser.

P.S. I have found an alternate solution where I am simply passing the VTDNav as the argument to finerParse instead of first obtaining the content string and passing that string. But I am still curious about what is going wrong with the above approach.

One more suggestion: never pass a string; you should pass the byte segment instead. Passing a string into another function is not efficient. — vtd-xml-author, Jan 08 '15 at 20:19
Thank you for that suggestion. My current approach is to pass the VTDNav itself to `finerParse`. I have not done any benchmarking, but intuitively, that should be the most efficient method. I just need to be careful that I use the `toElement(int, String)` method correctly. Please correct me if I am wrong here. — Chthonic Project, Jan 08 '15 at 20:48

score 1 · Accepted Answer · answered Jan 08 '15 at 03:38

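Use vtdNav.toRawString() instead of vtdNav.toString() and the problem should go away. toString() resolves entity and character references (so "&amp;" becomes "&"), whereas toRawString() returns the text exactly as it appears in the document. Let me know whether it works.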
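
To illustrate why the re-parse fails, here is a minimal sketch that uses only the JDK's built-in parser rather than VTD-XML. The class name `EntityDecodingDemo` and the helper `resolveAmp` are hypothetical; `resolveAmp` is only a stand-in for an entity-resolving conversion (it handles "&amp;" and nothing else) and is not how VTD-XML is implemented. The point is simply that a fragment whose entities have been resolved is no longer well-formed XML when fed to a second parser, while the raw fragment still is.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class EntityDecodingDemo {

    /** Returns true if the given string parses as a well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. a bare '&' in text content
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical stand-in for entity resolution; only handles &amp;. */
    static String resolveAmp(String raw) {
        return raw.replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // Raw text as it appears in the source document.
        String raw =
                "<ref>http://books.google.com/books?id=iadLoXoQkWEC&amp;pg=PA440</ref>";
        // What the second parser sees if entities were resolved first.
        String resolved = resolveAmp(raw);

        System.out.println(isWellFormed(raw));      // true
        System.out.println(isWellFormed(resolved)); // false: "&pg=" is not a valid reference
    }
}
```

In the question's loop, the analogous change would be to build drugXML from the raw text of the fragment, so that finerParse receives the same bytes the first parser saw.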

vtd-xml-author

Works like a charm! By the way, is there an authoritative documentation for VTD-XML? I keep finding short tutorials, etc. but nothing comprehensive. And the Javadoc is extremely limited, which means that for many methods (e.g., if I want to know the difference between `toString()`, `toRawString()`, `toNormalizedString()` and `toNormalizedString2`), it is very difficult to figure out what's going on. – Chthonic Project Jan 08 '15 at 04:06
all documents on vtd-xml web site are authoritative... what kind of documents are you looking for? any suggestions? – vtd-xml-author Jan 08 '15 at 19:07
Sorry about my previous remark. I had forgotten to add the javadoc to my project, so everything was showing up empty. Corrected that, and life is perfect. VTD-XML has a slightly steeper learning curve than DOM-based parsers, but it's an amazing tool, especially if speed is a priority! Thank you for vtd-xml :-) – Chthonic Project Jan 08 '15 at 20:07
