I have a sample XML document:

<?xml version="1.0" encoding="UTF-8"?>
<tag_1>
    <tag_2>A</tag_2>
    <tag_3>B</tag_3>
    <tag_4>C</tag_4>
    <tag_5>D</tag_5>
</tag_1>

I want to extract only specific data from it.

For example:

tag_1/tag_5 -> D

Here tag_1/tag_5 is my data definition (the only data I want). It is dynamic: tomorrow my data definition might be tag_1/tag_4 instead.

In reality my XML is a large data set, and these XML payloads arrive at a rate of 50,000 to 80,000 per hour.

I would like to know whether there is already a high-performance XML reader, or some special logic I can implement, that extracts data according to a data definition.

Currently I have an implementation using a StAX parser, but it takes nearly a day to parse 80,000 XML documents.

Update: I have since implemented it with VTD-XML:

public class VTDParser {

    private final Logger LOG = LoggerFactory.getLogger(VTDParser.class);

    private final VTDGen vg;

    public VTDParser() {
        vg = new VTDGen();
    }

    public String parse(final String data, final String xpath) {
        vg.setDoc(data.getBytes());
        try {
            vg.parse(true);
        } catch (final ParseException e) {
            LOG.error(e.toString());
        }

        final VTDNav vn = vg.getNav();
        final AutoPilot ap = new AutoPilot(vn);
        try {
            ap.selectXPath(xpath);
        } catch (final XPathParseException e) {
            LOG.error(e.toString());
        }

        try {
            while (ap.evalXPath() != -1) {
                final int val = vn.getText();
                if (val != -1) {
                    return vn.toNormalizedString(val);
                }
            }
        } catch (XPathEvalException | NavException e) {
            LOG.error(e.toString());
        }
        return null;
    }
}
Saurabh Kumar
  • Not sure why I got -1. Am I not clear? I am just looking for ideas, not asking somebody to implement it for me. – Saurabh Kumar Jan 09 '17 at 21:45
  • 50,000-80,000/hour is about 20 per second. If you only work single-threaded, that means 1/20th of a second per XML. If the XML files are very large, as you say, you'll never be able to parse one within 0.05 seconds, especially as there might be other overhead you probably cannot control (e.g. network/disk latency when reading the XML files). So to reach your goal, you first need to parallelize the work. Then probably think about putting the data into a database for easier querying, so you don't have to re-parse all documents when your query changes tomorrow. But a database needs planning too. – cello Jan 09 '17 at 21:53
  • Yes sir. Actually I finally implemented it using VTD-XML. I am also eager to hear your answer. – Saurabh Kumar Jan 12 '17 at 13:57
  • OK, I will submit a code snippet, stay tuned... – vtd-xml-author Jan 14 '17 at 08:51
  • @vtd-xml-author I posted the code. I see one issue: if I make only one instance of VTDParser and keep calling its parse method, then vg.getNav() ends up in some sort of exception. I cannot see which one, because I am using multithreading and the call is wrapped in a Future. Only the first call succeeds; all the rest end in some sort of exception. – Saurabh Kumar Jan 14 '17 at 22:09
  • How big are your XMLs on average? – vtd-xml-author Jan 15 '17 at 03:43
  • Hi. The XMLs can be small or very big (SAP IDoc). What I was trying to do was create one VTDParser per XML and have multiple consumer threads run XPath queries on that one parser. Since that is not working, I am creating a new VTDParser for every consumer thread, but I am not satisfied with it: if I have 100 XPaths, for example, I end up creating 100 instances of VTDParser. Is there any way to avoid this? Also, how expensive is it to do final VTDGen vg = new VTDGen();? – Saurabh Kumar Jan 15 '17 at 09:56
  • OK, I see that your requirements are not as simple as I originally expected... this is going to be a long correspondence. – vtd-xml-author Jan 17 '17 at 01:50
  • Do you know how to reuse an XPath expression? – vtd-xml-author Jan 17 '17 at 03:31
  • Hi. I posted my code above. Now a separate thread enters the parse method. Could you please tell me how to reuse the XPath and VTDGen in the code above? – Saurabh Kumar Jan 17 '17 at 10:10
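
The two recurring points in the comments above are parallelizing the work and not sharing one stateful parser between threads. A minimal sketch of that shape, using only the JDK's javax.xml.xpath rather than VTD-XML (so it shows the threading pattern, not the fastest parser), might look like the following. The class and method names (ParallelExtract, extract, extractAll) are made up for illustration. Each worker thread gets its own XPath object through a ThreadLocal, and calling Future.get() rethrows any failure from a worker so it is not silently lost.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of VTD-XML or of the code in the question.
final class ParallelExtract {

    // XPath objects are not thread-safe, so every worker thread keeps its own.
    private static final ThreadLocal<XPath> XPATH =
            ThreadLocal.withInitial(() -> XPathFactory.newInstance().newXPath());

    // Evaluates one data definition (an XPath) against one XML payload.
    static String extract(String xml, String path) throws Exception {
        return XPATH.get().evaluate(path, new InputSource(new StringReader(xml)));
    }

    // Runs the same data definition over many payloads on a fixed-size pool.
    static List<String> extractAll(List<String> docs, String path, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String doc : docs) {
                futures.add(pool.submit(() -> extract(doc, path)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                // get() rethrows a worker failure as ExecutionException,
                // so the underlying cause becomes visible.
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = Arrays.asList(
                "<tag_1><tag_4>C</tag_4><tag_5>D</tag_5></tag_1>",
                "<tag_1><tag_5>E</tag_5></tag_1>");
        System.out.println(extractAll(docs, "/tag_1/tag_5", 2));
    }
}
```

The same layout should carry over to other parsers: keep whatever is stateful (parser, compiled query) confined to one thread, and share only the immutable inputs.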

1 Answer

This is my modification of your code: it compiles the XPath once and reuses it many times. The XPath is compiled without binding to a VTDNav instance, and resetXPath is called before the parse method exits. I have not shown how to pre-index the XML documents with VTD to avoid repetitive parsing, though I suspect that might be the difference-maker for your project. Here is a paper on the capabilities of vtd-xml:

http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf

import com.ximpleware.*;
import java.nio.charset.StandardCharsets;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class VTDParser {
    private static final Logger LOG = LoggerFactory.getLogger(VTDParser.class);

    private final VTDGen vg;
    private final AutoPilot ap;

    public VTDParser(final String xpath) throws VTDException {
        vg = new VTDGen();
        ap = new AutoPilot();
        ap.selectXPath(xpath); // this is how you compile an XPath without binding it to an XML doc
    }

    // vg and ap are stateful, so use one VTDParser per thread rather than sharing one.
    public String parse(final String data) {
        vg.setDoc(data.getBytes(StandardCharsets.UTF_8));
        try {
            vg.parse(true);
        } catch (final ParseException e) {
            LOG.error(e.toString());
            return null; // no usable VTDNav after a failed parse
        }

        final VTDNav vn = vg.getNav();
        ap.bind(vn); // bind the precompiled XPath to this document
        try {
            while (ap.evalXPath() != -1) {
                final int val = vn.getText();
                if (val != -1) {
                    return vn.toNormalizedString(val);
                }
            }
        } catch (XPathEvalException | NavException e) {
            LOG.error(e.toString());
        } finally {
            ap.resetXPath(); // reset the XPath here so it can be reused, even after an early return
        }
        return null;
    }
}
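
On the concern about needing one parser instance per XPath: the usual way around it is to parse each document once and then evaluate every data definition against that single parsed result. The sketch below shows that shape using the JDK's DOM and javax.xml.xpath as a stand-in, not VTD-XML, and the names MultiPathExtract and extract are invented for illustration. With VTD-XML the analogous layout would presumably be one VTDNav per document, with each precompiled AutoPilot bound to it in turn; that part is an assumption, not something the sketch demonstrates.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Hypothetical helper: one parse per document, many data definitions.
final class MultiPathExtract {

    static Map<String, String> extract(String xml, List<String> paths) throws Exception {
        // Parse the payload exactly once.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> values = new LinkedHashMap<>();
        for (String path : paths) {
            // Compiled here for brevity; in a long-running service the compiled
            // expressions could be cached per thread and reused across documents.
            XPathExpression expr = xpath.compile(path);
            values.put(path, expr.evaluate(doc));
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<tag_1><tag_2>A</tag_2><tag_5>D</tag_5></tag_1>";
        System.out.println(extract(xml, Arrays.asList("/tag_1/tag_2", "/tag_1/tag_5")));
    }
}
```

With this layout the number of parser objects scales with the number of worker threads, not with the number of XPaths.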
vtd-xml-author