I need to performance-test the VTD-XML library, not just for simple parsing but with an additional transformation during parsing. I have a 30MB input XML file that I transform into another XML document with custom logic. I want to remove everything on my side that slows the process down (that is, anything caused by poor use of the VTD library). I searched for optimization tips but could not find any. Here is what I have noticed and what I would like to ask:

0. Which is better to use for selection: selectXPath or selectElement?

  1. Parsing without namespace awareness is much faster.

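    File file = new File(fileName);
    VTDGen vtdGen = new VTDGen();
    // Pass the actual file contents (java.nio.file.Files); an empty array of the right size holds no XML.
    vtdGen.setDoc_BR(Files.readAllBytes(file.toPath()));
    vtdGen.parse(false); // false = namespace awareness off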
    
  2. Read the file into a byte array myself, or pass the file name to VTDGen?

    final VTDGen vg = new VTDGen();
    vg.parseFile("books.xml", false);
    

or

    // open a file and read the content into a byte array
    File f = new File("books.xml");
    FileInputStream fis = new FileInputStream(f);
    byte[] b = new byte[(int) f.length()];
    fis.read(b);

    VTDGen vg = new VTDGen();
    vg.setDoc(b);
    vg.parse(true);
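
One caveat about reading the file by hand: a single `InputStream.read(byte[])` call is not guaranteed to fill the whole array, and the `(int)` cast on the file length silently overflows for files larger than a Java array can hold. A minimal sketch of a safer read, using only the JDK (the `XmlBytes` class and its `readFully` method are hypothetical names, not part of VTD-XML):

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of VTD-XML.
final class XmlBytes {
    /** Reads the whole file into a byte array, failing loudly instead of truncating. */
    static byte[] readFully(File f) {
        long length = f.length();
        // Java arrays are int-indexed, so a larger file cannot fit in one byte[].
        if (length > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("File too large for a byte[]: " + length + " bytes");
        }
        byte[] b = new byte[(int) length];
        try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
            in.readFully(b); // keeps reading until the array is full, or throws EOFException
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return b;
    }
}
```

The result could then be handed to the parser as in the snippet above, for example `vg.setDoc(XmlBytes.readFully(new File("books.xml")))`.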

The second approach was about 0.01 times faster, which could be caused by anything.
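
A difference that small is hard to tell apart from JIT warm-up and garbage-collection noise when each variant is timed only once. A rough sketch of a fairer comparison, assuming each parsing variant is wrapped in a `Runnable` (the `ParseTimer` class is a hypothetical helper, not something VTD-XML provides):

```java
import java.util.Arrays;

// Hypothetical helper, not part of VTD-XML: times a task repeatedly and reports the median.
final class ParseTimer {
    /** Runs the task `warmups` times untimed, then `runs` times timed; returns the median in milliseconds. */
    static double medianMillis(Runnable task, int warmups, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        for (int i = 0; i < warmups; i++) {
            task.run(); // give the JIT a chance to compile the hot paths
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        long median = nanos[runs / 2]; // upper median when runs is even
        return median / 1_000_000.0;
    }
}
```

Timing both variants with the same warm-up and run counts, and comparing medians rather than single runs, gives a better idea of whether the gap is real. For anything more rigorous, a dedicated benchmarking harness is the safer choice.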

What is the difference? With parseFile the file is limited to 2GB with namespace awareness on and 1GB without, but what is the limit for the byte-array approach?

  3. Reuse buffers

You can ask VTDGen to reuse VTD buffers for the next parsing task. Otherwise, by default, VTDGen will allocate a new buffer for each parsing run.

Can you give an example for that?

  4. Adjust LC level to 5

By default, it is 3, but you can set it to 5. When your XML is deeply nested, setting the LC level to 5 results in better XPath performance, but it increases memory usage and parsing time very slightly.

    VTDGen vg = new VTDGen();
    vg.selectLcDepth(5);

But I get a runtime exception; it only works with 3.

  5. Indexing

Use VTD+XML indexing: instead of parsing XML files at the time of processing a request, you can pre-index your XML into VTD+XML format and dump it to disk. When the processing request commences, simply load the VTD+XML file into memory and voila, parsing is no longer needed!

    VTDGen vg = new VTDGen();
    if (vg.parseFile(inputName, true)) {
        vg.writeIndex(new FileOutputStream(outputName));
    }

Does anyone know how to use it? What happens if the file changes, and how do I trigger re-indexing? And if there is a 10KB change in a 3GB file, does the whole file have to be parsed again, or only the changed lines?
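
On the re-indexing trigger: nothing quoted here says the library tracks changes to the source file, so one simple option, assuming the application manages the index file itself, is to compare modification times and rebuild when the XML is newer than its index. A minimal sketch using only the JDK (`IndexFreshness` and `needsReindex` are hypothetical names, not part of VTD-XML):

```java
import java.io.File;

// Hypothetical helper, not part of VTD-XML.
final class IndexFreshness {
    /** Returns true when the index file is missing or older than the XML it was built from. */
    static boolean needsReindex(File xml, File index) {
        return !index.exists() || index.lastModified() < xml.lastModified();
    }
}
```

When `needsReindex` returns true, the application would run the parse-and-`writeIndex` step shown above again; otherwise it can load the existing index. Timestamps can be fooled by clock changes or file copies, so a content hash is a stricter alternative if that matters.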

  6. Overwrite feature

The overwrite feature, also known as data templating: because VTD-XML retains XML in memory as is, you can create a template XML file (pre-indexed in VTD+XML) whose value fields are left blank and let your app fill in the blanks, thus creating XML data that never needs to be parsed.

Xelian

1 Answer

I think you should look at the examples bundled with the vtd-xml release and build up expertise gradually. Fortunately, vtd-xml is, in my view, one of the easiest XML APIs by a large margin, so the learning curve won't be as steep as with SAX or StAX.

My answers to your numbered list above:

  1. selectXPath is for XPath evaluation. selectElement is similar to DOM's getElementsByTagName().

  2. Turning on namespace awareness has little to no effect on parsing performance. Can you cite the source of your 100x slowdown claim?

  3. You can read from bytes or read from files directly. Here is a link to a blog post:

    https://ximpleware.wordpress.com/2016/06/02/parsefile-vs-parse-a-quick-comparison/

3. Buffer reuse is a somewhat advanced feature; let's get to that at a later time.

4. If you get the latest version (2.13), you will not get a runtime exception with that method call.

  1. To parse an XML document larger than 2GB, you need to switch to the extended edition of vtd-xml, which is a separate API bundled with standard vtd-xml.

  2. There are examples bundled with the vtd-xml distribution that you might want to look at first. Here is an article on this subject: http://www.codeproject.com/Articles/24663/Index-XML-Documents-with-VTD-XML

vtd-xml-author