
I am evaluating vtd-xml as a possible solution for a large data migration project. The input data is in XML format, and if vtd-xml is viable it would save a lot of dev time. I ran the example Process Huge XML Documents (Bigger than 2GB) from the vtd-xml website: http://vtd-xml.sourceforge.net/codeSample/cs12.html.

I successfully process a 500 MB file but get the dreaded java.lang.OutOfMemoryError: Java heap space with a 4 GB file.

  1. JVM Arguments: -Xmn100M -Xms500M -Xmx2048M.
  2. JVM Arguments: -Xmn100M -Xms500M -Xmx4096M.

And with Maven:

  1. set MAVEN_OPTS=-Xmn100M -Xms500M -Xmx2048M
  2. set MAVEN_OPTS=-Xmn100M -Xms500M -Xmx4096M

NOTE: I have tested it with various combinations of the JVM arguments.

I have studied the vtd-xml site and API docs and browsed numerous questions here and elsewhere. All the answers point to setting the JVM memory higher or adding more physical memory. The vtd-xml website refers to memory usage of 1.3x-1.5x the XML file size, but on a 64-bit system one should be able to process much larger files than the available memory. Surely it would also not be feasible to add 64 GB of memory just to process a 35 GB XML file.

Environment:

Windows 7 64-bit, 6 GB RAM (closed all other apps; 85% memory available)

java version "1.7.0_09"

Java(TM) SE Runtime Environment (build 1.7.0_09-b05)

Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode)

Eclipse Indigo

Maven 2

Running the example from both Eclipse and Maven throws the Out of memory exception.

Example code:

 import com.ximpleware.extended.VTDGenHuge;
 import com.ximpleware.extended.VTDNavHuge;
 import com.ximpleware.extended.XMLMemMappedBuffer;

 public class App {

     /* first_read is the longer version of loading the XML file */
     public static void first_read() throws Exception {
         XMLMemMappedBuffer xb = new XMLMemMappedBuffer();
         VTDGenHuge vg = new VTDGenHuge();
         xb.readFile("C:\\Temp\\partial_dbdump.xml");
         vg.setDoc(xb);
         vg.parse(true);
         VTDNavHuge vn = vg.getNav();
         System.out.println("text data ===>" + vn.toString(vn.getText()));
     }

     /* second_read is the shorter version of loading the XML file */
     public static void second_read() throws Exception {
         VTDGenHuge vg = new VTDGenHuge();
         if (vg.parseFile("C:\\Temp\\partial_dbdump.xml", true, VTDGenHuge.MEM_MAPPED)) {
             VTDNavHuge vn = vg.getNav();
             System.out.println("text data ===>" + vn.toString(vn.getText()));
         }
     }

     public static void main(String[] s) throws Exception {
         first_read();
         //second_read();
     }
 }

Error:

 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
     at com.ximpleware.extended.FastLongBuffer.append(FastLongBuffer.java:209)
     at com.ximpleware.extended.VTDGenHuge.writeVTD(VTDGenHuge.java:3389)
     at com.ximpleware.extended.VTDGenHuge.parse(VTDGenHuge.java:1653)
     at com.epiuse.dbload.App.first_read(App.java:14)
     at com.epiuse.dbload.App.main(App.java:29)

Any help would be appreciated.

user1829870
  • Do you have the actual file available? I would be glad to test it on my end – vtd-xml-author Nov 16 '12 at 20:15
  • Thanks. The file unfortunately contains client data and due to an NDA I am unable to share it; however, the 4 GB file is generated by a standard sequenced DbUnit export. The second read is configured with memory mapping, but the app fails on the parseFile method. Appreciate your help. – user1829870 Nov 16 '12 at 23:07
  • Does memmap work with your 35 GB file at the moment, after increasing the heap to, say, 20 GB? – vtd-xml-author Nov 17 '12 at 00:51
  • What is line 14? To me it looks like you are converting the whole file to a String and dumping it to `System.out`, which means having the whole file in memory, probably even twice due to first `getText` and then a `toString`. Not to mention everything else. Trying to load such a large file in one go is probably something you don't want to do. – M. Deinum May 02 '16 at 06:55

2 Answers


You are telling Java it has a maximum heap size of 2GB and then asking it to process an XML file that is 4GB big.

To have a chance of this working, you need to define a maximum heap that is larger than the size of the file you are trying to process, or else change the processing mechanism to one that doesn't need the whole file in memory at the same time.
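
If a pure streaming pass over the data is enough for the migration (no random access needed), a StAX reader keeps heap usage flat regardless of file size. This is only a minimal sketch of that alternative, reusing the file path from the question; the "record" element name is a placeholder:

 import java.io.FileInputStream;
 import javax.xml.stream.XMLInputFactory;
 import javax.xml.stream.XMLStreamConstants;
 import javax.xml.stream.XMLStreamReader;

 public class StreamingSketch {
     public static void main(String[] args) throws Exception {
         XMLInputFactory factory = XMLInputFactory.newInstance();
         try (FileInputStream in = new FileInputStream("C:\\Temp\\partial_dbdump.xml")) {
             XMLStreamReader reader = factory.createXMLStreamReader(in);
             long count = 0;
             while (reader.hasNext()) {
                 // react only to start tags of the element we care about
                 if (reader.next() == XMLStreamConstants.START_ELEMENT
                         && "record".equals(reader.getLocalName())) { // "record" is a placeholder name
                     count++; // handle one record here, then let it be garbage collected
                 }
             }
             reader.close();
             System.out.println("records seen: " + count);
         }
     }
 }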

From their web site,

The world's most memory-efficient (1.3x~1.5x the size of an XML document) random-access XML parser.

This means that for a 4GB file you need around 6GB max heap size, assuming your app doesn't need memory for anything else.

Try these JVM arguments:

-Xmn100M -Xms2G -Xmx6G
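
For example, applied when launching the class from the stack trace directly (the classpath below is only illustrative; adjust it to however the example is built), or via the MAVEN_OPTS variable the question already uses:

 java -Xmn100M -Xms2G -Xmx6G -cp target\classes com.epiuse.dbload.App

 set MAVEN_OPTS=-Xmn100M -Xms2G -Xmx6G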

You might still run out of memory, but at least now you have a chance.

Oh yes - and you might find that Java now fails to start because the OS can't give it the memory it asks for. If that happens, you need a machine with more RAM (or maybe a better OS).

Bill Michell
  • Thanks Bill. Ximpleware claims support for processing files up to 256 GB. Does that mean one would need 400 GB+ of RAM? Is this a real-world scenario? They also state that it could be possible to process much larger files than the available memory. I believe the OutOfMemoryError happens in the parseFile method, which is internal to vtd-xml, and replacing it would probably be a much bigger task than intended for the project. – user1829870 Nov 16 '12 at 18:32
  • I'm not an expert on this framework, but systems exist that have this much RAM, either as real or as virtual memory. – Bill Michell Nov 16 '12 at 18:38
  • It would need more or less the same amount of RAM as the file size. While the file can be mem-mapped, the location cache and other info is kept in memory. If you have lots of memory, enough to hold everything, then load everything in memory, as it makes your app faster. – vtd-xml-author Nov 16 '12 at 19:29
  • That's really unfortunate. We would eventually have to load a 35 GB dataset, and adding that amount of physical memory would not be feasible in my case. I don't want to take away from the really great work you guys did with vtd-xml, and I hope in the future you would consider a parser that does not need the whole document in memory; even if performance is impacted slightly, it would be made up for by the great XPath access you already have. – user1829870 Nov 16 '12 at 23:16
  • One more thing: adding 35 GB is not that crazy nowadays, especially in an enterprise environment... we have worked with use cases where 64 GB is available and vtd-xml works fine. – vtd-xml-author Nov 17 '12 at 00:56
  • Thanks Bill, vtd-xml-author. I agree; however, in our specific case this is a one-off data migration for a web app that requires less than 1 GB of memory, and I have 12 GB at my disposal, but it would not be feasible to add 20 GB+ of RAM for a one-off process we won't need in the future. As per suggestions on other similar questions, we are going to split the XML data file into 1 GB chunks with a separate index file of the files created and other location-based information. The process would be a bit slower, but hopefully negligibly so. – user1829870 Nov 19 '12 at 06:52

You must be using extended vtd-xml for your loading... standard vtd-xml only supports document loading up to 2 GB, while extended vtd-xml supports documents up to 256 GB in size. It also enables lazy loading (i.e. memory mapping). You don't lose the comfort and efficiency of XPath at all.
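
To illustrate that last point, here is a minimal sketch of XPath access over the extended API, assuming AutoPilotHuge in com.ximpleware.extended mirrors the standard AutoPilot class; the //record expression is a placeholder and the file path is taken from the question:

 import com.ximpleware.extended.AutoPilotHuge;
 import com.ximpleware.extended.VTDGenHuge;
 import com.ximpleware.extended.VTDNavHuge;

 public class XPathHugeSketch {
     public static void main(String[] args) throws Exception {
         VTDGenHuge vg = new VTDGenHuge();
         // memory-mapped parse, as in second_read() from the question
         if (vg.parseFile("C:\\Temp\\partial_dbdump.xml", true, VTDGenHuge.MEM_MAPPED)) {
             VTDNavHuge vn = vg.getNav();
             AutoPilotHuge ap = new AutoPilotHuge(vn);
             ap.selectXPath("//record");             // placeholder XPath expression
             int i;
             while ((i = ap.evalXPath()) != -1) {    // visit every matching element
                 System.out.println(vn.toString(i)); // prints the element name at each hit
             }
         }
     }
 }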

vtd-xml-author