
I have gone through the posts on validating huge XML files, but all of them talk about file sizes of at most 250 MB.

  1. The file size is approximately 10 GB.
  2. I currently have a tasklet that uses XmlValidator to validate the XML file against the XSD.

Problem statement: Validating such a huge XML file loads the entire file into memory, so I am getting an OutOfMemoryException. Is there any way to validate the XML in a streaming fashion? I don't want to load the whole file into memory while validating.

Thanks in advance.

Xstian
Jay
  • A schema applies only to a whole document. I don't think it is possible to do that step by step – hek2mgl Dec 16 '15 at 08:56
  • You mean that if I have to validate against the schema, I have to load the whole document, right? – Jay Dec 16 '15 at 08:59
  • Yes, that's what I mean. – hek2mgl Dec 16 '15 at 09:17
  • There's the `javax.xml.validation.ValidatorHandler` class for validation. The second example at http://www.programcreek.com/java-api-examples/index.php?api=javax.xml.validation.ValidatorHandler shows the technique. As far as I can see in the example, the validation happens while the XML is being read. Perhaps this is what you need. – Mark Shevchenko Dec 16 '15 at 09:27
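To illustrate the `ValidatorHandler` suggestion in the last comment, here is a minimal sketch (not the code from the linked page) that places a `ValidatorHandler` between a namespace-aware SAX parser and the rest of the pipeline, so validation happens as events stream past. The class name `SaxValidatorHandlerSketch` and the method `countErrors` are made up for this example, and the sketch only counts schema errors rather than reporting them.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Source;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named for illustration only.
final class SaxValidatorHandlerSketch {

    /** Parses xml with SAX and returns the number of schema errors reported. */
    static int countErrors(Source xsd, InputSource xml) throws Exception {
        SchemaFactory schemaFactory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = schemaFactory.newSchema(xsd);

        // The ValidatorHandler is a SAX ContentHandler that validates each event it receives.
        ValidatorHandler validatorHandler = schema.newValidatorHandler();
        int[] errorCount = {0};
        validatorHandler.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errorCount[0]++; // count the error and keep going
            }
        });
        // Optionally chain a downstream handler that processes the validated events:
        // validatorHandler.setContentHandler(myBusinessHandler);

        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true); // ValidatorHandler needs namespace-aware events
        XMLReader reader = parserFactory.newSAXParser().getXMLReader();
        reader.setContentHandler(validatorHandler);
        reader.parse(xml); // throws SAXParseException if the document is not well-formed
        return errorCount[0];
    }
}
```

The point of this arrangement is that a downstream `ContentHandler` can be attached to the `ValidatorHandler`, so validation and processing can share a single pass over the file.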

1 Answer

Usually, large files contain the same structure repeated thousands of times, and each instance of the structure is independent of the others. Sometimes there is a header and/or a footer. A validating XML parser can read a single instance of the repeating element and validate it without needing to look at the preceding or following elements.

So there is no reason why you should not be able to validate while streaming. The XML parser that ships with IBM Java definitely can (I have used it myself).

You have not told us which language you are using, so it's hard to be any more specific than that.
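For Java (which the comments indicate is the language here), a minimal sketch of streaming validation with the standard `javax.xml.validation` API could look like the following. It assumes that passing a `StreamSource` to `Validator.validate` lets the JDK's bundled implementation read the document incrementally instead of building a tree, which is my understanding but worth confirming against a large test file. The class name `StreamingXsdValidator` is hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class, named for illustration only.
final class StreamingXsdValidator {

    /** Validates xml against xsd; returns the schema error messages (empty if valid). */
    static List<String> validate(Source xsd, Source xml) throws SAXException, IOException {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(xsd);
        Validator validator = schema.newValidator();

        List<String> errors = new ArrayList<>();
        validator.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException e) {
                // warnings are ignored in this sketch
            }

            @Override
            public void error(SAXParseException e) {
                // record the schema violation and let validation continue
                errors.add("line " + e.getLineNumber() + ": " + e.getMessage());
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                throw e; // document is not well-formed, parsing cannot continue
            }
        });

        validator.validate(xml);
        return errors;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: StreamingXsdValidator <schema.xsd> <data.xml>");
            return;
        }
        List<String> errors = validate(new StreamSource(new File(args[0])),
                                       new StreamSource(new File(args[1])));
        errors.forEach(System.out::println);
        System.out.println(errors.isEmpty() ? "valid" : errors.size() + " error(s)");
    }
}
```

Two caveats on this sketch: the error list grows with every violation, so for a very large and very broken file it may be better to stop after a fixed number of errors, and schemas that use identity constraints such as `xs:key` or `xs:unique` may force the validator to keep more state in memory.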

kimbert
  • My application is written in Java. It uses the Spring Batch framework; for the reader I am using StaxEventItemReader, and JAXB for unmarshalling. Validation is a tasklet in my job. – Jay Dec 16 '15 at 09:33
  • In that case, I suggest that you go ahead and enable validation on the parser. – kimbert Dec 16 '15 at 10:03
  • I honestly do not know whether StaxEventItemReader can read the XML with validation enabled. The documentation is not particularly clear. – kimbert Dec 16 '15 at 14:22
  • StaxEventItemReader expects an unmarshaller property; there I am specifying a schema property and passing the XSD as its value. This should happen in the chunk, I believe. Please correct me if I am wrong. – Jay Dec 17 '15 at 07:32
  • I'm not qualified to comment on how to use StaxEventItemReader. The question was about validating huge XML documents in streaming mode. If you need help with configuring StaxEventItemReader for validation then maybe a separate question would be appropriate? – kimbert Dec 17 '15 at 08:40