
I am working with very large XML files (hundreds of MBs). The tree is fairly simple:

<items>
  <item>
    <column1>ABC</column1>
    <column2>DEF</column2>
  </item>
  <item>
    <column1>GHI</column1>
    <column2>KLM</column2>
  </item>
</items>

I need to parse this document and remove some <item> elements. So far, the best performance I have achieved is with XmlReader: caching each <item> in memory, then writing it back out with XmlWriter if it meets the criteria, and simply ignoring it if it doesn't. Is there anything I can do to make it faster?
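For reference, the approach described above — buffer one <item> subtree at a time and re-emit only the keepers — might be sketched roughly like this (the `Filter` method name and the `column1`-based predicate are illustrative, not part of my actual code):

```csharp
using System;
using System.IO;
using System.Xml;

static class ItemFilter
{
    // Streams <item> elements from input to output, keeping only those
    // for which the caller's predicate returns true. Only one <item>
    // subtree is ever held in memory at a time, never the whole document.
    public static void Filter(TextReader input, TextWriter output, Func<string, bool> keep)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = true };
        using (var reader = XmlReader.Create(input, settings))
        using (var writer = XmlWriter.Create(output))
        {
            writer.WriteStartElement("items");
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
                {
                    // Buffer just this <item> subtree as a tiny XmlDocument.
                    var doc = new XmlDocument();
                    using (var subtree = reader.ReadSubtree())
                    {
                        doc.Load(subtree);
                    }
                    // Illustrative criterion: inspect <column1>'s text.
                    XmlNode col = doc.SelectSingleNode("item/column1");
                    if (col != null && keep(col.InnerText))
                    {
                        doc.DocumentElement.WriteTo(writer);
                    }
                }
            }
            writer.WriteEndElement();
        }
    }
}
```

When the subtree reader is disposed, the outer reader is left on the item's end element, so the main loop simply continues to the next sibling.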

PBG
  • Are you looking for a .Net solution? – womp Jan 11 '10 at 18:07
  • Yes, I am looking for .NET; sorry for not clarifying this. – PBG Jan 11 '10 at 18:14
  • What kind of performance are you seeing now and how much faster do you need it to be? Is this a one-time migration (i.e. iterate over all existing docs and remove "bad" data) or an ongoing operation (i.e. we receive these 100MB documents every N minutes and need to clean them up before using them)? – Mike Willekes Jan 11 '10 at 18:19
  • The document will come in once a day, but it's up to the user to specify the node-removal criteria. A 250MB document takes 30 seconds to process, and I would like it to be 10 times faster. Converting it to a different format (like JSON) is an alternative, though, assuming it will give us better performance. – PBG Jan 11 '10 at 18:43
  • 30 seconds, once per day doesn't seem like a high-value candidate for optimization... unless users are specifying their node-removal criteria and then complaining that 30 seconds is too long to wait for the "sanitized" XML document. – Mike Willekes Jan 11 '10 at 18:57
  • The problem is that the sanitized XML document is not the final stage. That data will later be JOINED with some other data, which is another step in the process. I do appreciate your input though. – PBG Jan 11 '10 at 19:08
  • Makes sense. If you are willing to delegate to a native app (C/C++) I think you could achieve that goal - but I'm not familiar enough with the .net XML parsing libraries to be of much help. Good luck! – Mike Willekes Jan 11 '10 at 19:57

3 Answers


You might be able to save a step by implementing a subclass of XmlReader whose Read method skips over the item elements you're not interested in. Right now you have two steps: reading and filtering the document with an XmlReader, then using an XmlWriter to write the result somewhere that you presumably read it back from. Subclassing XmlReader eliminates that second step: you use the subclassed XmlReader as the input to your XSLT transform or XmlDocument or whatever, and it never builds an intermediate representation of the filtered XML document.
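A sketch of this idea, assuming a delegating reader (the class name and predicate are mine). Note that the predicate here can only use what is visible at the start tag — element name and attributes — since deciding based on child content requires read-ahead, which is what the comments below discuss:

```csharp
using System;
using System.Xml;

// Wraps an inner XmlReader; Read() silently skips any element subtree
// the predicate rejects, so downstream consumers never see it.
class SkippingXmlReader : XmlReader
{
    private readonly XmlReader r;
    private readonly Func<XmlReader, bool> skip; // true => drop this element

    public SkippingXmlReader(XmlReader inner, Func<XmlReader, bool> skipElement)
    {
        r = inner;
        skip = skipElement;
    }

    public override bool Read()
    {
        if (!r.Read()) return false;
        // Keep skipping whole subtrees until we land on a node we keep.
        while (r.NodeType == XmlNodeType.Element && skip(r))
        {
            r.Skip(); // jumps past the entire subtree in one call
            if (r.ReadState != ReadState.Interactive) return false;
        }
        return true;
    }

    // Everything else just forwards to the wrapped reader.
    public override XmlNodeType NodeType => r.NodeType;
    public override string LocalName => r.LocalName;
    public override string NamespaceURI => r.NamespaceURI;
    public override string Prefix => r.Prefix;
    public override string Value => r.Value;
    public override int Depth => r.Depth;
    public override string BaseURI => r.BaseURI;
    public override bool IsEmptyElement => r.IsEmptyElement;
    public override int AttributeCount => r.AttributeCount;
    public override bool EOF => r.EOF;
    public override ReadState ReadState => r.ReadState;
    public override XmlNameTable NameTable => r.NameTable;
    public override string GetAttribute(string name) => r.GetAttribute(name);
    public override string GetAttribute(string name, string ns) => r.GetAttribute(name, ns);
    public override string GetAttribute(int i) => r.GetAttribute(i);
    public override bool MoveToAttribute(string name) => r.MoveToAttribute(name);
    public override bool MoveToAttribute(string name, string ns) => r.MoveToAttribute(name, ns);
    public override bool MoveToElement() => r.MoveToElement();
    public override bool MoveToFirstAttribute() => r.MoveToFirstAttribute();
    public override bool MoveToNextAttribute() => r.MoveToNextAttribute();
    public override bool ReadAttributeValue() => r.ReadAttributeValue();
    public override string LookupNamespace(string prefix) => r.LookupNamespace(prefix);
    public override void ResolveEntity() => r.ResolveEntity();
}
```

You would then hand a SkippingXmlReader directly to XmlDocument.Load, an XSLT transform, or whatever consumes the document next.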

Robert Rossney
  • This may work, but once I read forward, if my item is good, I'll need to move my "cursor" back to the start of the item. How do I do that? – PBG Jan 11 '10 at 18:35
  • Well, there are (at least) two ways. You can have your XmlReader check its Stream's CanSeek property at creation and throw an exception if it can't seek; then you know you can save the position in the Stream when you start parsing an element, and if the element's good you can parse it again. The better way is to build some kind of intermediate representation for each node - the XmlNodeType, Name, Value, etc. - and save it in a list. Then either throw the list away or update the XmlReader's properties from the next item in the list when Read is called. – Robert Rossney Jan 13 '10 at 02:04
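The "intermediate representation" idea from the comment above might look something like this (the type and method names are mine; only the properties needed for replay are captured):

```csharp
using System.Collections.Generic;
using System.Xml;

// Minimal per-node snapshot: enough to replay the subtree later
// instead of re-reading it from the stream.
class BufferedNode
{
    public XmlNodeType NodeType;
    public string Name;
    public string Value;
    public int Depth;
}

static class ItemBuffer
{
    // Assumes the reader is positioned on an <item> start element.
    // Records every node through the matching end element, leaving
    // the reader on that end element.
    public static List<BufferedNode> ReadItem(XmlReader reader)
    {
        var nodes = new List<BufferedNode>();
        int start = reader.Depth;
        while (true)
        {
            nodes.Add(new BufferedNode
            {
                NodeType = reader.NodeType,
                Name = reader.Name,
                Value = reader.Value,
                Depth = reader.Depth
            });
            if (reader.Depth == start &&
                (reader.NodeType == XmlNodeType.EndElement || reader.IsEmptyElement))
                break;
            if (!reader.Read()) break;
        }
        return nodes;
    }
}
```

If the item passes the criteria, the subclassed reader serves Read() calls from this list; if not, the list is simply discarded.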

You could use Perl or shell scripting to strip out the unwanted items, if you can write a quick regular expression to match them. That would avoid loading the whole thing into memory and writing it back out.

Matt

See if you can use XPath queries to determine what you do and don't want to read with the XmlDocument object. Look into the following methods of that class: SelectSingleNode(), which returns an XmlNode object, and SelectNodes(), which returns an XmlNodeList object. See if that helps.
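A minimal sketch of this approach (the method name and the column1-based XPath predicate are placeholders). Note that unlike the XmlReader approaches, XmlDocument loads the entire document into memory, which may be a problem at hundreds of MBs:

```csharp
using System.Collections.Generic;
using System.Xml;

static class XPathRemoval
{
    // Removes every /items/item whose <column1> equals the given value.
    public static string RemoveItems(string xml, string column1Value)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // Illustrative predicate; adjust to the user's actual criteria.
        string xpath = string.Format("/items/item[column1='{0}']", column1Value);

        // Snapshot the matches first: removing nodes while enumerating
        // a lazily evaluated XPath result can skip nodes.
        var toRemove = new List<XmlNode>();
        foreach (XmlNode item in doc.SelectNodes(xpath))
            toRemove.Add(item);

        foreach (XmlNode item in toRemove)
            item.ParentNode.RemoveChild(item);

        return doc.OuterXml;
    }
}
```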

kd.