5

Is there some way I can combine two XmlDocuments without holding the first in memory?

I have to cycle through a list of up to a hundred large (~300MB) XML files, appending up to 1000 nodes to each, and repeat the whole process several times (the new node list is cleared between passes to save memory). Currently I load the whole XmlDocument into memory before appending new nodes, which is not tenable.

What would you say is the best way to go about this? I have a few ideas but I'm not sure which is best:

  1. Never load the whole XmlDocument, instead using XmlReader and XmlWriter simultaneously to write to a temp file which is subsequently renamed.
  2. Make an XmlDocument for the new nodes only, and then manually write it to the existing file (i.e. file.WriteLine( "<node>\n" ))
  3. Something else?

Any help will be much appreciated.

Edit: Some more details in answer to some of the comments:

The program parses several large logs into XML, grouping into different files by source. Once the XML is written, a lightweight proprietary reader program gives reports on the data. The program only needs to run once a day, so it can be slow, but it runs on a server which performs other actions, mainly file compression and transfer, which cannot be affected too much.

A database would probably be easier, but the company isn't going to do this any time soon!

As is, the program runs on the dev machine using a few GB of memory at the most, but throws out-of-memory exceptions when run on the server.

Final Edit: The task is quite low-priority, which is why it would only cost extra to get a database (though I will look into mongo).

The file will only be appended to, and won't grow indefinitely - each final file is only for a day's worth of the log, and then new files are generated the following day.

I'll probably use the XmlReader/Writer method since it will be easiest to ensure XML validity, but I have taken all your comments/answers into consideration. I know that having XML files this large is not a particularly good solution, but it's what I'm limited to, so thanks for all the help given.

Overlord_Dave
  • 2
    I'd think number 1 is the way to go, but I have no practical experience working with large files like that. – Jeff Mercado Aug 03 '12 at 15:54
  • What is the end goal? I mean, what are you trying to achieve with it? – HatSoft Aug 03 '12 at 16:02
  • 1
    Can you give some more background on the problem? Perhaps switching to a database is a better solution. – eabraham Aug 03 '12 at 16:03
  • I've answered some of the comments in an edit. @JeffMercado This would work but could potentially be too processor heavy. – Overlord_Dave Aug 03 '12 at 16:18
  • 1
    I suggest doing most kinds of file manipulations to temp files and, if they succeed, do a `File.Replace` of the old file with the temp file. This saves your data if anything goes wrong during the manipulations. – Dour High Arch Aug 03 '12 at 16:25
  • 1
    If it's such a large data set (and assuming you can't use a database), then wouldn't a binary file be better? There's no need to read all that data into memory as text; dump it out to XML when you're done with the updates. Better still, why not use Mongo DB (no installation required, just download the binaries, integrates well with C#)? It just seems crazy to be working with a text representation of such large data sets. – McGarnagle Aug 03 '12 at 16:37
  • Does the 2nd option mean that new nodes are always added at the end of the XML, in the file-append sense? If yes, the 2nd is the best way to go. Otherwise the 1st is better. – Ankush Aug 03 '12 at 17:19
  • In response to the question "What would you say is the best way to go about this?" I think the answer is use a 300 MB database; not a 300 MB XML file. That will make things a lot easier for sure. – Dan Aug 03 '12 at 18:30
  • Do you actually process each of the XML files or do you just need to append a fixed set of nodes at the end, which are independent of the XML document's content? – lsoliveira Aug 03 '12 at 18:43
  • Are you truly just appending? If so, the files will grow indefinitely, which is important to keep in mind. Also, if you're always adding to the end then using a method that requires reading the whole file (such as an XmlReader) will get progressively slower over time. It might be best to combine #1 & #2 by constructing an XML fragment, then using file operations to insert it into the existing file. – Brian Reischl Aug 03 '12 at 18:54
  • How much is it costing your company for you to figure this out VS adding more memory to the server? – Chuck Savage Aug 03 '12 at 20:02

1 Answer

2

If you wish to be completely certain of the XML structure, using XmlWriter and XmlReader is the best way to go.
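As a rough illustration of that approach, here is a minimal sketch that streams an existing file through XmlReader into an XmlWriter on a temp file, writes the new nodes just before the root's end tag, and then swaps the files with `File.Replace` (as suggested in the comments). The names `XmlAppender`, `AppendToRoot` and `writeNewNodes` are made up for this example, and it assumes the new nodes belong directly under the root element. Anything before the root other than the XML declaration (comments, a DOCTYPE) is not carried over.

```csharp
using System;
using System.IO;
using System.Xml;

static class XmlAppender
{
    // Hypothetical helper: copies sourcePath to a temp file node by node,
    // lets the caller write extra nodes just before the root's end tag,
    // then replaces the original file with the temp file.
    public static void AppendToRoot(string sourcePath, Action<XmlWriter> writeNewNodes)
    {
        string tempPath = sourcePath + ".tmp"; // assumed temp-file naming

        using (XmlReader reader = XmlReader.Create(sourcePath))
        using (XmlWriter writer = XmlWriter.Create(tempPath))
        {
            // Skip the prolog and position the reader on the root element.
            reader.MoveToContent();
            writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
            writer.WriteAttributes(reader, false);

            if (!reader.IsEmptyElement)
            {
                reader.Read(); // move to the first child of the root
                // WriteNode copies the current node (with its subtree) and
                // advances the reader to the next sibling, so only one
                // node's worth of data is in flight at a time.
                while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
                {
                    writer.WriteNode(reader, false);
                }
            }

            writeNewNodes(writer);    // the appended nodes
            writer.WriteEndElement(); // close the root element
        }

        // Both streams are closed here, so the swap is safe.
        File.Replace(tempPath, sourcePath, null);
    }
}
```

A call might look like `XmlAppender.AppendToRoot("source.xml", w => { w.WriteStartElement("node"); w.WriteString("..."); w.WriteEndElement(); });`. Because the writer produces every tag itself, the output stays well-formed as long as the input was.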

However, for the highest possible performance, you may be able to do the same job with direct string and stream functions. You would lose the ability to verify the XML structure, though: if one file had an error you wouldn't be able to detect or correct it:

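// Needs: using System; using System.IO;
// files and max_performance are assumed to be defined elsewhere.
using (FileStream output = File.Create("out.xml"))
using (StreamWriter sw = new StreamWriter(output)) {
    foreach (string filename in files) {
        sw.Write(String.Format(@"<inputfile name=""{0}"">", filename));
        if (max_performance) {
            // Using .NET 4's Stream.CopyTo(); alternatively try http://bit.ly/RiovFX
            // CopyTo works on streams, not on readers/writers, so flush the
            // writer first to keep the output in order. The bytes are copied
            // as-is, including any BOM or XML declaration in the input.
            sw.Flush();
            using (FileStream input = File.OpenRead(filename)) {
                input.CopyTo(output);
            }
        } else {
            using (StreamReader sr = new StreamReader(filename)) {
                string line;
                while ((line = sr.ReadLine()) != null) {
                    // parse the line and make any modifications you want
                    sw.Write(line);
                    sw.Write("\n");
                }
            }
        }
        sw.Write("</inputfile>");
    }
}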

Depending on how your input XML files are structured, you might opt to remove the XML headers, maybe the document element, or a few other unnecessary structures. You could do that by parsing the file line by line.
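For instance, dropping the XML header while reading line by line could look roughly like the sketch below. The names `XmlLineFilter` and `WithoutXmlDeclaration` are invented for the example, and it assumes the declaration sits on a line of its own, which is common but not guaranteed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class XmlLineFilter
{
    // Hypothetical helper: yields every line from the reader except lines
    // that consist solely of an XML declaration (<?xml ... ?>).
    public static IEnumerable<string> WithoutXmlDeclaration(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string trimmed = line.Trim();
            // "<?xml " with a trailing space avoids matching other
            // processing instructions such as <?xml-stylesheet ...?>.
            bool isDeclaration =
                trimmed.StartsWith("<?xml ", StringComparison.Ordinal) &&
                trimmed.EndsWith("?>", StringComparison.Ordinal);
            if (!isDeclaration)
            {
                yield return line;
            }
        }
    }
}
```

In the line-by-line branch, the loop would then iterate over `XmlLineFilter.WithoutXmlDeclaration(sr)` instead of calling `ReadLine` directly.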

Ted Spence
  • thanks for the CopyStream link - even if I don't use it now I'm sure it will be useful in the future! – Overlord_Dave Aug 06 '12 at 15:09
  • Silly me - I forgot about the .NET 4 `CopyTo` method - explanation here http://msdn.microsoft.com/en-us/library/dd782932.aspx. It's now built in. – Ted Spence Aug 09 '12 at 19:47