I need to generate an XML file and I need to fit as much data into it as possible, BUT there is a file size limit. So I need to keep inserting data until something says no more. How do I figure out the XML file size without repeatedly writing it to a file?
-
What do you plan to do when the file size limit is nearing? Stop what you're doing and close all open tags, close the file and open a new one? You'd produce invalid XML in many, if not most cases. – John Saunders Feb 21 '10 at 22:23
-
Just out of curiosity: What will you do if you reach the file size limit and you cannot fit all of the XML into the file? If you stop you will have malformed XML in the file. Is this even acceptable? – Rune Feb 21 '10 at 22:24
-
@John, hah, looks like we had the same thought. Even the naïve approach of closing all open tags is going to be problematic if you have a hard file size limit. You'd have to track all open tags and continually calculate how many bytes are needed to safely close them all. You're entering a world of hurt here... – Aaronaught Feb 21 '10 at 22:25
-
@Rune: I would contend that malformed XML is *never* acceptable, no matter what the requirements seem to dictate. – Aaronaught Feb 21 '10 at 22:26
-
@Aaronaught: Well, if he makes the document span multiple files and only ever uses the content in a scenario where he opens up all the files and concatenates the content of each file to produce a valid document, then I guess it is OK? But it sure ain't pretty :-) – Rune Feb 21 '10 at 22:27
-
@Rune: no, because then the individual files are not valid XML documents. – John Saunders Feb 21 '10 at 22:35
-
@John: Agreed. But if you have 100K worth of XML but, for some reason, you can only store files no larger than 10K on whatever storage you have, you _have_ to make the document span 10 files. When you need the document, read all ten files, stitch their content together and do whatever you like with your now valid document. Of course you shouldn't distribute the files to third parties individually. Anyway, this is probably not relevant to the OP so I'm going to leave it at that :-) – Rune Feb 21 '10 at 22:52
-
Guys, I don't understand the problem. I need to insert a full set or have it fail and not insert any elements. When it doesn't insert, I can close the tags and have a valid document. The only hard part is knowing the current file size (including closing tags). – Feb 21 '10 at 23:14
3 Answers
I agree with John Saunders. Here's some code that does basically what he's talking about, but written as a wrapper around XmlSerializer rather than as a FileStream subclass, and using a MemoryStream as intermediate storage. It may be more effective to extend Stream, though.
    using System;
    using System.IO;
    using System.Xml;
    using System.Xml.Serialization;

    public class PartitionedXmlSerializer<TObj>
    {
        private readonly int _fileSizeLimit;

        public PartitionedXmlSerializer(int fileSizeLimit)
        {
            _fileSizeLimit = fileSizeLimit;
        }

        public void Serialize(string filenameBase, TObj obj)
        {
            using (var memoryStream = new MemoryStream())
            {
                // serialize the object into the memory stream
                using (var xmlWriter = XmlWriter.Create(memoryStream))
                    new XmlSerializer(typeof(TObj))
                        .Serialize(xmlWriter, obj);
                memoryStream.Seek(0, SeekOrigin.Begin);
                var extensionFormat = GetExtensionFormat(memoryStream.Length);
                // copy raw bytes (not chars) so the limit applies to the size on disk
                var buffer = new byte[_fileSizeLimit];
                // part numbers start at 1
                var i = 1;
                // split the stream into files
                int readLength;
                while ((readLength = memoryStream.Read(buffer, 0, _fileSizeLimit)) > 0)
                {
                    var filename
                        = Path.ChangeExtension(filenameBase,
                            string.Format(extensionFormat, i++));
                    using (var fileStream = File.Create(filename))
                        fileStream.Write(buffer, 0, readLength);
                }
            }
        }

        /// <summary>
        /// Gets a file extension format string with enough zero padding
        /// for the number of parts needed.
        /// </summary>
        /// <param name="fileLength">length of the serialized document in bytes</param>
        private string GetExtensionFormat(long fileLength)
        {
            // round up so that a partial final file is counted
            var numFiles = (fileLength + _fileSizeLimit - 1) / _fileSizeLimit;
            // one zero per decimal digit of the highest part number
            var zeros = new string('0', Math.Max(1, numFiles).ToString().Length);
            return string.Format("xml.part{{0:{0}}}", zeros);
        }
    }
To use it, you'd initialize it with the max file length and then serialize using the base file path and then the object.
    public class MyType
    {
        public int MyInt;
        public string MyString;
    }

    public void Test()
    {
        var myObj = new MyType { MyInt = 42,
                                 MyString = "hello there this is my string" };
        new PartitionedXmlSerializer<MyType>(2)
            .Serialize("myFilename", myObj);
    }
With a file size limit of 2, this particular example will generate an XML document partitioned into
myFilename.xml.part001
myFilename.xml.part002
myFilename.xml.part003
...
myFilename.xml.part110

-
I think everyone misunderstood what I meant, but your solution is definitely worth the read. – Feb 24 '10 at 02:02
In general, you cannot break XML documents at arbitrary locations, even if you close all open tags.
However, if what you need is to split an XML document over multiple files, each of no more than a certain size, then you should create your own subtype of the `Stream` class. This "`PartitionedFileStream`" class could write to a particular file, up to the size limit, then create a new file, and write to that file, up to the size limit, etc.
This would leave you with multiple files which, when concatenated, make up a valid XML document.
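A minimal sketch of such a stream, to make the idea concrete. It is not code from the answer: the part naming scheme (`.part001`, `.part002`, ...) and the `PartCount` property are assumptions made for illustration, and the stream is write-only and non-seekable.

```csharp
using System;
using System.IO;

// Hypothetical write-only stream that rolls over to a new file
// whenever the current file has reached maxBytesPerFile.
public class PartitionedFileStream : Stream
{
    private readonly string _basePath;
    private readonly long _maxBytesPerFile;
    private FileStream _current;
    private long _currentBytes;   // bytes written to the current part
    private long _totalBytes;     // bytes written across all parts
    private int _partCount;

    public PartitionedFileStream(string basePath, long maxBytesPerFile)
    {
        if (maxBytesPerFile <= 0)
            throw new ArgumentOutOfRangeException("maxBytesPerFile");
        _basePath = basePath;
        _maxBytesPerFile = maxBytesPerFile;
    }

    // Number of part files created so far.
    public int PartCount { get { return _partCount; } }

    public override void Write(byte[] buffer, int offset, int count)
    {
        while (count > 0)
        {
            if (_current == null || _currentBytes >= _maxBytesPerFile)
                OpenNextPart();
            // write only as much as still fits into the current part
            int chunk = (int)Math.Min(count, _maxBytesPerFile - _currentBytes);
            _current.Write(buffer, offset, chunk);
            offset += chunk;
            count -= chunk;
            _currentBytes += chunk;
            _totalBytes += chunk;
        }
    }

    private void OpenNextPart()
    {
        if (_current != null) _current.Dispose();
        _partCount++;
        _currentBytes = 0;
        // assumed naming scheme: <basePath>.part001, <basePath>.part002, ...
        _current = File.Create(string.Format("{0}.part{1:000}", _basePath, _partCount));
    }

    public override void Flush()
    {
        if (_current != null) _current.Flush();
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing && _current != null)
        {
            _current.Dispose();
            _current = null;
        }
        base.Dispose(disposing);
    }

    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { return _totalBytes; } }
    public override long Position
    {
        get { return _totalBytes; }
        set { throw new NotSupportedException(); }
    }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}
```

It could then be handed to `XmlWriter.Create(stream)` like any other writable stream. Note that the split happens at byte boundaries, so a part may end in the middle of a tag or even in the middle of a multi-byte character; only the concatenation of all parts is meaningful.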
In the general case, closing tags will not work. Consider an XML format that must contain one element A followed by one element B. If you closed the tags after writing element A, then you do not have a valid document - you need to have written element B.
However, in the specific case of a simple site map file, it may be possible to just close the tags.

-
I can only have one file. I am creating a sitemap. I'm considering only having the most recent X url elements to keep the size down. Not the best solution, but it's probably much easier than size counting. – Feb 21 '10 at 23:18
-
@acidzombie24: so why is there a file size limit? If your site is large, then your sitemap will be large. – John Saunders Feb 21 '10 at 23:51
-
Further to that, arbitrarily truncating a site map would only serve to make the site more difficult for search engines to index and probably result in lower rankings over time. Seems like a silly idea to me. – Aaronaught Feb 22 '10 at 00:07
-
John Saunders: A sitemap has a limit of 50K URLs and 10MB. @Aaronaught: You don't need to provide every link that ever existed, AFAIK. Just the current ones and the time. – Feb 22 '10 at 03:53
-
@acidzombie24: where does this limit come from? Google? If this is the limit, then don't make your site so large - break it into smaller sites, don't index the lower levels, whatever. But it makes no sense to break the sitemap at some arbitrary point. – John Saunders Feb 22 '10 at 04:11
-
Oh, I get it, it's a Google thing, SEO junk, the 50K is 50,000 distinct URLs. But I think if your site map is bigger than that, it's probably not a very well-designed site... either that or you're trying to include dynamic content in the sitemap, which is just insane. – Aaronaught Feb 22 '10 at 06:27
-
@Aaronaught: One of the reasons for sitemaps IS dynamic content, and IIRC SO has a huge sitemap of its last 50k questions. – Feb 24 '10 at 00:33
-
@acidzombie24: I bet they simply make no attempt to write more than 50k entries into that sitemap. – John Saunders Feb 24 '10 at 01:16
-
Based on my math, the text inside url and changefreq together must be < 208 bytes. My URLs are long. I hope sitemaps are still valid if URLs are redirected with a 301 (I hear redirects, not 301 specifically, are invalid/rejected). – Feb 24 '10 at 01:28
-
@acidzombie24: I'm not sure what you're responding to. I would still say, "so, don't write so many URLs". – John Saunders Feb 24 '10 at 01:31
You can ask the `XmlTextWriter` for its `BaseStream` and check its `Position`.
As the others pointed out, you may need to reserve some headroom to properly close the XML.
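A rough sketch of how that could look for a flat list of entries such as a sitemap. The helper names (`SizeLimitedWriter`, `WriteUrls`, `RenderEntry`) and the simplified `urlset`/`url`/`loc` layout without namespaces are assumptions for illustration, not part of this answer. `XmlTextWriter` buffers its output, so the sketch calls `Flush()` before reading `Position`, and it reserves the bytes of the closing tag as headroom.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class SizeLimitedWriter
{
    // Writes as many whole <url> entries as fit within maxBytes,
    // then closes the root element. Returns the number of entries written.
    // The writer is flushed but not closed, so the caller keeps ownership of output.
    public static int WriteUrls(Stream output, IEnumerable<string> urls, long maxBytes)
    {
        var utf8NoBom = new UTF8Encoding(false);
        // headroom needed to close the document properly
        long headroom = utf8NoBom.GetByteCount("</urlset>");

        var writer = new XmlTextWriter(output, utf8NoBom);
        writer.WriteStartElement("urlset");
        // writing content forces the pending start tag to be completed,
        // so Position reflects the whole "<urlset>" tag from here on
        writer.WriteWhitespace("\n");

        int written = 0;
        foreach (var url in urls)
        {
            // render the entry separately to learn its exact encoded size
            string entry = RenderEntry(url);
            writer.Flush(); // without this, Position lags behind the buffered output
            long projected = writer.BaseStream.Position
                             + utf8NoBom.GetByteCount(entry)
                             + headroom;
            if (projected > maxBytes)
                break; // this entry would not fit: stop before writing any of it
            writer.WriteRaw(entry);
            written++;
        }

        writer.WriteEndElement();
        writer.Flush();
        return written;
    }

    // Renders one escaped <url><loc>...</loc></url> fragment as a string.
    private static string RenderEntry(string url)
    {
        var text = new StringWriter();
        using (var fragment = new XmlTextWriter(text))
        {
            fragment.WriteStartElement("url");
            fragment.WriteElementString("loc", url);
            fragment.WriteEndElement();
        }
        return text.ToString();
    }
}
```

Because each entry is measured before anything is written, an entry either goes in completely or not at all, and the closing tag always fits. The sketch does not check that the limit is large enough for an empty document.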
-
In general, it will not be possible to properly close the XML. Just adding end tags for any open tags will not produce valid XML. There may be missing required elements. – John Saunders Feb 21 '10 at 22:36
-
I actually just tried this out and the base stream doesn't seem to be written to until you call `writer.Close();` - the stream position/length are always 0 in the VS2k8 debugger. – Jake Feb 21 '10 at 22:45
-
@John: true, unless the xml is very simple. The requirements smell like some kind of log file format to me, in which case it would work. – Feb 21 '10 at 23:08
-