
Environment: .NET 4.0

I have a task that transforms XML files with an XSLT stylesheet; here is my code:

public string TransformFileIntoTempFile(string xsltPath, 
    string xmlPath)
{
    var transform = new MvpXslTransform();
    transform.Load(xsltPath, new XsltSettings(true, false), 
        new XmlUrlResolver());

    string tempPath = Path.GetTempFileName();

    using (var writer = new StreamWriter(tempPath))
    {
        using (XmlReader reader = XmlReader.Create(xmlPath))
        {
            transform.Transform(new XmlInput(reader), null, 
                new XmlOutput(writer));
        }       
    }

    return tempPath;
}

I have X threads that can launch this task in parallel. Sometimes my input files are about 300 MB; sometimes they are only a few MB.

My problem: I get an OutOfMemoryException when my program tries to transform several big XML files at the same time.

How can I avoid these OutOfMemoryExceptions? My idea is to make a thread wait before executing the task until there is enough available memory, but I don't know how to do that. Or is there some other solution (like running my task in a separate process)?

Thanks

Tim Lloyd
remi bourgarel

4 Answers


I don't recommend blocking a thread. In the worst case, you'll just end up starving the task that could potentially free the memory you need, leading to deadlock or very poor performance in general.

Instead, I suggest you keep a work queue with priorities. Have the tasks from the queue scheduled fairly across a thread pool. Make sure no thread ever blocks on a wait operation; instead, repost the task to the queue (with a lower priority).

So what you'd do (e.g. on receiving an OutOfMemoryException) is post the same job/task onto the queue and terminate the current task, freeing up the thread for another task.

A simplistic approach is to use FIFO ordering, which ensures that a task reposted to the queue will have 'lower priority' than any other jobs already on that queue.
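A minimal sketch of the repost-on-failure idea, assuming hypothetical `WorkQueue` and `TransformJob` types (neither appears in the answer itself):

```csharp
// Hypothetical types: WorkQueue with a Repost method and TransformJob with
// an Execute method. On OutOfMemoryException the job goes back on the queue
// and the thread is freed for other work, instead of blocking on a memory wait.
public void RunJob(WorkQueue queue, TransformJob job)
{
    try
    {
        job.Execute();
    }
    catch (OutOfMemoryException)
    {
        // Reposted jobs land behind work already queued, giving them
        // effectively lower priority on the next pass.
        queue.Repost(job);
    }
}
```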

sehe

Since .NET Framework 4 we have an API for the good old memory-mapped files feature, which has been available for many years in the Win32 API, so now you can use it from managed code.

For your task, the "persisted memory-mapped files" option is a better fit. From MSDN:

Persisted files are memory-mapped files that are associated with a source file on a disk. When the last process has finished working with the file, the data is saved to the source file on the disk. These memory-mapped files are suitable for working with extremely large source files.

On the page describing the MemoryMappedFile.CreateFromFile() method you can find a nice example of creating memory-mapped views for an extremely large file.

EDIT: update regarding the notable points raised in the comments

I just found the method MemoryMappedFile.CreateViewStream(), which creates a stream of type MemoryMappedViewStream that inherits from System.IO.Stream. I believe you can create an XmlReader instance from this stream and then instantiate your custom implementation of XslTransform using this reader/stream.
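A sketch only of how the pieces could be wired together, reusing the method shape from the question (note the OP's test below: this does not stop the XSLT engine from buffering the whole document):

```csharp
// Sketch: feed the XSLT transform from a memory-mapped view stream instead
// of opening the file directly. Requires System.IO.MemoryMappedFiles (.NET 4).
public string TransformViaMemoryMap(string xsltPath, string xmlPath)
{
    var transform = new MvpXslTransform();
    transform.Load(xsltPath, new XsltSettings(true, false), new XmlUrlResolver());

    string tempPath = Path.GetTempFileName();

    using (var mmf = MemoryMappedFile.CreateFromFile(xmlPath, FileMode.Open))
    using (var stream = mmf.CreateViewStream())          // stream over the mapping
    using (XmlReader reader = XmlReader.Create(stream))  // reader over the stream
    using (var writer = new StreamWriter(tempPath))
    {
        transform.Transform(new XmlInput(reader), null, new XmlOutput(writer));
    }

    return tempPath;
}
```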

EDIT2: remi bourgarel (the OP) has already tested this approach, and it looks like this particular XslTransform implementation (I wonder whether ANY would) won't work with a memory-mapped view stream in the way that was supposed.

sll
  • And now you only have to convince Xslt to use it. – H H Oct 31 '11 at 11:25
  • Is it compatible with XSLT transformation? Doesn't .NET load the whole XML file before transforming? – remi bourgarel Oct 31 '11 at 11:26
  • @Henk Holterman: I believe this should not be a problem, but I could be wrong – sll Oct 31 '11 at 11:26
  • Re the Edit: The original question was already using streams. I don't think an MMF solves anything. – H H Oct 31 '11 at 11:44
  • Yep, but it is using regular streams, not MM streams like MemoryMappedViewStream – sll Oct 31 '11 at 11:59
  • Is there any evidence to say that the XML classes will not just use the streams to copy the data into memory? – Tim Lloyd Oct 31 '11 at 13:13
  • @chibacity: Really, I'm not sure whether an MM stream will give us any benefit vs a standard stream. This is a pointer which the OP can check himself and then give us more info about this new feature in action – sll Oct 31 '11 at 13:18
  • Is there no responsibility to provide tested solutions, or ones with at least some track record? :) – Tim Lloyd Oct 31 '11 at 13:28
  • @chibacity: Yeah, it's always good when someone gives you a complete solution with an example application and a set of test harnesses as well... But sometimes just a pointer in the right direction can be worth a lot... I wonder whether you always see answers with 100% tested solutions; I believe that for tricky questions the % of tested solutions goes down due to complexity. I posted the MM solution because I have used it in the past (via the Win32 API) but had not covered the XSLT transform angle, so the EDIT rightly says that I'm not sure about this approach and that it should be checked anyway – sll Oct 31 '11 at 13:39
  • @sll I'm not suggesting that a complete working solution needs to be provided; I'm asking whether you have used these technologies together yourself (e.g. some level of implementation/testing) and whether they work together as you are suggesting. The answers appear to be no, and not sure. – Tim Lloyd Oct 31 '11 at 13:45
  • @sll, I just tried your solution, and the XSLT transformation still loads the full document (I decompiled the XSLT transformation class and it uses XPathDocument, which loads the entire XML). – remi bourgarel Oct 31 '11 at 13:46
  • @chibacity: I must agree, all absolutely correct, and as remi just said, this implementation of XSLT transform won't work with an MM view stream – sll Oct 31 '11 at 14:04
  • @remi bourgarel: so it looks like the problem is in your particular XslTransform implementation? Could you rewrite or enhance it to use MemoryMappedViewStream? – sll Oct 31 '11 at 19:47
  • @sll, it's the case for all the XSLT implementations for .NET (MVP or not): XSLT needs to load the full XML document. And no, I can't rewrite an XSLT implementation; it doesn't look like a few hours' job to me. – remi bourgarel Nov 02 '11 at 08:08

The main problem is that you are loading the entire XML file. If you were to transform as you read, the out-of-memory problem should not normally appear. That said, I found an MS support article which suggests how it can be done: http://support.microsoft.com/kb/300934

Disclaimer: I did not test this, so if you use it and it works, please let us know.
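One hypothetical way to transform-as-you-read (not the KB article's exact code) is to stream the input with XmlReader and transform each record subtree on its own, so only one record is held in memory at a time. The record element name "item" below is an assumption, and this only works when the stylesheet can process records independently:

```csharp
// Sketch: transform one record subtree at a time instead of the whole file.
// "item" is an assumed record element; adjust to the actual document schema.
public void TransformRecordByRecord(MvpXslTransform transform,
    string xmlPath, string outPath)
{
    using (XmlReader reader = XmlReader.Create(xmlPath))
    using (var writer = new StreamWriter(outPath))
    {
        while (reader.ReadToFollowing("item"))
        {
            // ReadSubtree exposes just the current record to the transform.
            using (XmlReader fragment = reader.ReadSubtree())
            {
                transform.Transform(new XmlInput(fragment), null,
                    new XmlOutput(writer));
            }
        }
    }
}
```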

Dalibor Čarapić

You could consider using a queue to throttle how many concurrent transforms are done, based on some sort of artificial memory boundary, e.g. file size. Something like the following could be used.

This sort of throttling strategy can be combined with a maximum number of files being processed concurrently, to ensure your disk is not thrashed too much.

NB: I have not included the necessary try\catch\finally around execution to ensure that exceptions are propagated to the calling thread and WaitHandles are always released. I could go into further detail here.
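For illustration, a sketch of what that missing error handling could look like around the worker in the listing below (the `error` variable is an addition, not part of the answer's code):

```csharp
// Sketch: capture any exception in the worker, always signal the waiting
// caller, and rethrow on the calling thread after WaitOne().
Exception error = null;

Action transformTask = () =>
{
    try
    {
        // ... transform work as in the full listing below ...
    }
    catch (Exception ex)
    {
        error = ex;                 // capture for the calling thread
    }
    finally
    {
        transformedEvent.Set();     // always release the waiting caller
    }
};

// On the calling thread, after transformedEvent.WaitOne():
// if (error != null)
//     throw new InvalidOperationException("Transform failed", error);
```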

public static class QueuedXmlTransform
{
    private const int MaxBatchSizeMB = 300;
    private const double MB = (1024 * 1024);
    private static readonly object SyncObj = new object();
    private static readonly TaskQueue Tasks = new TaskQueue();
    private static readonly Action Join = () => { };
    private static double _CurrentBatchSizeMb;

    public static string Transform(string xsltPath, string xmlPath)
    {
        string tempPath = Path.GetTempFileName();

        using (AutoResetEvent transformedEvent = new AutoResetEvent(false))
        {
            Action transformTask = () =>
            {
                MvpXslTransform transform = new MvpXslTransform();

                transform.Load(xsltPath, new XsltSettings(true, false),
                    new XmlUrlResolver());

                using (StreamWriter writer = new StreamWriter(tempPath))
                using (XmlReader reader = XmlReader.Create(xmlPath))
                {
                    transform.Transform(new XmlInput(reader), null,
                        new XmlOutput(writer));
                }

                transformedEvent.Set();
            };

            double fileSizeMb = new FileInfo(xmlPath).Length / MB;

            lock (SyncObj)
            {
                if ((_CurrentBatchSizeMb += fileSizeMb) > MaxBatchSizeMB)
                {
                    _CurrentBatchSizeMb = fileSizeMb;

                    Tasks.Queue(isParallel: false, task: Join);
                }

                Tasks.Queue(isParallel: true, task: transformTask);
            }

            transformedEvent.WaitOne();
        }

        return tempPath;
    }

    private class TaskQueue
    {
        private readonly object _syncObj = new object();
        private readonly Queue<QTask> _tasks = new Queue<QTask>();
        private int _runningTaskCount;

        public void Queue(bool isParallel, Action task)
        {
            lock (_syncObj)
            {
                _tasks.Enqueue(new QTask { IsParallel = isParallel, Task = task });
            }

            ProcessTaskQueue();
        }

        private void ProcessTaskQueue()
        {
            lock (_syncObj)
            {
                if (_runningTaskCount != 0) return;

                while (_tasks.Count > 0 && _tasks.Peek().IsParallel)
                {
                    QTask parallelTask = _tasks.Dequeue();

                    QueueUserWorkItem(parallelTask);
                }

                if (_tasks.Count > 0 && _runningTaskCount == 0)
                {
                    QTask serialTask = _tasks.Dequeue();

                    QueueUserWorkItem(serialTask);
                }
            }
        }

        private void QueueUserWorkItem(QTask qTask)
        {
            Action completionTask = () =>
            {
                qTask.Task();

                OnTaskCompleted();
            };

            _runningTaskCount++;

            ThreadPool.QueueUserWorkItem(_ => completionTask());
        }

        private void OnTaskCompleted()
        {
            lock (_syncObj)
            {
                if (--_runningTaskCount == 0)
                {
                    ProcessTaskQueue();
                }
            }
        }

        private class QTask
        {
            public Action Task { get; set; }
            public bool IsParallel { get; set; }
        }
    }
}

Update

Fixed bug in maintaining batch size when rolling over to next batch window:

_CurrentBatchSizeMb = fileSizeMb;
Tim Lloyd