6

I have a file collection (3000 files) in a FileInfo[] array. I want to process all the files by applying some logic that is independent per file, so it can be executed in parallel.

 FileInfo[] fileInfoCollection = directory.GetFiles();
 Parallel.ForEach(fileInfoCollection, ProcessWorkerItem);

But after processing about 700 files I get an out-of-memory error. I used a thread pool before and it gave the same error. If I execute without threading (no parallel processing) it works fine.

In "ProcessWorkerItem" I am running an algorithm based on the string data of the file. Additionally I use log4net for logging and there are lot of communications with the SQL server in this method.

Some more info: the files are 1-2 KB XML files. I read each file, and the processing depends on its content: it identifies keywords in the string and generates output in another XML format. The keywords are stored in the SQL Server database (nearly 2000 words).

Jayantha Lal Sirisena

3 Answers

7

Well, what does ProcessWorkerItem do? You may be able to change that to use less memory (e.g. stream the data instead of loading it all in at once) or you may want to explicitly limit the degree of parallelism using this overload and ParallelOptions.MaxDegreeOfParallelism. Basically you want to avoid trying to process all 3000 files at once :) IIRC, Parallel Extensions will "notice" if your tasks appear to be IO bound, and allow more than the normal number to execute at once - which isn't really what you want here, as you're memory bound as well.
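For instance, a minimal sketch of capping the concurrency (the value 4 here is purely illustrative; the right number depends on what ProcessWorkerItem does and how much memory each file needs):

    FileInfo[] fileInfoCollection = directory.GetFiles();

    // Limit how many files are processed at once instead of letting the
    // scheduler keep adding threads because the work looks IO bound.
    var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
    Parallel.ForEach(fileInfoCollection, options, ProcessWorkerItem);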

Jon Skeet
  • Can you please give me some pointers for deciding MaxDegreeOfParallelism? How do I decide the best value? – Jayantha Lal Sirisena May 11 '11 at 09:18
  • 1
    @Jayantha: It entirely depends on what your ProcessWorkerItem method is doing. How many files do you think you can reasonably process at a time? – Jon Skeet May 11 '11 at 09:19
  • Logically the processing of those files is independent, so I could process any number of files :). But there are hardware limitations: I have a Core i5 processor with 4 GB RAM, and processing a single file takes 1 second on average. – Jayantha Lal Sirisena May 11 '11 at 09:29
  • 2
    @Jayantha: But you *still* haven't explained what you're doing with the files, or how big they are, or why you would run out of memory. If you're doing something that requires 1GB per file, then you're not going to be able to process more than two or three at a time... whereas if you're doing something trivial, you could potentially process hundreds. – Jon Skeet May 11 '11 at 09:32
  • I have added some info above. – Jayantha Lal Sirisena May 11 '11 at 09:34
  • I tested setting MaxDegreeOfParallelism = 10 and now I can get up to about 1500 files, then the same issue again. It does not release memory after the files have been processed. – Jayantha Lal Sirisena May 11 '11 at 10:12
  • @Jayantha: Perhaps you've just got a bug in your processing code then? – Jon Skeet May 11 '11 at 10:56
  • Then how do I identify this memory usage issue? Do I need something like a memory profiler to see the memory allocation? – Jayantha Lal Sirisena May 11 '11 at 11:19
  • 2
    @Jayantha: Well I'd start by getting rid of the parallelism. Can you run the whole thing in series with no problems? If so, it sounds like maybe you just need to reduce the degree of parallelism. If not, get a profiler out and examine what objects are hanging around when they shouldn't. – Jon Skeet May 11 '11 at 11:27
  • I can run it smoothly without parallelism. I will try it with a lower degree of parallelism. – Jayantha Lal Sirisena May 11 '11 at 11:34
2

If you're attempting operations on large files in parallel then it's feasible that you would run out of available memory.

Maybe consider trying out Rx (Reactive Extensions) and using its Throttle method to control/compose your processing?
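Note that Throttle in Rx is time-based (it suppresses values that arrive in rapid succession), so for bounding how many files are in flight at once a sketch along the following lines may be closer to what's wanted; it assumes the Rx assemblies are referenced and reuses ProcessWorkerItem from the question:

    // Sketch only: push the files through an observable pipeline and
    // cap concurrency with the maxConcurrent argument of Merge.
    fileInfoCollection
        .ToObservable()
        .Select(file => Observable.Start(() => ProcessWorkerItem(file)))
        .Merge(4)      // at most 4 files being processed at any time
        .Wait();       // block until the whole pipeline has completed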

Nathan
0

I found the bug that caused the memory leak: I was using the Unit of Work pattern with Entity Framework. In the unit of work I keep the context in a hash table, with the thread name as the key. When I use threading the hash table keeps growing, and that caused the memory leak. So I added an additional method to the unit of work that removes the entry from the hash table once a thread has finished its task.

    public static void DisposeUnitOfWork()
    {
        IUnitOfWork unitOfWork = GetUnitOfWork();

        if (unitOfWork != null)
        {
            // Dispose the context and drop this thread's entry so the
            // hash table no longer grows as files are processed.
            unitOfWork.Dispose();
            hashTable.Remove(Thread.CurrentThread.Name);
        }
    }
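One way to wire this up (a sketch only; it assumes DisposeUnitOfWork is reachable from the worker class) is to release the unit of work in a finally block inside each worker, so the hash table entry is removed even when processing a file throws:

    private void ProcessWorkerItem(FileInfo file)
    {
        try
        {
            // ... existing per-file processing ...
        }
        finally
        {
            // Remove this thread's context so the hash table cannot
            // keep growing across the 3000 files.
            DisposeUnitOfWork();
        }
    }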
Jayantha Lal Sirisena