This is a C#/VSTO program. I've been working on a data capture project. The scope is basically 'process Excel files sent by a variety of third party companies.' Practically, this means:
- Locate columns that contain the data I want through a search method.
- Grab data out of the workbooks
- Clean the data, run some calculations, etc
- Output cleaned data into new workbook
The program I have written works great for small-to-medium data sets, ~25 workbooks with a combined total of ~1,000 rows of relevant data. I'm grabbing 7 columns' worth of data out of these workbooks. One edge case I have, though, is that occasionally I will need to run a much larger data set, ~50 workbooks with a combined total of ~8,000 rows of relevant data (and possibly another ~2,000 rows of duplicate data that I also have to remove).
I am currently putting a list of the files through a `Parallel.ForEach` loop, inside of which I open a new `Excel.Application()` to process each file with multiple `ActiveSheet`s. The parallel process runs much faster on the smaller data set than going through each file sequentially. But on the larger data set, I seem to hit a wall.
I start getting the message `Microsoft Excel is waiting for another application to complete an OLE action`, and eventually it just fails. Switching back to a sequential `foreach` does allow the program to finish, but it just grinds along, going from 1-3 minutes for a parallel medium-sized data set to 20+ minutes for a sequential large data set. If I set `ParallelOptions.MaxDegreeOfParallelism` to 10, it will complete the cycle, but still takes 15 minutes; if I set it to 15, it fails. I also really don't like messing with TPL settings if I don't have to. I've also tried inserting a `Thread.Sleep` to manually slow things down, but that only pushed the failure further out.
I close the workbook, quit the application, then call `ReleaseComObject` on the Excel object, followed by `GC.Collect` and `GC.WaitForPendingFinalizers`, at the end of each loop.
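For reference, the shape of each iteration is roughly this (a sketch: `ProcessFile` and `workbookPath` are placeholder names, the explicit `Workbooks` release is illustrative, and the actual data-capture steps are elided):

```csharp
using System;
using System.Runtime.InteropServices;
using Excel = Microsoft.Office.Interop.Excel;

static void ProcessFile(string workbookPath)
{
    Excel.Application app = null;
    Excel.Workbooks books = null;
    Excel.Workbook book = null;
    try
    {
        app = new Excel.Application { Visible = false, DisplayAlerts = false };
        books = app.Workbooks;          // keep a reference so it can be released too
        book = books.Open(workbookPath);

        // ... locate columns, grab data, clean, calculate ...
    }
    finally
    {
        // Release every RCW that was created, innermost first.
        if (book != null) { book.Close(false); Marshal.ReleaseComObject(book); }
        if (books != null) Marshal.ReleaseComObject(books);
        if (app != null) { app.Quit(); Marshal.ReleaseComObject(app); }
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();                   // second pass picks up objects freed by finalizers
        GC.WaitForPendingFinalizers();
    }
}
```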
My ideas at the moment are:
- Split the list in half and run the halves separately
- Open some number of new `Excel.Application()` instances in parallel, but run a list of files sequentially inside each instance (so kind of like #1, but taking a different path)
- Separate the list by file size, run a small set of very large files independently/sequentially, and run the rest as I have been
Things I am hoping to get some help with:
- Suggestions on making really sure my memory is getting cleared (maybe a `Process.Id` is getting twisted up in all the opening and closing?)
- Suggestions on ordering a parallel process - I'm wondering if throwing the 'big' guys in first would make the longer-running process more stable
I have been looking at http://reedcopsey.com/2010/01/26/parallelism-in-net-part-5-partitioning-of-work/, where he says: "With prior knowledge about your work, it may be possible to partition data more meaningfully than the default Partitioner." But I'm having a hard time knowing what kind of partitioning would make sense here, or whether it would help at all.
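Along those lines, one thing I'm considering (a sketch; `inputFolder` and `ProcessFile` are placeholders, and the cap of 4 is just a guess): sort the files largest-first so the long-running workbooks start early, then hand the ordered list to a load-balancing partitioner with a capped degree of parallelism.

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Largest files first, so the slow workbooks don't all land at the tail end.
var ordered = Directory.GetFiles(inputFolder, "*.xls*")
    .OrderByDescending(path => new FileInfo(path).Length)
    .ToList();

// loadBalance: true hands out items dynamically rather than in fixed contiguous
// ranges, so a thread that finishes a big file immediately picks up more work.
var partitioner = Partitioner.Create(ordered, loadBalance: true);

var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

Parallel.ForEach(partitioner, options, path => ProcessFile(path));
```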
Really appreciate any insights!
UPDATE
So as a general rule I test against Excel 2010, as we have both 2010 and 2013 in use here. I ran it against 2013 and it works fine - run time about 4 minutes, which is about what I would expect. Before I just abandon 2010 compatibility: any other ideas? The 2010 machine is a 64-bit machine with 64-bit Office, and the 2013 machine is a 64-bit machine with 32-bit Office. Would that matter at all?