
This is a C#/VSTO program. I've been working on a data capture project. The scope is basically "process Excel files sent by a variety of third-party companies." Practically, this means:

  1. Locate the columns that contain the data I want via a search method.
  2. Grab the data out of the workbooks.
  3. Clean the data, run some calculations, etc.
  4. Output the cleaned data into a new workbook.

The program I have written works great for small to medium data sets, ~25 workbooks with a combined total of ~1,000 rows of relevant data. I'm grabbing 7 columns' worth of data out of these workbooks. One edge case, though, is that occasionally I need to run a much larger data set, ~50 workbooks with a combined total of ~8,000 rows of relevant data (and possibly another ~2,000 rows of duplicate data that I also have to remove).

I am currently putting a list of the files through a Parallel.ForEach loop, inside of which I open a new Excel.Application() to process each file (each with multiple worksheets). The parallel version runs much faster on the smaller data set than processing the files sequentially, but on the larger data set I seem to hit a wall.
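
A stripped-down sketch of that loop (the file list and ProcessWorkbook are simplified placeholders, not my exact code):

    // using Excel = Microsoft.Office.Interop.Excel;
    Parallel.ForEach(files, file =>
    {
        var excelApp = new Excel.Application();   // one Excel instance per file
        Excel.Workbook workbook = null;
        try
        {
            workbook = excelApp.Workbooks.Open(file);
            ProcessWorkbook(workbook);            // locate columns, grab data, clean, calculate
        }
        finally
        {
            if (workbook != null) workbook.Close(false);
            excelApp.Quit();
            // COM release and GC shown further down
        }
    });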

I start getting the message "Microsoft Excel is waiting for another application to complete an OLE action," and eventually it just fails. Switching back to a sequential foreach does allow the program to finish, but it just grinds along - going from 1-3 minutes for a parallel run of the medium data set to 20+ minutes for a sequential run of the large one. If I set ParallelOptions.MaxDegreeOfParallelism to 10, it will complete the cycle, but still takes 15 minutes. If I set it to 15, it fails. I also really don't like messing with TPL settings if I don't have to. I've also tried inserting a Thread.Sleep to manually slow things down, but that only made the failure happen further out.
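
For reference, this is how I'm capping the degree of parallelism (10 being the value mentioned above):

    var options = new ParallelOptions { MaxDegreeOfParallelism = 10 };
    Parallel.ForEach(files, options, file => { /* per-file work as above */ });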

I close the workbook, quit the application, then call ReleaseComObject on the Excel object, and GC.Collect and GC.WaitForPendingFinalizers at the end of each loop iteration.
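
Simplified, the cleanup at the end of each iteration looks roughly like this:

    // using System.Runtime.InteropServices;
    workbook.Close(false);
    excelApp.Quit();
    Marshal.ReleaseComObject(workbook);
    Marshal.ReleaseComObject(excelApp);
    workbook = null;
    excelApp = null;
    GC.Collect();
    GC.WaitForPendingFinalizers();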

My ideas at the moment are:

  1. Split the list in half and run the halves separately
  2. Open some number of new Excel.Application() instances in parallel, but run a list of files sequentially inside each instance (so kind of like #1, but taking a different path - a rough sketch is below this list)
  3. Separate the list by file size, run a small set of very large files independently/sequentially, and run the rest as I have been
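
A rough sketch of what I mean by #2 (the instance count and the round-robin chunking are just placeholders):

    // using System.Linq;
    int instanceCount = 4;   // assumed; would need tuning
    var chunks = files
        .Select((file, index) => new { file, index })
        .GroupBy(x => x.index % instanceCount, x => x.file);

    Parallel.ForEach(chunks, chunk =>
    {
        var excelApp = new Excel.Application();   // one instance per chunk
        try
        {
            foreach (var file in chunk)           // files within a chunk run sequentially
            {
                var workbook = excelApp.Workbooks.Open(file);
                // process the workbook as usual...
                workbook.Close(false);
                Marshal.ReleaseComObject(workbook);
            }
        }
        finally
        {
            excelApp.Quit();
            Marshal.ReleaseComObject(excelApp);
        }
    });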

Things I am hoping to get some help with:

  1. Suggestions on making really sure my memory is getting cleared (maybe Process.Id is getting twisted up in all the opening and closing?)
  2. Suggestions on ordering a parallel process - I'm wondering whether throwing the 'big' files in first would make the longer-running process more stable (see the sketch after this list).
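
For #2 above, ordering by file size before the loop is the kind of thing I have in mind (using FileInfo.Length as a rough proxy for how long a file will take):

    var ordered = files
        .OrderByDescending(f => new FileInfo(f).Length)   // biggest workbooks first
        .ToList();

    Parallel.ForEach(ordered, file => { /* per-file work as above */ });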

I have been looking at http://reedcopsey.com/2010/01/26/parallelism-in-net-part-5-partitioning-of-work/, where he says, "With prior knowledge about your work, it may be possible to partition data more meaningfully than the default Partitioner." But I'm having a hard time knowing whether, and how, partitioning makes sense here.
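
If I'm reading the article right, a load-balancing partitioner over the size-ordered list might be the kind of "meaningful partitioning" he's describing - but this is just my guess:

    // using System.Collections.Concurrent;
    var ordered = files
        .OrderByDescending(f => new FileInfo(f).Length)
        .ToList();

    // loadBalance: true hands work out dynamically instead of in fixed ranges,
    // so the large files at the front of the list get picked up early.
    var partitioner = Partitioner.Create(ordered, loadBalance: true);

    Parallel.ForEach(partitioner, file => { /* per-file work as above */ });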

Really appreciate any insights!

UPDATE

As a general rule I test against Excel 2010, since we have both 2010 and 2013 in use here. I ran it against 2013 and it works fine - run time about 4 minutes, which is about what I would expect. Before I just abandon 2010 compatibility, any other ideas? The 2010 machine is a 64-bit machine with 64-bit Office, and the 2013 machine is a 64-bit machine with 32-bit Office. Would that matter at all?

getglad
  • I wonder if deadlock is occurring because you've exhausted all available threads in the thread pool? You might try increasing the limit with ThreadPool.SetMaxThreads to see if that helps. – RogerN Jun 27 '14 at 20:42
  • 50 Excel files to read is actually not a big deal, provided that each workbook contains fewer than 10,000 rows (as per your problem definition). Try to optimize your file reader; in particular, convert Excel ranges to C# arrays - once you have the ranges converted to arrays, the rest of the processing should run really fast (in general). Rgds, – Alexander Bell Jun 27 '14 at 21:49
  • Thanks! Just getting into work - I'll look into both of these ideas and report back – getglad Jun 30 '14 at 14:52
  • @AlexBell So you're recommending something like using `object[,] values = range.Value as object[,]` and processing it from there, the idea being that it would let me grab the data out fast and get the Excel instances out of the way? – getglad Jun 30 '14 at 19:25
  • Yes, something like this; if you know the data types, the array could be strongly typed (e.g. string[], string[,], or string[][], whichever is the best fit for your case). Rgds, – Alexander Bell Jun 30 '14 at 19:30
  • @AlexBell - This is looking like a 2010 v 2013 issue (see update). All the sheets open fine in 2010 without throwing errors. – getglad Jun 30 '14 at 20:19

2 Answers


A few years ago I worked with Excel files and automation, and I had problems with zombie Excel processes in Task Manager. Although our program ended and I thought I had quit Excel properly, the processes were not exiting.

The solution was not something I liked, but it was effective. I can summarize it like this:

1) Never use two dots consecutively, like:

workBook.ActiveSheet.PageSetup

Instead, use intermediate variables; when you are done, release and null them.

For example, instead of doing this:

m_currentWorkbook.ActiveSheet.PageSetup.LeftFooter = str.ToString();

follow the practices in this function (it adds a barcode to the Excel footer):

    private bool SetBarcode(string text)
    {
        Excel._Worksheet sheet = (Excel._Worksheet)m_currentWorkbook.ActiveSheet;
        try
        {
            StringBuilder str = new StringBuilder();
            str.Append(@"&""IDAutomationHC39M,Regular""&22(");
            str.Append(text);
            str.Append(")");

            Excel.PageSetup setup = sheet.PageSetup;
            try
            {
                setup.LeftFooter = str.ToString();
            }
            finally
            {
                RemoveReference(setup);
                setup = null;
            }
        }
        finally
        {
            RemoveReference(sheet);
            sheet = null;
        }

        return true;
    }

Here is the RemoveReference function (setting the object to null inside this function alone did not work, which is why callers also null their own variables):

    private void RemoveReference(object o)
    {
        try
        {
            // Decrement the COM reference count so the RCW can be released.
            System.Runtime.InteropServices.Marshal.ReleaseComObject(o);
        }
        catch
        { }
        finally
        {
            // Only clears the local parameter; callers still null their own references.
            o = null;
        }
    }

If you follow this pattern EVERYWHERE, it guarantees no leaks, no zombie processes, etc.

2) To create Excel files you can use the Excel application, but to get data out of Excel I suggest using OleDb. You can treat Excel like a database and get data from it with SQL queries, DataTables, etc.

Sample code (instead of filling a DataSet, you can use a DataReader for better memory usage - see the sketch after this sample):

    private List<DataTable> getMovieTables()
    {
        List<DataTable> movieTables = new List<DataTable>();
        var connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + excelFilePath + ";Extended Properties=\"Excel 12.0;IMEX=1;HDR=NO;TypeGuessRows=0;ImportMixedTypes=Text\"";
        using (var conn = new OleDbConnection(connectionString))
        {
            conn.Open();

            DataRowCollection sheets = conn.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, new object[] { null, null, null, "TABLE" }).Rows;

            foreach (DataRow sheet in sheets)
            {

                using (var cmd = conn.CreateCommand())
                {
                    cmd.CommandText = "SELECT * FROM [" + sheet["TABLE_NAME"].ToString() + "] ";

                    var adapter = new OleDbDataAdapter(cmd);
                    var ds = new DataSet();
                    try
                    {
                        adapter.Fill(ds);
                        movieTables.Add(ds.Tables[0]);
                    }
                    catch (Exception ex)
                    {
                        //Debug.WriteLine(ex.ToString());
                        continue;
                    }
                }
            }
        }
        return movieTables;
    }
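
A minimal sketch of the DataReader variant mentioned above (same connection string as in the sample; the sheet name is just an example):

    using (var conn = new OleDbConnection(connectionString))
    {
        conn.Open();
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = "SELECT * FROM [Sheet1$]";   // example sheet name
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // pull only the columns you need, e.g. reader.GetValue(0)
                }
            }
        }
    }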
  • I have a lot of poorly structured data, and I won't always know where my data is located, thus my step #1 of locating the data. To be sure I am following, your suggestion would then be that I build my database/dataset by calling all these spreadsheets (1 iteration of connectionString per file), and then perform my 4 steps? – getglad Aug 10 '14 at 22:10

As an alternative to the solution proposed by @Mustafa Düman, I recommend using version 4 beta of EPPlus. I have used it without problems in several projects.

Pros:

  • Fast
  • No memory leaks (I can't say the same for versions < 4)
  • Does not require Office to be installed on the machine where you use it

Cons:

  • Can be used only for .xlsx files (Excel 2007 / 2010)

I tested it with the following code on 20 Excel files of around 12.5 MB each (over 50k records in each file), and I think it's enough to mention that it didn't crash :)

 Console.Write("Path: ");
 var path = Console.ReadLine();
 var dirInfo = new DirectoryInfo(path);

 while (string.IsNullOrWhiteSpace(path) || !dirInfo.Exists)
 {
     Console.WriteLine("Invalid path");
     Console.Write("Path: ");
     path = Console.ReadLine();
     dirInfo = new DirectoryInfo(path);
 }

 string[] files = null;
 try
 {
     files = Directory.GetFiles(path, "*.xlsx", SearchOption.AllDirectories);
 }
 catch (Exception ex)
 {
     Console.WriteLine(ex.Message);
     Console.ReadLine();
     return;
 }

 Console.WriteLine("{0} files found.", files.Length);

 if (files.Length == 0)
 {
     Console.ReadLine();
     return;
 }

 int succeded = 0;
 int failed = 0;


 Action<string> LoadToDataSet = (filePath) =>
 {
     try
     {
         FileInfo fileInfo = new FileInfo(filePath);
         using (ExcelPackage excel = new ExcelPackage(fileInfo))
         using (DataSet dataSet = new DataSet())
         {
             int workSheetCount = excel.Workbook.Worksheets.Count;

             for (int i = 1; i <= workSheetCount; i++)
             {
                 var worksheet = excel.Workbook.Worksheets[i];

                 var dimension = worksheet.Dimension;
                 if (dimension == null)
                     continue;

                 bool hasData = dimension.End.Row >= 1;

                 if (!hasData)
                     continue;

                 DataTable dataTable = new DataTable();

                 //add columns
                 foreach (var firstRowCell in worksheet.Cells[1, 1, 1, dimension.End.Column])
                     dataTable.Columns.Add(firstRowCell.Start.Address);

                 for (int j = 0; j < dimension.End.Row; j++)
                     dataTable.Rows.Add(worksheet.Cells[j + 1, 1, j + 1, dimension.End.Column].Select(erb => erb.Value).ToArray());

                 dataSet.Tables.Add(dataTable);
             }

             dataSet.Clear();
             dataSet.Tables.Clear();
         }

         Interlocked.Increment(ref succeded);
     }
     catch (Exception)
     {
         Interlocked.Increment(ref failed);
     }
 };

 Stopwatch sw = new Stopwatch();

 sw.Start();
 files.AsParallel().ForAll(LoadToDataSet);
 sw.Stop();

 Console.WriteLine("{0} succeded, {1} failed in {2} seconds", succeded, failed, sw.Elapsed.TotalSeconds);
 Console.ReadLine();
B0Andrew
  • Hey - I apologize for not responding here. Very cool package. I can't use it for this specific project due to some other parameters, but I will keep this in my back pocket in the future. Thnx! – getglad Aug 10 '14 at 21:53