
I'm running into a peculiar issue when trying to process large Excel files (300 MB+) using a data reader. The following code illustrates the way I open the Excel file and iterate over the rows in the sheet 'largesheet$':

// Requires: using System.Data; using System.Data.OleDb;
const string inputFilePath = @"C:\largefile.xlsx";
const string connectionString =
    "Provider=Microsoft.ACE.OLEDB.12.0;Extended Properties=\"Excel 12.0;IMEX=1;HDR=YES;\";Data Source=" +
    inputFilePath;

// Initialize connection
using (var connection = new OleDbConnection(connectionString))
{
    // Open connection
    connection.Open();

    // Configure command
    var command = new OleDbCommand("largesheet$", connection) {CommandType = CommandType.TableDirect};

    // Execute reader
    using (var reader = command.ExecuteReader()) // <-- Completely loads file/sheet into memory
    {
        // Iterate results; Read() advances to the next row and returns false after the last one
        while (reader.Read())
        {
            // ...
        }
    }

    // Close connection
    connection.Close();
}

My understanding is that this should open the Excel file and then load each row only when requested via the reader.Read() statement.
However, it appears that the ExecuteReader() statement does more than just return an OleDbDataReader instance. Using breakpoints I noticed that that single statement takes 30+ seconds, and the Windows Resource Monitor shows a steady increase in allocated memory while it executes.
Specifying the CommandBehavior parameter of the ExecuteReader() method (e.g. SequentialAccess) has no effect.
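For example, a call along these lines behaves no differently with respect to load time or memory:

var reader = command.ExecuteReader(CommandBehavior.SequentialAccess); // still buffers the entire sheet up front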

What am I doing wrong here? Are there alternative ways of processing large (Excel) files?

Note: the IMEX & HDR extended properties of the connection string are intentional.

Edit: After some rational thinking I assume it is not possible to process an Excel file without buffering it one way or another. Since Excel files are basically a glorified collection of compressed XML files, it is not possible to process a worksheet without decompressing it (and keeping it in RAM or temporarily saving it to disk).
The only alternative I can think of is using Microsoft.Office.Interop.Excel. Not sure how OpenXML handles it though.
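For reference, a rough sketch of what the streaming (SAX-like) route with the OpenXML SDK (the DocumentFormat.OpenXml package) might look like; the path, the sheet name 'largesheet' (no trailing '$' in OpenXML), and the per-cell handling are placeholders:

using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

// Open the workbook read-only; parts are read on demand rather than loaded up front.
using (var document = SpreadsheetDocument.Open(@"C:\largefile.xlsx", false))
{
    WorkbookPart workbookPart = document.WorkbookPart;

    // Locate the worksheet part for the sheet named "largesheet".
    Sheet sheet = workbookPart.Workbook.Descendants<Sheet>().First(s => s.Name == "largesheet");
    var worksheetPart = (WorksheetPart)workbookPart.GetPartById(sheet.Id);

    // Forward-only, SAX-like reading of the sheet XML.
    using (OpenXmlReader reader = OpenXmlReader.Create(worksheetPart))
    {
        while (reader.Read())
        {
            if (reader.ElementType == typeof(Cell) && reader.IsStartElement)
            {
                var cell = (Cell)reader.LoadCurrentElement();
                // cell.CellValue is the raw stored value; shared strings still require a
                // lookup in the SharedStringTablePart, which can itself be large.
            }
        }
    }
}

Whether this keeps memory flat in practice depends on what is done with each cell, but it at least avoids materializing the whole sheet before the first row is available.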

chrisv

1 Answer


From MSDN: "All rows and columns of the named table or tables will be returned when you call one of the Execute methods of a Command object." (under the Remarks section). So this would appear to be the default behavior of ExecuteReader().

ExecuteReader(CommandBehavior) may give you more options, particularly when CommandBehavior is set to SequentialAccess, though you would need to handle reading at the byte level.
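A minimal sketch of that pattern, assuming text columns that are read strictly in ordinal order (numeric columns would use the typed getters instead), might look like this:

// Sketch only: with SequentialAccess, columns must be read in order
// and large values are streamed in chunks via GetChars/GetBytes.
using (var reader = command.ExecuteReader(CommandBehavior.SequentialAccess))
{
    var buffer = new char[4096];
    while (reader.Read())
    {
        for (int i = 0; i < reader.FieldCount; i++)
        {
            if (reader.IsDBNull(i))
                continue;

            long offset = 0;
            long charsRead;
            while ((charsRead = reader.GetChars(i, offset, buffer, 0, buffer.Length)) > 0)
            {
                // process buffer[0..charsRead) here
                offset += charsRead;
            }
        }
    }
}

Whether the ACE OLE DB provider actually streams the underlying sheet in this mode is something you would need to verify for your file.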

BgRva
  • As I stated in my question, the use of _SequentialAccess_ makes no difference (I have only tried it in combination with _TableDirect_). I find it peculiar that ExecuteReader loads the data into memory; I thought one of the advantages of data readers was their agility and performance. Does the .NET framework offer any alternatives that are more suitable for processing large data files? – chrisv Nov 28 '13 at 07:35
  • Have you considered using OpenXML (http://msdn.microsoft.com/en-us/library/office/bb448854.aspx) for your tasks? This may give you more control, depending on what you need to do with the data once it is loaded. You could use the SAX-like reading provided by the SDK to control how the file is read. – BgRva Dec 02 '13 at 15:33