
I have to extract specific information from a very large CSV file with C#. The file is 233 MB in size, with 1,000,008 rows and 28 columns, and I need to insert that information into a DataTable. I have tried both OleDB and Aspose.Cells, but both ways are too slow.

ASPOSE CODE:

public static DataTable CsvReader(string CsvFile)
{
    DataTable DtTemp = new DataTable();
    try
    {
        System.Windows.Forms.Application.DoEvents();
        Aspose.Cells.LoadOptions loadOptions = new LoadOptions(LoadFormat.CSV);
        Aspose.Cells.Workbook workbook = new Aspose.Cells.Workbook(CsvFile, loadOptions);
        string SheetName = workbook.Worksheets[0].Name;
        DtTemp = workbook.Worksheets[SheetName].Cells.ExportDataTable(0, 0, workbook.Worksheets[SheetName].Cells.MaxDataRow + 1, workbook.Worksheets[SheetName].Cells.MaxDataColumn + 1);
        //filter DtTemp datatable based on 4th column
    }
    catch (Exception) { DtTemp.Reset(); }
    return DtTemp;
}

OLEDB CODE:

public static DataTable CsvReader(string CsvFile)
{
    DataTable DtTemp = new DataTable();
    try
    {
        System.Windows.Forms.Application.DoEvents();
        OleDbConnection conn = null;
        string strConnString = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + Path.GetDirectoryName(CsvFile) + "; Extended Properties='text; HDR=NO; IMEX=3; FMT=Delimited(,)';";
        string sql_select;
        conn = new OleDbConnection(strConnString.Trim());
        sql_select = "select * from [" + Path.GetFileName(CsvFile) + "] WHERE F4 Like '%27628%' OR F4 Like '%35627%'";
        conn.Open();
        OleDbCommand cmd = new OleDbCommand(sql_select, conn);
        OleDbDataAdapter obj_oledb_da = new OleDbDataAdapter(cmd);
        DataTable dtSchema = new DataTable();
        obj_oledb_da.FillSchema(dtSchema, SchemaType.Source);
        if (dtSchema != null)
        {
            writeSchema(dtSchema, CsvFile);
            obj_oledb_da.Fill(DtTemp);
        }
    }
    catch (Exception) { DtTemp.Reset(); }
    return DtTemp;
}
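For comparison, the same filter can also be streamed through `OleDbCommand.ExecuteReader()` and `DataTable.Load()` instead of a `DataAdapter`, which skips the separate `FillSchema` pass. This is only a sketch, assuming the same Jet provider and filter as above (Jet 4.0 requires a 32-bit process); the class and method names are illustrative:

```csharp
using System.Data;
using System.Data.OleDb;
using System.IO;

public static class CsvOleDbStreaming
{
    // Hypothetical variant: let DataTable.Load consume the IDataReader
    // directly, so rows stream into the table as they are read.
    public static DataTable CsvReaderStreaming(string csvFile)
    {
        DataTable dtTemp = new DataTable();
        string connString =
            "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + Path.GetDirectoryName(csvFile) +
            ";Extended Properties='text;HDR=NO;IMEX=3;FMT=Delimited(,)';";
        string sql = "SELECT * FROM [" + Path.GetFileName(csvFile) + "]" +
                     " WHERE F4 LIKE '%27628%' OR F4 LIKE '%35627%'";

        using (OleDbConnection conn = new OleDbConnection(connString))
        using (OleDbCommand cmd = new OleDbCommand(sql, conn))
        {
            conn.Open();
            using (OleDbDataReader reader = cmd.ExecuteReader())
            {
                dtTemp.Load(reader); // streams the filtered rows
            }
        }
        return dtTemp;
    }
}
```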
  • What are you going to do with it after you read it? Do you really need to load the entire CSV into memory as a `DataTable`, or would it be enough to process it row-by-row? Also, do you know the schema for the CSV file in advance? Could you define some [DTO](https://en.wikipedia.org/wiki/Data_transfer_object) POCO corresponding to the CSV's schema? – dbc Aug 17 '23 at 22:06
  • Also, how slow was OleDB? – dbc Aug 17 '23 at 22:07
  • Oledb would be the fastest method. The only way of speeding up is to make sure the disk is not fragmented, use a faster disk, use a faster computer, and use a computer with more memory. – jdweng Aug 17 '23 at 22:23
  • @jdweng - we really need to know what they're doing with it. If they are uploading to SQL Server via `SqlBulkCopy` then maybe setting [`SqlBulkCopy.EnableStreaming = true`](https://learn.microsoft.com/en-us/dotnet/api/microsoft.data.sqlclient.sqlbulkcopy.enablestreaming?view=sqlclient-dotnet-standard-5.1) and streaming using an `IDataReader` would be faster. Or if they only need to process one row at a time, `CsvHelper` might be better. But, yes, if they really need a `DataTable` then I agree that Oledb is probably best. – dbc Aug 17 '23 at 22:28
  • "I have to extract specific information from a too big CSV file" — so it sounds like you need only some of the information from the big file. Then don't load everything into memory; process one row after the other and stop once you have it. – Tim Schmelter Aug 17 '23 at 22:45
  • @dbc : Oledb doesn't really open the file so it is always faster than any method that opens the file. – jdweng Aug 17 '23 at 23:08
  • @jdweng "Oledb doesn't really open the file"... uh, that's a remarkably bad take on what is happening. It doesn't load the entire file into memory perhaps, but you can't get data from a file without opening it. Oledb streams as much as possible, but it still has to read the entire file. – Corey Aug 18 '23 at 00:57
  • @Corey : Oledb goes through the file structure to access data so it does not open the file using the OS. – jdweng Aug 18 '23 at 09:18
  • @jdweng Again, that's utterly wrong. It uses the standard OS file system APIs to open the file and read the contents. Or do you think it has a way to magically access the data without involving the other components of the OS? Please think about what you're saying. – Corey Aug 18 '23 at 13:09
  • @Corey : I know exactly what I am saying. You can access any data on a file system by accessing the sectors on the disk. The API you are referring to is not using the OS to open. It is accessing the sectors directly. This way you do not have to buffer the data in memory and is much quicker. – jdweng Aug 18 '23 at 14:12
  • @jdweng - if that's true, then it should be faster still to read the CSV file line-by-line using `OleDbCommand.ExecuteReader()`. OP gets the advantage of accessing the sectors directly while avoiding loading everything into a `DataTable`. Agree? – dbc Aug 18 '23 at 21:01
  • @jdweng Oledb uses unbuffered IO and is very fast for random access of structured or indexed data. It still has to process CSV sequentially, just like everything else, and that's not the same as "Oledb doesn't really open the file" and certainly doesn't mean that it "does not open the file using the OS." It's not directly reading sectors off the hard drive, it's using the OS to get the data. – Corey Aug 18 '23 at 21:34
  • @dbc : Yes. All I said was that OleDB should be the fastest method. You would need to do a comparison to determine whether reading line-by-line or using a DataTable is faster. Reading line-by-line may be faster, but you need to include the post-processing of the line-by-line result to determine which is fastest. I also like to consider maintenance and debugging: one line that fills a table, even if it takes a little longer, may be better than 20 lines of faster code. – jdweng Aug 19 '23 at 06:55
  • If you need to extract large CSV data (line by line) in event-driven mode efficiently, you may try the LightCells APIs (https://docs.aspose.com/cells/net/using-lightcells-api/) provided by Aspose.Cells, a lightweight mode that gives better performance. See the following sample code segment: e.g., var options = new LoadOptions(LoadFormat.CSV); options.LightCellsDataHandler = handler; var book = new Workbook("TestInputLarge.csv", options); PS. I am working as Support developer/Evangelist at Aspose. – Amjad Sahi Aug 19 '23 at 10:16
  • @AmjadSahi - OP already tried Aspose so if you can improve the code shown in their question you should add an answer. – dbc Aug 19 '23 at 10:27
  • @dbc I added my reply with sample code segment there, Thanks. – Amjad Sahi Aug 21 '23 at 10:06
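As an illustration of the row-by-row approach suggested in the comments, here is a minimal `StreamReader` sketch that keeps only the rows whose 4th column matches the question's filter, instead of loading all 1,000,008 rows. It assumes (my assumption, not stated in the question) a plain comma-delimited file with no quoted or embedded commas; the column names and the 0-based index 3 for the 4th column are illustrative:

```csharp
using System.Data;
using System.IO;

public static class CsvLineFilter
{
    public static DataTable CsvReaderStreaming(string csvFile)
    {
        DataTable dt = new DataTable();
        for (int i = 0; i < 28; i++)                 // 28 columns per the question
            dt.Columns.Add("F" + (i + 1), typeof(string));

        using (var reader = new StreamReader(csvFile))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Naive split: only valid if no field contains a quoted comma.
                string[] fields = line.Split(',');
                if (fields.Length < 4) continue;     // skip malformed rows

                string f4 = fields[3];               // 4th column, 0-based index 3
                if (f4.Contains("27628") || f4.Contains("35627"))
                    dt.Rows.Add(fields);             // keep only matching rows
            }
        }
        return dt;
    }
}
```

For a file with quoted fields, a dedicated parser such as CsvHelper (mentioned in the comments) would be the safer choice; the point here is only that a single sequential pass avoids materializing the whole file.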

1 Answer

To extract large CSV data line by line in an event-driven mode efficiently, you may try the LightCells APIs provided by Aspose.Cells for .NET, a lightweight mode that gives better performance. Moreover, you can fill data into a DataTable row by row in your LightCells implementation.

See the following sample code:

public void TestExportingDataTable()
{
    // Initialize the table according to your data.
    // If possible, you can also connect to the database and send data
    // to it directly in DataTableExporter.
    DataTable dt = new DataTable();
    dt.Columns.Add("Column1", typeof(string));
    dt.Columns.Add("Column2", typeof(DateTime));
    dt.Columns.Add("Column3", typeof(double));

    TxtLoadOptions opts = new TxtLoadOptions();
    // If you can optimize the parsing yourself, disabling the automatic
    // conversion and using your own logic should be better for performance.
    opts.ConvertDateTimeData = false;
    opts.ConvertNumericData = false;
    // Using LightCells improves performance in both time and memory cost.
    opts.LightCellsDataHandler = new DataTableExporter(dt);
    Workbook wb = new Workbook("largedata.csv", opts);
    // Here you may check the generated data table by yourself.
}

private class DataTableExporter : LightCellsDataHandler
{
    private readonly DataTable mTable;
    private DataRow mDataRow;

    public DataTableExporter(DataTable dt)
    {
        mTable = dt;
    }

    public bool StartSheet(Worksheet sheet)
    {
        return true;
    }

    public bool StartRow(int row)
    {
        mDataRow = mTable.NewRow();
        mTable.Rows.Add(mDataRow);
        return true; // process this row
    }

    public bool ProcessRow(Row row)
    {
        return true;
    }

    public bool StartCell(int col)
    {
        return true;
    }

    public bool ProcessCell(Cell cell)
    {
        // In this sample we just put values into the DataTable in memory.
        // If you can save the data to a database directly, we think it may
        // improve performance significantly too.
        int col = cell.Column;
        if (col == 1)
        {
            mDataRow[1] = DateTime.Parse(cell.StringValue);
        }
        else if (col == 2)
        {
            mDataRow[2] = double.Parse(cell.StringValue) + col;
        }
        else
        {
            mDataRow[col] = cell.StringValue;
        }
        // Returning false tells Aspose.Cells not to keep the cell in the
        // worksheet model, which saves memory.
        return false;
    }
}

We hope this will help you.

You may also post your queries or discuss further in the dedicated forum.

PS. I am working as Support developer/ Evangelist at Aspose.

Amjad Sahi