
I'm reading data from a StreamReader line by line inside the following while loop.

while (!sr.EndOfStream)
{
   string[] rows = sr.ReadLine().Split(sep);

   int incr = 0;
   foreach (var item in rows)
   {
       if (item == "NA" | item == "" | item == "NULL" | string.IsNullOrEmpty(item) | string.IsNullOrWhiteSpace(item))
       {
           rows[incr] = null;
       }
       ++incr;
   }
    // another logic ...
}

The code works fine, but it is very slow because the CSV files are huge (500,000,000 rows and hundreds of columns). Is there a faster way to check the data (if a value is "NA", "", ... it should be replaced by null)? Currently I'm using foreach with an incr variable to update items inside the loop.

I was wondering whether LINQ or a lambda would be faster, but I'm very new to these areas.

bkardol
mateskabe
  • LINQ is unlikely to be any faster. One minor thing you can do is get rid of the `item == ""` check, as you already have `string.IsNullOrEmpty(item)`; also, the `|` between the IsNullOrEmpty and IsNullOrWhiteSpace checks won't short-circuit, so change it to `||`. – SoronelHaetir Dec 29 '17 at 08:44
  • I suggest you save your data from CSV to database (relational or NoSQL on the circumstances), and then work with it. It will be fast. – Alexander Petrov Dec 29 '17 at 09:13
  • What is `// another logic`? Loop (`for` is a better choice than `foreach`) should be quite fast – Dmitry Bychenko Dec 29 '17 at 09:18
  • The "another logic" is a SqlBulkCopy part, but that is not the subject of this question. – mateskabe Dec 29 '17 at 09:24
  • I'm reading data from the StreamReader line by line, so rows seemed like a logical name for the variable here. The first part of the bottleneck is the data checking inside the while loop; the second seems to be the bulk copy, which I'll post as a separate question. – mateskabe Dec 29 '17 at 10:30
  • Check out this: https://stackoverflow.com/questions/17188357/read-large-txt-file-multithreaded – Ofir Winegarten Dec 29 '17 at 11:27
  • 1
    Have you profiled the program? The title mentions data checking inside the loop, but I would have thought that the time to do the checking and updating would be insignificant compared to the time to read from the file and split into columns (what you call rows) – sgmoore Dec 29 '17 at 13:32
  • Reading data from the flat file across lines (what you call columns) doesn't seem to be the bottleneck. I'm trying to figure out how to speed up the bulk-copy procedure, which should improve the program's performance. – mateskabe Dec 29 '17 at 19:41
  • And then adding it to the DataTable as a row (`dt.Rows.Add(rows)`), so I'd stay with the "rows" approach. – mateskabe Dec 30 '17 at 09:17
  • `dt.Rows.Add(rows)` In that case, `row` would be a better variable name. – mjwills Dec 30 '17 at 10:01
  • :-) definitely improves my solution – mateskabe Jan 02 '18 at 11:11

1 Answer


Firstly, don't use foreach when you are modifying the collection you iterate over; it's not a good habit, especially since you already maintain a counter variable. A plain for loop is the natural fit.

The loop can also be made multi-threaded using Parallel.For. Both versions are shown below.

Code using normal for:

while (!sr.EndOfStream)
{
    string[] rows = sr.ReadLine().Split(sep);

    for (int i = 0; i < rows.Length; i++)
    {
        // Simplified checks: IsNullOrWhiteSpace already covers "" and whitespace-only values, so this is safer and simpler.
        if (string.IsNullOrWhiteSpace(rows[i]) || rows[i] == "NA" || rows[i] == "NULL")
        {
            rows[i] = null;
        }
    }
    // another logic ...
}

Code using Parallel.For

while (!sr.EndOfStream)
{
    string[] rows = sr.ReadLine().Split(sep);

    Parallel.For(0, rows.Length, i =>
    {
        if (string.IsNullOrWhiteSpace(rows[i]) || rows[i] == "NA" || rows[i] == "NULL")
        {
            rows[i] = null;
        }
    });
    // another logic ...
}
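For reference, the cleanup can be pulled into a small self-contained helper, which makes it easy to test against a sample line. The `CsvClean` name and the `';'` separator below are just illustrations; in the question, `sep` comes from the surrounding code:

```csharp
using System;

static class CsvClean
{
    // Replaces placeholder values ("NA", "NULL", empty or whitespace-only)
    // with null, mutating the array in place.
    public static void NullifyPlaceholders(string[] fields)
    {
        for (int i = 0; i < fields.Length; i++)
        {
            if (string.IsNullOrWhiteSpace(fields[i]) || fields[i] == "NA" || fields[i] == "NULL")
                fields[i] = null;
        }
    }
}
```

For example, `"1;NA;;NULL;x".Split(';')` followed by `CsvClean.NullifyPlaceholders(fields)` leaves `{ "1", null, null, null, "x" }`.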

EDIT

We could also approach this from the other side, but I don't recommend it, because it requires a lot of RAM: the entire file has to be read into memory.

string[] lines = File.ReadAllLines("test.txt");
Parallel.For(0, lines.Length, x =>
{
    string[] rows = lines[x].Split(sep);

    for (int i = 0; i < rows.Length; i++)
    {
        if (string.IsNullOrWhiteSpace(rows[i]) || rows[i] == "NA" || rows[i] == "NULL")
        {
            rows[i] = null;
        }
    }
});

But I don't think it is worth it; you decide. Checks this cheap don't play well with parallelization: each iteration does so little work that the threading overhead outweighs the gain.
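There is a middle ground worth mentioning as a sketch: File.ReadLines enumerates the file lazily (unlike ReadAllLines), so the whole file never sits in memory, while Parallel.ForEach still partitions lines across threads. The `CsvParallel` name and the `perLine` callback are illustrative assumptions; the "another logic" part would go in the callback and must be thread-safe:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class CsvParallel
{
    // File.ReadLines yields lines lazily, so memory stays flat;
    // Parallel.ForEach partitions the enumeration across worker threads.
    public static void Process(string path, char sep, Action<string[]> perLine)
    {
        Parallel.ForEach(File.ReadLines(path), line =>
        {
            string[] rows = line.Split(sep);
            for (int i = 0; i < rows.Length; i++)
            {
                if (string.IsNullOrWhiteSpace(rows[i]) || rows[i] == "NA" || rows[i] == "NULL")
                    rows[i] = null;
            }
            perLine(rows); // "another logic" goes here; it must be thread-safe
        });
    }
}
```

Note that this trades the strict line ordering of the sequential while loop for throughput, which matters if the downstream step (e.g. SqlBulkCopy) cares about row order.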

Salah Akbari
rokkerboci