
I'm facing a weird performance difference when I read data from a large CSV file. If I read the data and build the dictionary in the same loop, as the snippet below shows, the method takes about 4.1 seconds to finish.

private void ReadFileWorkerRun(object sender, EventArgs e)
{
    List<Stock> lineTemp = new List<Stock>();
    List<Stock> allStock = new List<Stock>();
    List<List<Stock>> orderedAll = new List<List<Stock>>();
    Categories = new Dictionary<string, List<Stock>>() { { GlobalVariable.ALL, allStock } };
    DictionaryOrder = new List<(string, string)>();

    using (StreamReader lines = new StreamReader(FilePath))
    {
        string line = lines.ReadLine(); // read once to skip the CSV header row

        // Add each stock to dictionary
        while ((line = lines.ReadLine()) != null)
        {
            Stock temp = new Stock(line);

            // This is the upper boundary of the code that will move outside of the using statement
            if (!Categories.TryGetValue(temp.StockID, out List<Stock> targetList))
            {
                targetList = new List<Stock>();
                orderedAll.Add(targetList);
                Categories.Add(temp.StockID, targetList);
                DictionaryOrder.Add((temp.StockID, temp.StockName));
            }
            targetList.Add(temp);
            // This is the lower boundary of the code that will move outside of the using statement
        }
    }
    /*
    The code between the boundaries is moved here
    */
    foreach (List<Stock> stockList in orderedAll)
    {
        allStock.AddRange(stockList);
    }
}
public class Stock
{
    public string StockDate { get; set; }

    public string StockID { get; set; }

    public string StockName { get; set; }

    public string SecBrokerID { get; set; }

    public string SecBrokerName { get; set; }

    public decimal Price { get; set; }

    public long BuyQty { get; set; }

    public long SellQty { get; set; }

    public Stock(string s)
    {
        string[] data = s.Split(',');
        StockDate = data[0];
        StockID = data[1];
        StockName = data[2];
        SecBrokerID = data[3];
        SecBrokerName = data[4];
        Price = decimal.Parse(data[5]);
        BuyQty = long.Parse(data[6]);
        SellQty = long.Parse(data[7]);
    }
}

However, when I move the part that builds the dictionary out of the while loop and into a separate foreach loop, the method takes only about 3.4 seconds. The code in the using statement is split as shown below:

using (StreamReader lines = new StreamReader(FilePath))
{
    string line = lines.ReadLine(); // read once to skip the CSV header row

    while ((line = lines.ReadLine()) != null)
    {
        lineTemp.Add(new Stock(line));
    }
}

// Add each stock to dictionary
foreach (Stock temp in lineTemp)
{
    if (!Categories.TryGetValue(temp.StockID, out List<Stock> targetList))
    {
        targetList = new List<Stock>();
        orderedAll.Add(targetList);
        Categories.Add(temp.StockID, targetList);
        DictionaryOrder.Add((temp.StockID, temp.StockName));
    }
    targetList.Add(temp);
}

The only difference between the two versions is the code listed in the second part, and the time gap is consistent no matter how many times I run it. How can code with the same logic and the same data structures behave so differently?
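
For reference, both versions are timed the same way, with a Stopwatch around the call to the method. The harness looks roughly like the sketch below (a minimal illustration, not my exact code):

// requires using System; and using System.Diagnostics;
Stopwatch sw = Stopwatch.StartNew();
ReadFileWorkerRun(this, EventArgs.Empty); // the method shown above
sw.Stop();
Console.WriteLine($"Elapsed: {sw.Elapsed.TotalSeconds:F1} s");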

謝康豪
  • Have you tried running the program multiple times to see whether you consistently get the difference? Since you are reading from a file, one run may be a little slower and the next a little faster. Try running both versions 4-5 times and compare the average difference. – mksmanjit Jul 01 '22 at 07:21
  • Welcome to Stack Overflow. "I'm new to stackoverflow, and if I neglect some tips for asking a proper question, feel free to notice me." In my view, writing this is actually the thing you really got wrong - this is **not a discussion forum**, so the question should **only** contain the question itself (and the related code etc. needed to understand it). It's not necessary to make excuses for yourself; we are supposed to be polite anyway, and we will help you fix what we can, and tell you about the things where your input is needed. – Karl Knechtel Jul 01 '22 at 07:26
  • If you really want to make the question the best it can be, though, please read [mre], and consider if the code can be simplified while still demonstrating the problem. – Karl Knechtel Jul 01 '22 at 07:26
  • Yes, I have run it several times and the result is always consistent. – 謝康豪 Jul 01 '22 at 07:38
  • How did you count the time? Did the program run in release mode? Does `Stock` have relevant code to the question? – shingo Jul 01 '22 at 07:45
  • @shingo Thanks. I added the detail of the `Stock` class. I count the time by simply using a Stopwatch and the method to count time is the same in both scenarios. – 謝康豪 Jul 01 '22 at 08:01
  • @KarlKnechtel I appreciate your advice, and I'll do my best to make the question better and more readable. – 謝康豪 Jul 01 '22 at 08:03
  • 1
    I would suggest measuring against a memory stream rather than an actual file. Measuring anything involving IO can be quite complicated due to caching etc. So I would recommend copying your data into a memory stream first. – JonasH Jul 01 '22 at 08:36
  • I've dived into the internals of the `AddRange()` and `Add()` operations, which I believe are the issue. The problem is that `AddRange()` actually performs worse than `Add()` if called in a tight loop with small lists. That's because `AddRange()` performs an additional allocation and a copy inside. See this: https://stackoverflow.com/questions/2123161/listt-addrange-implementation-suboptimal – freakish Jul 01 '22 at 10:02
  • @謝康豪 To test that, in the first snippet please change `AddRange()` to a foreach with `Add()` (see the second sketch after this thread), and please let me know about the result. I wonder whether that's the issue. – freakish Jul 01 '22 at 10:02
  • Another potential slow down may come from the fact that a list of lists is slower than just a list. Because of the additional dereferencing. Again, especially true if lists are small, like 1 element each. – freakish Jul 01 '22 at 10:07
  • @freakish Although there might be a performance issue in the `AddRange` method, **both** snippets call `AddRange`. To be more specific, I added some comments to the upper snippet. – 謝康豪 Jul 06 '22 at 00:41
  • @謝康豪 right, I misread your code. I've managed to reproduce your result. I removed `allStock`, `orderedAll` and `DictionaryOrder` collections, removed final `.AddRange()` loop and replaced file i/o with MemoryStream. I've also tested this with different `Stock` class fields. The difference in performance is still there (even though no i/o is performed). The only explanation I can think of is that this is due to CPU cache behaviour or branch predictor misbehaving. – freakish Jul 06 '22 at 09:10
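
Following up on JonasH's suggestion, here is a minimal sketch of the MemoryStream approach, assuming the whole file fits in memory (`FilePath` is the same field used in the question):

// requires using System.IO;
byte[] bytes = File.ReadAllBytes(FilePath); // load the file once, outside the timed section

using (var stream = new MemoryStream(bytes))
using (StreamReader lines = new StreamReader(stream))
{
    string line = lines.ReadLine(); // skip the CSV header row

    while ((line = lines.ReadLine()) != null)
    {
        // ... same parsing and dictionary code as in either version above ...
    }
}

And the `Add()`-for-`AddRange()` substitution freakish asked about would change the final loop of the first snippet to something like:

foreach (List<Stock> stockList in orderedAll)
{
    foreach (Stock stock in stockList)
    {
        allStock.Add(stock); // add elements one at a time instead of calling AddRange
    }
}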

0 Answers