We have a problem which seems to be caused by the constant allocation and deallocation of memory:

We have a rather complex system here, where a USB device is measuring arbitrary points and sending the measurement data to the PC at a rate of 50k samples per second. In the software these samples are collected into a MeasurementTask per point and afterwards processed, which requires even more memory because of the calculations.
Simplified, each MeasurementTask looks like the following:

public class MeasurementTask
{
    public LinkedList<Sample> Samples { get; set; }
    public ComplexSample[] ComplexSamples { get; set; }
    public Complex Result { get; set; }
}

Where Sample looks like:

public class Sample
{
    public ushort CommandIndex;
    public double ValueChannel1;
    public double ValueChannel2;
}

and ComplexSample like:

public class ComplexSample
{
    public double Channel1Real;
    public double Channel1Imag;

    public double Channel2Real;
    public double Channel2Imag; 
}

In the calculation process each Sample is first turned into a ComplexSample and then further processed until we get our Complex Result. After these calculations are done we release all the Sample and ComplexSample instances, and the GC cleans them up soon after, but this results in a constant "up and down" of the memory usage.
This is how it looks at the moment with each MeasurementTask containing ~300k samples: [graph: RAM usage]
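
To make it a bit more concrete, the processing step roughly has the following shape (the actual math is omitted; CalculateComplex and Aggregate are just placeholders for our real calculations). The important part is that every Sample produces one new ComplexSample instance, so a task with ~300k samples allocates ~300k short-lived objects per run:

// Simplified - CalculateComplex and Aggregate stand in for the real math.
ComplexSample[] complexSamples = new ComplexSample[task.Samples.Count];
int i = 0;
foreach (Sample sample in task.Samples)
{
    // one new ComplexSample object per Sample
    complexSamples[i++] = CalculateComplex(sample);
}
task.ComplexSamples = complexSamples;

// further processing of the ComplexSamples down to the single Complex result
task.Result = Aggregate(complexSamples);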

Now we sometimes have the problem that the sample buffer in our HW device overflows, as it can only store ~5000 samples (~100 ms) and the application apparently does not always read from the device fast enough (we use BULK transfers with LibUSB/LibUSBDotNet). We tracked this problem down to the "memory up and down" based on the following facts:

  • the reading from the USB device happens in its own thread which runs at ThreadPriority.Highest, so the calculations should not interfere
  • CPU usage is between 1-5% on my 8-core CPU => <50% of one core
  • if we have (much) faster MeasurementTasks with only a few hundred samples each, the memory goes up and down only very little and the buffer never overflows (but the number of instances per second is the same, as the device still sends 50k samples/second)
  • we had a bug before which did not release the Sample and ComplexSample instances after the calculations, so the memory only went up at ~2-3 MB/s and the buffer overflowed all the time

At the moment (after fixing the bug mentioned above) we see a direct correlation between the sample count per point and the overflows: more samples/point = higher memory delta = more overflows.

Now to the actual question: Can this behaviour be improved (easily)?
Maybe there is a way to tell the GC/runtime to not release the memory so there is no need to re-allocate?
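
(A runtime setting like the GC latency mode is the kind of thing we have in mind here, but we have not tested whether it helps at all; purely illustrative:)

using System.Runtime;

// Untested idea: SustainedLowLatency (.NET 4.5+) asks the GC to avoid
// blocking gen-2 collections while it is set. It does not keep memory from
// being released, so it may just increase the working set without helping.
GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;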

We also thought of an alternative approach of "re-using" the LinkedList<Sample> and ComplexSample[]: keep a pool of such lists/arrays, and instead of releasing them put them back into the pool and "change" these instances instead of creating new ones. But we are not sure this is a good idea, as it adds complexity to the whole system...
But we are open to other suggestions!
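
To make the pooling idea a bit more concrete, this is roughly what we have in mind (untested sketch; ComplexSampleBufferPool and the fixed buffer size are just examples - in reality each buffer would have to be at least as large as the task's sample count):

using System.Collections.Generic;

// Rough sketch of the kind of pool we are considering (not our real code).
// Buffers are "rented" instead of allocated and "returned" instead of dropped,
// so they never become garbage for the GC.
public class ComplexSampleBufferPool
{
    private readonly Stack<ComplexSample[]> _buffers = new Stack<ComplexSample[]>();
    private readonly int _bufferSize;

    public ComplexSampleBufferPool(int bufferSize)
    {
        _bufferSize = bufferSize;
    }

    public ComplexSample[] Rent()
    {
        lock (_buffers)
        {
            return _buffers.Count > 0 ? _buffers.Pop() : CreateBuffer();
        }
    }

    public void Return(ComplexSample[] buffer)
    {
        lock (_buffers)
        {
            _buffers.Push(buffer);
        }
    }

    private ComplexSample[] CreateBuffer()
    {
        // ComplexSample is currently a class, so the elements are created once
        // here and afterwards only "changed" (overwritten) for each new task.
        var buffer = new ComplexSample[_bufferSize];
        for (int i = 0; i < buffer.Length; i++)
            buffer[i] = new ComplexSample();
        return buffer;
    }
}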


UPDATE:
I now optimized the code base with the following improvements and did various test runs:

  • converted Sample to a struct
  • got rid of the LinkedList<Sample> and replaced them with straight arrays (I actually had another one somewhere else that I also removed; see the sketch after this list)
  • several minor optimizations I found during analysis and optimization
  • (optional - see below) converted ComplexSample to a struct
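
For reference, the simplified types from above now look roughly like this after the changes:

// Sample as a value type instead of a class...
public struct Sample
{
    public ushort CommandIndex;
    public double ValueChannel1;
    public double ValueChannel2;
}

// ...and plain arrays instead of the LinkedList<Sample>.
public class MeasurementTask
{
    public Sample[] Samples { get; set; }
    public ComplexSample[] ComplexSamples { get; set; }
    public Complex Result { get; set; }
}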

In any case it seems that the problem is gone now on my machine (long-term tests and tests on low-spec hardware will follow), but I first ran a test with both types as structs and got the following memory usage graph:
[graph: structs]
There it was still going up to ~300 MB on a regular basis (but there were no overflow errors anymore). As this still seemed odd to me, I did some additional tests:

Side note: Each value of each ComplexSample is altered at least once during the calculations.

1) Add a GC.Collect after a task is processed and the samples are not referenced any more (a sketch of the placement follows after test 3):
[graph: struct with GC.Collect]
Now it was alternating between 140 MB and 150 MB (no noticeable performance hit).

2) ComplexSample as a class (no GC.Collect):
[graph: class]
Using a class it is much more "stable" at ~140-200 MB.

3) ComplexSample as a class and GC.Collect:
[graph: class with GC.Collect]
Now it is going "up and down" a little in the range of 135-150 MB.
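
(Roughly where the GC.Collect from test 1) was added, in simplified form; ProcessTask is just a placeholder:)

ProcessTask(task);           // fills task.Result
task.Samples = null;         // the Sample/ComplexSample instances are no longer referenced...
task.ComplexSamples = null;
GC.Collect();                // ...so force a full, blocking collection right away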

Current solution:
As we are not sure this is a valid case for manually calling GC.Collect, we are using "solution 2)" for now, and I will start running the long-term (= several hours) and low-spec hardware tests...

Christoph Fink
  • Few things you may do (but keep in mind that locks may block threads, so check for them too). Change the XyzSample classes to structs and reuse lists as much as possible (I assume the input data rate isn't fixed because of the HW buffer). Hide that complexity with a _factory_ class (so you will be free - in future - to change the algorithm to something more sophisticated or to try different solutions). That said, I'd investigate this more: memory allocation in the managed world (especially for small objects) is pretty fast and I see no reason it should block USB input; I have seen apps performing much worse than this... – Adriano Repetti May 27 '15 at 12:02
  • BTW, here I am just guessing because I never tried libusbdotnet, but marshaling also usually has a terrible impact on performance (it may be worth writing the low-level stuff in C++/CLI to handle the unmanaged lib without marshaling and hand out managed objects directly). – Adriano Repetti May 27 '15 at 12:05
  • Note that you have a big waste of memory in Sample: you are using 18 bytes, but the structure will be aligned to 24 bytes... Think about whether it would be possible to "extract" the `CommandIndex`. – xanatos May 27 '15 at 12:13
  • This may be of interest: http://www.grobmeier.de/log4j-2-performance-close-to-insane-20072013.html#.VWW4XTXhk_s – SJuan76 May 27 '15 at 12:29

3 Answers


Can this behaviour be improved (easily)?

Yes (depends on how much you need to improve it).

The first thing I would do is change Sample and ComplexSample to be value types. This reduces the complexity of the object graph the GC has to deal with: the arrays and linked lists are still collected, but they contain the values directly rather than references to separate objects, which simplifies the rest of the collection work.

Then I'd measure performance at this point. The impact of working with relatively large structs is mixed. The guideline that value types should be less than 16 bytes comes from it being around that point where the performance benefits of using a reference type tend to overwhelm the performance benefits of using a value type, but that guideline is only a guideline because "tend to overwhelm" is not the same as "will overwhelm in your application".

After that, if it has either not improved things or not improved them enough, I would consider using a pool of objects, whether for the smaller objects, only the larger objects, or both. This will most certainly increase the complexity of your application, but if it's time-critical then it might well help. (See How do these people avoid creating any garbage? for an example that discusses avoiding normal GC in a time-critical case.)

If you know you'll need a fixed maximum number of a given type, this isn't too hard: create and fill an array of them, dole them out from that array, and return them to it once they are no longer used. It's still hard enough in that you no longer have automatic GC and have to manually "delete" the objects by putting them back in the pool.

If you don't have such knowledge, it gets harder but is still possible.

If it is really vital that you avoid GC, be careful of hidden objects. Adding to most collection types can, for example, result in them moving up to a larger internal store and leaving the earlier store to be collected. Maybe that is fine, in that you've still reduced GC use enough that it is no longer causing the problem you have, but maybe not.
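
As a concrete illustration of that point (List<T> used as the example; the capacity value is arbitrary):

// List<T> grows by allocating a larger internal array and copying into it;
// every such step leaves the old array behind for the GC. Creating the list
// with its final capacity up front avoids those hidden allocations.
var growing  = new List<Sample>();         // internal array: 4 -> 8 -> 16 -> ... (old arrays become garbage)
var presized = new List<Sample>(300000);   // one internal array, never regrown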

Jon Hanna
  • Long-term tests are looking good so far and I will accept your answer as it has the most work in it, but all answers were part of getting to the solution... – Christoph Fink May 29 '15 at 07:34

I've rarely seen a LinkedList<> used in .NET... Have you tried using a List<>? Consider that the basic "element" of a LinkedList<> is a LinkedListNode<>, which is a class... So for each Sample there is the additional overhead of a whole extra object.

Note that if you want to use "big" value types (as suggested by others), the List<> could in turn become slower, because when it grows it allocates a new internal array of double the current size and copies from the old array to the new one. So the bigger the elements, the more memory the List<> has to copy around when it doubles itself.

If you go with a List<> you could try splitting the Sample into

List<ushort> CommandIndex;
List<Sample> ValueChannels;

This is because the doubles of Sample require 8-byte alignment, so as written the Sample is 24 bytes with only 18 bytes used; with CommandIndex extracted, the two remaining doubles pack into exactly 16 bytes.

This wouldn't be a good idea for a LinkedList<>, because the LL has a big per-item overhead.
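
A sketch of that split (ChannelValues and expectedCount are just illustrative names):

// With CommandIndex stored separately, the two remaining doubles pack into
// exactly 16 bytes instead of the padded 24 bytes of the full Sample.
public struct ChannelValues
{
    public double ValueChannel1;
    public double ValueChannel2;
}

List<ushort> commandIndices = new List<ushort>(expectedCount);
List<ChannelValues> valueChannels = new List<ChannelValues>(expectedCount);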

xanatos
  • Each newly arriving sample is added to the end of the list until all samples are received, and there LinkedList<> is MUCH faster as it just appends the new item at the end. But I just talked to a colleague and we think we can use a "simple array", as we know the size beforehand... – Christoph Fink May 27 '15 at 12:18
  • @ChrFin The `List<>` is still amortized O(1), because it grows by doubling in size. While a `LinkedList` is perhaps faster on `Add`, it is much heavier on the GC, because it generates many objects that are chained together (so the GC has to walk the whole chain) – xanatos May 27 '15 at 12:23
  • @ChrFin Algorithmic complexity really isn't the golden standard anymore. I mean it never really was the absolute measure of performance, but especially these days with caches being so much faster than DRAM access (sometimes like the difference between 3 clock cycles and 100), the fastest code often has to play to the cache. While linked list insertion is constant time and list insertion is only amortized constant time, the latter often outperforms the former in practice because it's so cache-friendly. The linear-time complexity of reallocating the array when its capacity is exceeded is more... –  May 27 '15 at 17:13
  • ... than made up by the increased cache hits (ex: if you are constantly doing things with the back of the list, multiple elements at the back of the list are going to fit in a cache line and you'll often just be reading and writing directly to a cache line). When multiple elements fit into a cache line, you can see improvements in multiple orders of magnitude. To appreciate the micro-efficiency side faster, it can really help to make friends with a good profiler, and these days it's getting harder and harder to ignore since micro-optimizations don't necessarily have micro impact. –  May 27 '15 at 17:14
    @Ike But then you have to pit this against the non-locality of the `LinkedListNode<>`, that is an additional object. Even this will break cache lines. – xanatos May 27 '15 at 17:21
  • @xanatos Yeah, that's what I mean. With a linked list we lose locality of reference unless we can combine it with a contiguous allocator. I upvoted your answer since I think a speed boost here is going to come from contiguity (effectively locality) using `List` instead and avoiding GC for the elements it stores. –  May 27 '15 at 17:24

Change Sample and ComplexSample to struct.

Serj-Tm