
I have a program which analyzes 150,000 files. Valgrind reports no memory leaks, but the program slows down over time.

Some of the problems were related to using std::string too often and to mktime taking too much time (see C++ slows over time reading 70,000 files).

But it still slows down over time. Lotharyx suggested that container usage is causing heap fragmentation.

I read the various flow charts on the pros and cons of the different STL containers, but I didn't quite get it.

In the pseudo code below, I'm not sure I've made the right choices to avoid heap fragmentation.

fileList.clear();

// Scan all disks and rebuild fileList, a std::set of file paths
// matching a pattern.  File paths are named by date,
// e.g. 20160530.051000, so they are intrinsically ordered.

for (const std::string &filePath : fileList)
{
    if (alreadyHaveFileDetails(filePath))
        continue;

    // Otherwise collect file details (size, contents, mod time, ...).
    // collectFileDetails() and dateOf() stand in for the real code.
    FileInfoStruct fileInfoStruct = collectFileDetails(filePath);

    fileInfoMap[dateOf(filePath)] = fileInfoStruct;  // keyed by time_t date
}

// fileInfoMap is an ordered collection of information structs
// for about 100,000 files.

// Traverse the map in key order.
for (const auto &entry : fileInfoMap)
{
    const FileInfoStruct &fileInfo = entry.second;
    if (meetsCondition(fileInfo))
    {
        TEventInfo event = makeEventInfo(fileInfo);
        eventList.push_back(event);
    }
}

And the above sequence repeats forever.

So for choice of containers, I've used (or need):

fileList -- a list of unique strings holding 150,000 pathnames.
I chose std::set because it automatically handles duplicates and maintains sort order. No random access; I only add entries, keep them sorted (manually or automatically), and iterate over them.

fileInfoMap -- a collection of structures keyed by a time_t timestamp corresponding to the date of the file. I chose std::map. It too has 150,000 entries, so it occupies a lot of memory. No random access; entries are only added at one end, but I must iterate over them and, if necessary, delete entries from the middle.

eventList -- a small list of "event" structures, say 50 items. I chose std::vector, though I'm not sure why, really. No random access; I only add entries to one end and later iterate over the collection.
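
Concretely, the declarations look roughly like this (struct bodies elided; the names match the pseudocode above):

#include <ctime>
#include <map>
#include <set>
#include <string>
#include <vector>

struct FileInfoStruct { /* size, contents, mod time, ... */ };
struct TEventInfo     { /* ... */ };

std::set<std::string> fileList;                // ~150,000 unique, sorted paths
std::map<time_t, FileInfoStruct> fileInfoMap;  // keyed by the file's date
std::vector<TEventInfo> eventList;             // ~50 events per pass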

I'm fairly new to C++. Thanks for your consideration.

– Danny
1 Answer


Regarding memory management, containers belong to two large families: those that allocate all elements together, and those that allocate each element separately.

vector and deque belong to the first family; list, set, and map to the second.

Memory fragmentation arises when elements are continuously added to and removed from a container that does not support global relocation.

One way to avoid the problem is to use a vector, calling "reserve" to anticipate the memory need (reducing reallocations) and keeping the data sorted upon insertion.
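
For example, here is a minimal sketch of that approach, replacing a node-based std::map<time_t, FileInfoStruct> with a sorted vector (all names are illustrative, not taken from the question's real code):

#include <algorithm>
#include <ctime>
#include <utility>
#include <vector>

struct FileInfoStruct { /* size, contents, mod time, ... */ };

// One contiguous allocation instead of one heap node per entry.
std::vector<std::pair<time_t, FileInfoStruct>> fileInfoVec;

void insertSorted(time_t date, const FileInfoStruct &info)
{
    // Binary-search for the insertion point so the vector stays sorted by key.
    auto pos = std::lower_bound(
        fileInfoVec.begin(), fileInfoVec.end(), date,
        [](const std::pair<time_t, FileInfoStruct> &e, time_t key)
        { return e.first < key; });
    if (pos != fileInfoVec.end() && pos->first == date)
        pos->second = info;                  // overwrite, like map's operator[]
    else
        fileInfoVec.insert(pos, std::make_pair(date, info));
}

int main()
{
    fileInfoVec.reserve(150000);  // anticipate the working-set size up front
    // ... call insertSorted() per file, then iterate fileInfoVec in order ...
}

Since the question says entries arrive essentially in date order, the binary search usually lands at the back, so the cost of shifting elements on insertion rarely applies; erasing from the middle shifts elements but does not fragment the heap.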

Another way is to use "link-based containers" (like list, set, etc.), providing them with an allocator that allocates memory from larger chunks, recycling the chunks instead of calling raw malloc/free for every single element insert/remove.

Take a look at std::allocator.

You can write an allocator by deriving from std::allocator and overriding the allocate/deallocate functions with the required logic, then passing your allocator as the optional template parameter of the container you would like to use.
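
As a rough sketch of the chunk idea (written standalone rather than derived from std::allocator, since the containers accept any minimal allocator through std::allocator_traits; all names here are invented for illustration, and deallocate is deliberately a no-op, so a production version would add a free list to recycle erased nodes):

#include <cstddef>
#include <ctime>
#include <functional>
#include <map>
#include <memory>
#include <vector>

// Hands out node-sized pieces from large chunks; deallocate() is a no-op
// and all chunks are freed together when the last copy of the allocator
// is destroyed, so per-node malloc/free never happens.
struct Arena
{
    static const std::size_t kChunkSize = 1 << 20;  // 1 MiB chunks (assumption)
    std::vector<std::unique_ptr<char[]>> chunks;
    std::size_t used = kChunkSize;                  // forces the first chunk

    void *allocate(std::size_t n, std::size_t align)
    {
        used = (used + align - 1) & ~(align - 1);   // align the bump pointer
        if (used + n > kChunkSize)                  // assumes n << kChunkSize
        {
            chunks.emplace_back(new char[kChunkSize]);
            used = 0;
        }
        void *p = chunks.back().get() + used;
        used += n;
        return p;
    }
};

template <class T>
struct ArenaAllocator
{
    using value_type = T;
    std::shared_ptr<Arena> arena;

    ArenaAllocator() : arena(std::make_shared<Arena>()) {}
    template <class U>
    ArenaAllocator(const ArenaAllocator<U> &other) : arena(other.arena) {}

    T *allocate(std::size_t n)
    { return static_cast<T *>(arena->allocate(n * sizeof(T), alignof(T))); }
    void deallocate(T *, std::size_t) {}            // no-op by design

    template <class U> bool operator==(const ArenaAllocator<U> &o) const
    { return arena == o.arena; }
    template <class U> bool operator!=(const ArenaAllocator<U> &o) const
    { return !(*this == o); }
};

// Usage: plug it in as the map's fourth template parameter.
struct FileInfoStruct { /* ... */ };
using FileInfoMap =
    std::map<time_t, FileInfoStruct, std::less<time_t>,
             ArenaAllocator<std::pair<const time_t, FileInfoStruct>>>;

The trade-off: because deallocate does nothing, erasing map entries does not return memory until the arena itself is destroyed, so for a loop that runs forever you would periodically rebuild the map (dropping its arena) or extend the allocator with a free list.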

– Emilio Garavaglia
  • Agreed with the custom allocator suggestion. – Alan Jul 04 '16 at 17:08
  • `deque` does not allocate all its elements together. – Nicol Bolas Jul 04 '16 at 17:43
  • @NicolBolas: yes, it is somewhat half-way, but better than list or set. I put it together with vector because it does not allocate on a strict one-allocation-per-element basis. – Emilio Garavaglia Jul 04 '16 at 18:41
  • Thanks for the suggestions. What do you mean "containers not supporting global relocation"? Google comes up with nothing on that term. Which containers do support it? I'm running more tests but suspect the std::map fileInfoMap is the problem... – Danny Jul 05 '16 at 06:48
  • @Danny: a vector has all its elements allocated in the same block of memory. If that block turns out to be too small, a wider block is allocated, the elements are moved, and the old block is released. list, map, set, etc. are based on pointer-linked nodes (forming doubly linked lists or red-black trees), each containing one element (or key-value pair) and allocated individually. The job of making those nodes live inside the same block is not done by the container, nor by the default std::allocator (which just does a malloc/free per node). – Emilio Garavaglia Jul 05 '16 at 13:03