
I'm writing an XML parser with libxml2. Actually, I finished it, but there is a pretty annoying memory problem. The program first gets some links from my database, and each of those links points to an XML file. I use curl to download them. The process is simple: I download a file, then I parse it, and so on...

The problem seems to appear when a parse is finished. Curl downloads the next file, but the previous XML does not seem to be freed; I guess libxml2 keeps it in RAM. When parsing the last XML, I find myself with ~2.6GB of apparent leak (yeah, some of these files are really big...) and my machine only has 4GB of RAM. It works for the moment, but in the future more links will be added to the database, so I must fix it now.

My code is very basic:

#include <libxml/parser.h>

xmlDocPtr doc = xmlParseFile("data.xml");
if (doc == NULL) {
    /* handle the parse error... */
}

/* code to parse the file... */

xmlFreeDoc(doc);   /* frees the document tree */

I tried using:

xmlCleanupParser();

but the doc says: "It doesn't deallocate any document related memory." (http://xmlsoft.org/html/libxml-parser.html#xmlCleanupParser)

So, my question is: does somebody know how to deallocate all this document-related memory?

Pwet
  • If you're loading huge files into memory, I don't understand why you're so surprised at the huge memory usage. libxml2 is a well respected piece of code used by many important software systems, I very much doubt there is a "huge memory leak" with correct usage of libxml2. – carlosdc May 20 '13 at 22:32
  • How do you know there is a memory leak? Maybe your measurement is flawed... Note that memory statistics may be notably hard to interpret correctly. – rodrigo May 20 '13 at 22:33
  • Try running it under valgrind; it reports where un-freed memory was allocated. – Maxim Egorushkin May 20 '13 at 22:36
  • @carlosdc: I'm not surprised by the memory usage, I just don't get how to free all this useless data. The memory usage only grows when a download is finished (so, just before a parse); after that it's absolutely static. – Pwet May 20 '13 at 22:47
  • @rodrigo: I can see that with htop, nothing is freed after my call to xmlFreeDoc(doc), so I guess I'm missing something there. When a parse is finished, and so when the next download begins, the memory usage stays still. xmlFreeDoc and xmlCleanupParser are the only functions that seem to do anything about freeing memory, but neither of them frees what I want. I read a lot of doc but can't find anything about that. – Pwet May 20 '13 at 23:02
  • As rodrigo pointed out, htop is not suited for the job. Try valgrind. – Remi Gacogne May 21 '13 at 11:27

2 Answers


The problem is that you are looking at the statistics in the wrong way...

When a program starts, it allocates some memory from the OS for the heap. When it calls malloc (or a similar function), the C runtime takes slices from that heap until it runs out; then it asks the OS for more memory, maybe each time in greater blocks. When the program calls free, it marks the freed memory as available for further mallocs, but it does not return the memory to the OS.
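For instance, here is a minimal sketch (assuming glibc, whose malloc_stats() prints allocator statistics to stderr) that makes the effect visible:

#include <malloc.h>   /* glibc-specific: malloc_stats() */
#include <stdlib.h>

int main(void)
{
    enum { N = 1000000 };
    void **p = malloc(N * sizeof *p);
    if (p == NULL)
        return 1;

    /* a million small allocations, roughly what building a big DOM tree does */
    for (int i = 0; i < N; i++)
        p[i] = malloc(24);

    /* free them all: the blocks return to the allocator's free lists,
       but usually not to the OS */
    for (int i = 0; i < N; i++)
        free(p[i]);
    free(p);

    malloc_stats();   /* shows how much memory the allocator still
                         holds from the system */
    return 0;
}

Run it and you will typically see that the allocator still holds the memory even though everything has been freed.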

You may think this behavior is wrong, that the program is leaking, but it is not: the freed memory is accounted for, just not by the OS but by the C library layer of your application. The proof is that the memory for the second XML file does not add to the first one's: growth is only noticeable when a file is the largest parsed so far.

You may also think that if this memory is no longer used by this program, it is just wasted there and cannot be used by other processes. But that's not true: if the memory is not touched for a while and is needed elsewhere, the OS virtual memory manager will swap it out and reuse the physical pages.

So, my guess is that actually you don't have a problem.

PS: What I've just described is not always true. In particular, many C libraries distinguish between small and large memory chunks and allocate them differently.
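For example, with glibc's default mmap threshold of 128 kB (an assumption; the threshold is tunable via mallopt(M_MMAP_THRESHOLD, ...)), a sketch like this behaves very differently for big and small blocks:

#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Big block: above the mmap threshold, glibc serves it with a
       private anonymous mmap(), so free() hands it straight back
       to the OS with munmap(). */
    char *big = malloc(1024 * 1024);

    /* Small block: carved out of the heap; free() only puts it on a
       free list, and the heap normally keeps its size. */
    char *small = malloc(64);

    if (big != NULL)
        memset(big, 0, 1024 * 1024);

    free(big);
    free(small);
    return 0;
}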

rodrigo
  • This is really interesting, I didn't know that. Thank you for the explanation. After several tests with valgrind, I can say you were right. Thank you again :) – Pwet May 21 '13 at 22:29
  • OK, maybe you are right and there are no memory leaks at all. But the fact is that memory usage keeps growing, and finally, when it reaches almost 95% of available system memory, the program crashes. How do I solve this problem? My code is the same as @Pwet's. – SP5RFD Apr 08 '16 at 09:17
  • @crooveck: That usually means one of two things: A) you are not freeing the memory after use (you are leaking), because if you did free it, it would be reused and you would not reach that 95%; or B) you actually need all that memory (you are not leaking), for example because you read an 8GB XML file into memory and you only have 4GB of physical memory plus swap... and that will not work. – rodrigo Apr 08 '16 at 09:21
  • My code is identical to what @Pwet shows, but I run it in a loop. I call xmlFreeDoc() after each use of xmlParseFile(). My XML is only a few kB but is read many, many times. After a few hours my program is killed by the OS. I can confirm, after running Valgrind and Memwatch, that there are no memory leaks, but memory usage still grows to almost 100% of system memory before the program is killed. – SP5RFD Apr 08 '16 at 09:42
  • I'm curious then: why is it that when I alloc 1 GB using malloc(), write all zeros to it, then free it, the memory is returned to the OS, but with libxml it is not? – Rahly Aug 15 '16 at 16:33
  • @Rahly: As I said in my answer, big allocations are usually handled differently. Your single 1GB alloc is probably served by an anonymous memory mapping, so it is released directly to the OS, but libxml will make about a million small allocations, and those will be carved from the heap. – rodrigo Aug 15 '16 at 18:28
  • I've tried... 100 MB... 10 MB... 1 MB... they all seem to go back to the system on a free. – Rahly Aug 15 '16 at 18:49
  • @Rahly: Those are still pretty big allocations. For example, GNU `malloc` by default considers an allocation big if it is greater than 128 kB. Note that the allocations made by `libxml` will likely be just a few dozen bytes each. I recommend trying allocations of 1024 bytes or so. – rodrigo Aug 15 '16 at 19:12
  • I did 10,000 allocations of 10 kB... and 1,000,000 allocations of 1 kB... all of it was returned to the operating system when freed. – Rahly Aug 15 '16 at 20:50
  • @Rahly: dynamic memory is complex and smart. I wrote a test program doing what you do, and true, the memory is returned to the system. However, if I free all but the very last block, then none of it is returned to the OS (a sketch of such a test follows this thread). It looks like GNU `malloc()` tries to shrink the heap and return it, or a part of it, to the OS in some particular cases, such as yours. – rodrigo Aug 15 '16 at 21:37
  • Correct... that would mean that libxml is "holding" onto some memory which is preventing the heap from shrinking... most memory is allocated in at least 4k chunks... even if you ask for 1 byte... – Rahly Aug 15 '16 at 21:48
  • @Rahly: From the point of view of the OS, yes, memory is allocated in pages, normally 4K, but for the heap used by `malloc()` that depends on the implementation (mine uses just 24 bytes, as per `malloc_usable_size(malloc(1))`). About the holding of memory, maybe. A lot of things may be going on... one-time initialization, another library, a weird malloc implementation... – rodrigo Aug 15 '16 at 23:36
  • I'm guessing from the point of view of the application as well... chances are the library is creating memory that it holds onto... but it gets freed when the library is unloaded... and that's why valgrind doesn't see any memory leaks... but this is TERRIBLE for long-running programs... because if you process, say, a 200 MB XML file, – Rahly Aug 15 '16 at 23:40
  • then you've allocated memory for it that you can never get back during normal program operation... sorry, but fork/process/die is a hack, not a real solution. – Rahly Aug 15 '16 at 23:41
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/121006/discussion-between-rodrigo-and-rahly). – rodrigo Aug 15 '16 at 23:42
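For reference, here is a sketch of the experiment described in that thread (hypothetical code; the exact behavior depends on the allocator):

#include <stdlib.h>

int main(void)
{
    enum { N = 1000000 };
    void **p = malloc(N * sizeof *p);
    if (p == NULL)
        return 1;

    for (int i = 0; i < N; i++)
        p[i] = malloc(1024);          /* ~1 GB in small blocks */

    /* Free everything except the very last block: the single live block
       at the top of the heap keeps glibc from shrinking it, so the whole
       ~1 GB typically stays assigned to the process although it is free. */
    for (int i = 0; i < N - 1; i++)
        free(p[i]);

    /* free(p[N - 1]) would allow the heap to shrink again */
    return 0;
}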

Late to the game, but I just found this post today. It could be useful for other readers too.

If you are parsing or generating large documents, you may consider the xmlReader and xmlWriter APIs. They drastically reduce memory usage; in fact, usage stays almost constant no matter how large the input is.

http://xmlsoft.org/html/libxml-xmlreader.html http://xmlsoft.org/html/libxml-xmlwriter.html
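As an illustration, a minimal xmlReader loop might look like this (a sketch; "data.xml" stands for the file from the question, and the loop just prints node names):

#include <libxml/xmlreader.h>
#include <stdio.h>

int main(void)
{
    xmlTextReaderPtr reader = xmlReaderForFile("data.xml", NULL, 0);
    if (reader == NULL)
        return 1;

    /* Pull one node at a time: only a small window of the document
       is ever kept in memory, whatever the file size. */
    while (xmlTextReaderRead(reader) == 1) {
        const xmlChar *name = xmlTextReaderConstName(reader);
        if (name != NULL)
            printf("node: %s\n", name);
    }

    xmlFreeTextReader(reader);
    xmlCleanupParser();
    return 0;
}

Compile against libxml2, e.g. with the flags from `xml2-config --cflags --libs`.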

Pierre