Disable Python file caching when reading a large number of files or LMDB

Question

My code runs on CentOS 6.6 on a cluster node with 100GB of memory. However, this still does not seem to be enough, because my code needs to read 1000+ hickle files (200MB each), which is 240GB in total. While the code is running, the system memory cache keeps growing until it is full, and performance becomes very slow when allocating new objects and doing numpy array calculations.

I tried gc.collect() and del to prevent any memory leaks, but memory usage keeps increasing. I suspect this is due to file caching. Is there a function in Python's sys or os library that can disable file system caching when reading a large number (1000) of large hickle files (200MB each) or a single lmdb file (240GB)? I don't actually need those files to stay cached once they have been read.

Did you ever solve this problem? I am very interested in this topic. In my case I load `.json` files in parallel. `gc.collect()` seems to improve things, but at this stage I am not completely sure. I am just wondering why the memory is not freed once the loaded file goes out of scope. This might be helpful: http://stackoverflow.com/a/5071376/4773274 - holzkohlengrill, Sep 16 '16 at 09:24

score 0 · Answer 1 · answered Sep 20 '15 at 02:35

Since Python (CPython) uses reference counting, most objects are freed as soon as the last reference to them is deleted.

The only good thing the automatic garbage collector does for you is to collect and free deleted objects that have circular references -- for example if you have objects that refer to themselves or that mutually refer to each other:

>>> class Foo(object): pass
... 
>>> x, y = Foo(), Foo()
>>> x.y = y
>>> y.x = x

If you never write code that creates such references, or if you create them, but then manually break them, all the garbage collector does is to waste your CPU time, trying to find things to collect that aren't there. If you have a lot of small objects, this will really slow your system down.
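The difference between the two cases can be checked with weak references. The following is only a sketch, assuming CPython's reference-counting behaviour; the helper names (make_single, make_cycle, make_broken_cycle) are made up for illustration.

```python
import gc
import weakref


class Foo(object):
    pass


def make_single():
    """Create one object and return only a weak reference to it."""
    obj = Foo()
    return weakref.ref(obj)


def make_cycle():
    """Create two objects that refer to each other; return a weak reference."""
    x, y = Foo(), Foo()
    x.y = y
    y.x = x
    return weakref.ref(x)


def make_broken_cycle():
    """Create the same cycle, then break it by hand before returning."""
    x, y = Foo(), Foo()
    x.y = y
    y.x = x
    ref = weakref.ref(x)
    del x.y  # x no longer points at y, so the cycle is gone
    return ref


# Pause the automatic collector so that only explicit calls can collect cycles.
gc.disable()
try:
    single = make_single()
    # No cycle: reference counting frees the object when the function returns.
    single_freed_without_gc = single() is None

    broken = make_broken_cycle()
    # Cycle broken by hand: reference counting alone is enough again.
    broken_freed_without_gc = broken() is None

    cycle = make_cycle()
    # Real cycle: the objects keep each other alive after the function returns.
    cycle_alive_before_collect = cycle() is not None

    gc.collect()
    # Only the cyclic collector can reclaim them.
    cycle_freed_after_collect = cycle() is None
finally:
    gc.enable()
```

If the loaded data behaves like the first two cases, calling gc.collect() should not be expected to release anything that del has not already released.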

I don't know about your hickle files -- sometimes it is necessary to create circular references when reading or writing such things, but for a lot of code, the best thing to do with the garbage collector is to completely shut it off using gc.disable().
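If switching the collector off for the whole program feels too drastic, it can be paused only around the loading loop. This is a rough sketch, not a tested recipe for hickle: gc_paused and total_size are hypothetical helpers, and load_one stands for whatever callable actually reads one file.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Temporarily switch off the cyclic collector, restoring its prior state."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


def total_size(paths, load_one):
    """Load each path in turn with the caller-supplied load_one (hypothetical)
    and keep only a small summary, so each loaded object can be freed at once."""
    total = 0
    with gc_paused():
        for path in paths:
            data = load_one(path)
            total += len(data)
            del data  # drop the last reference before loading the next file
    return total
```

Note that this only affects Python's own cyclic collector. It does not change how the operating system caches file contents.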
