3

I'm working in a memory-constrained environment and use a Python script with the tarfile library (http://docs.python.org/2/library/tarfile.html) to continuously back up log files.

As the number of log files has grown (~74,000), I've noticed that the system now effectively kills this backup process when it runs. It consumes an awful lot of memory (~192 MB before it gets killed by the OS).

I can make a gzipped tar archive ($ tar -czf) of the same log files without any problem or high memory usage.

Code:

import tarfile
t = tarfile.open('asdf.tar.gz', 'w:gz')
t.add('asdf')
t.close()

The dir "asdf" consists of 74407 files with filenames of length 73. Is it not recommended to use Python's tarfile when you have a huge amount of files ?

I'm running Ubuntu 12.04.3 LTS and Python 2.7.3 (the tarfile module version seems to be "$Revision: 85213 $").

Niklas9
  • We have no clue how you're using it. – Ignacio Vazquez-Abrams Jan 10 '14 at 09:05
  • AFAIK `tarfile` is a pure-python module, so there's no surprise that it *might* consume quite a bit more memory than the `tar` command. – Bakuriu Jan 10 '14 at 09:06
  • 1
    Could you show us your code? There may be a number of reasons why this is happening; according to the documentation, the TarFile class processes its data in blocks of ~(20 * 512) bytes when opened in stream mode. Do you have yours open for random access instead? (http://docs.python.org/2/library/tarfile.html) – Brett Lempereur Jan 10 '14 at 10:00
  • You might indeed fare better by using the binary `tar` instead of the Python tarfile module in your case. – Alfe Jan 10 '14 at 10:47
  • @IgnacioVazquez-Abrams I updated the question with some code, but it's just basic standard code really.. added some specs with the number of files and the filename length.. if that matters.. – Niklas9 Jan 10 '14 at 15:18
  • Have you tried walking the tree yourself instead? – Ignacio Vazquez-Abrams Jan 10 '14 at 15:20
  • @IgnacioVazquez-Abrams what do you mean by "walking the tree"? – Niklas9 Jan 10 '14 at 15:22

2 Answers

4

I did some digging in the source code, and it seems that tarfile stores every archived file in a list of TarInfo objects (http://docs.python.org/2/library/tarfile.html#tarfile.TarFile.getmembers), which causes the ever-increasing memory footprint with this many (and this long) file names.
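
You can watch this cache grow yourself. A small illustration of my own (using a hypothetical small test directory, not the original data):

import tarfile

t = tarfile.open('test.tar.gz', 'w:gz')
t.add('some_small_dir')          # hypothetical test directory
print(len(t.getmembers()))       # one cached TarInfo per archived file/dir
t.close()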

The caching of these TarInfo objects seems to have been optimized significantly in a commit from 2008, http://bugs.python.org/issue2058, but from what I can see it was only merged into the py3k branch, i.e. for Python 3.

One could reset the members list again and again, as in http://blogs.it.ox.ac.uk/inapickle/2011/06/20/high-memory-usage-when-using-pythons-tarfile-module/, but I'm not sure what internal tarfile functionality one would miss by doing that, so I went with a system-level call instead: os.system('tar -czf asdf.tar asdf/').
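
For reference, a rough sketch of what that reset approach could look like on the write side; the os.walk loop and recursive=False are my own additions, not something from the blog post, and I haven't benchmarked it:

import os
import tarfile

t = tarfile.open('asdf.tar.gz', 'w:gz')
t.add('asdf', recursive=False)               # top-level directory entry only
for dirpath, dirnames, filenames in os.walk('asdf'):
    for name in dirnames + filenames:
        t.add(os.path.join(dirpath, name), recursive=False)
        t.members = []                       # drop the cached TarInfo objects
t.close()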

Niklas9
  • 1
    https://web.archive.org/web/20160714075947/http://blogs.it.ox.ac.uk/inapickle/2011/06/20/high-memory-usage-when-using-pythons-tarfile-module/ In short: `tar = tarfile.open(filename, 'r:gz'); [... code ...]; tar.members = [];` – mxmlnkn Nov 16 '19 at 18:25
0

Two ways to solve this: if your VM does not have swap, add some and try again. I had 13 GB of files to be tarred into one big bundle and it was consistently failing; the OS killed the process. Adding 4 GB of swap helped.

If you are using a Kubernetes pod or a Docker container, one quick workaround could be to add swap on the host; with the SYS_ADMIN capability or privileged mode the container will use the host's swap.

If you need tarfile with streaming to avoid high memory usage, check out: https://gist.github.com/leth/6adb9d30f2fdcb8802532a87dfbeff77
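
Rough idea of what that stream interface looks like; this is my own minimal sketch, not the gist's code, and 'asdf/example.log' is just a placeholder path:

import tarfile

with open('logs.tar.gz', 'wb') as out:
    # 'w|gz' writes the archive sequentially in fixed-size blocks; fileobj can
    # also be a non-seekable target such as a pipe. The TarInfo cache
    # (tar.members) still grows, so clear it as in the other answer if needed.
    tar = tarfile.open(fileobj=out, mode='w|gz')
    tar.add('asdf/example.log')
    tar.close()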

TheFixer