3

I'm working in a memory-constrained environment and use a Python script with the tarfile library (http://docs.python.org/2/library/tarfile.html) to continuously back up log files.

As the number of log files has grown (~74,000), I've noticed that the system now effectively kills this backup process when it runs. It consumes an awful lot of memory (~192 MB before it gets killed by the OS).

I can make a gzipped tar archive ($ tar -czf) of the same log files without any problem or high memory usage.

Code:

import tarfile
t = tarfile.open('asdf.tar.gz', 'w:gz')
t.add('asdf')
t.close()

The dir "asdf" consists of 74407 files with filenames of length 73. Is it not recommended to use Python's tarfile when you have a huge amount of files ?

I'm running Ubuntu 12.04.3 LTS and Python 2.7.3 (the tarfile module version seems to be "$Revision: 85213 $").

Niklas9
  • We have no clue how you're using it. – Ignacio Vazquez-Abrams Jan 10 '14 at 09:05
  • AFAIK `tarfile` is a pure-python module, so there's no surprise that it *might* consume quite a bit more memory than the `tar` command. – Bakuriu Jan 10 '14 at 09:06
  • 1
    Could you show us your code? There may be a number of reasons why this is happening; according to the documentation, the TarFile class processes its data in blocks of ~(20 * 512) bytes when opened in stream mode. Do you have yours open for random access instead? (http://docs.python.org/2/library/tarfile.html) – Brett Lempereur Jan 10 '14 at 10:00
  • You might indeed fare better by using the binary `tar` instead of the Python tarfile module in your case. – Alfe Jan 10 '14 at 10:47
  • @IgnacioVazquez-Abrams I updated the question with some code, but it's just basic standard code really.. added some specs with the number of files and the filename length.. if that matters.. – Niklas9 Jan 10 '14 at 15:18
  • Have you tried walking the tree yourself instead? – Ignacio Vazquez-Abrams Jan 10 '14 at 15:20
  • @IgnacioVazquez-Abrams what do you mean by "walking the tree"? – Niklas9 Jan 10 '14 at 15:22

2 Answers

4

I did some digging in the source code, and it seems that tarfile stores every archived file in a list of TarInfo objects (http://docs.python.org/2/library/tarfile.html#tarfile.TarFile.getmembers), which causes the ever-increasing memory footprint with this many (and this long) file names.
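
You can watch this cache grow yourself. A small illustration of my own (using a hypothetical small test directory, not the original data):

import tarfile

t = tarfile.open('test.tar.gz', 'w:gz')
t.add('some_small_dir')          # hypothetical test directory
print(len(t.getmembers()))       # one cached TarInfo per archived file/dir
t.close()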

The caching of these TarInfo objects seems to have been optimized significantly in a commit from 2008, http://bugs.python.org/issue2058, but from what I can see it was only merged into the py3k branch, i.e. for Python 3.

One could reset the members list again and again, as in http://blogs.it.ox.ac.uk/inapickle/2011/06/20/high-memory-usage-when-using-pythons-tarfile-module/, but I'm not sure what internal tarfile functionality one would miss by doing that, so I went with a system-level call instead: os.system('tar -czf asdf.tar asdf/').
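
For reference, a rough sketch of what that reset approach could look like on the write side; the os.walk loop and recursive=False are my own additions, not something from the blog post, and I haven't benchmarked it:

import os
import tarfile

t = tarfile.open('asdf.tar.gz', 'w:gz')
t.add('asdf', recursive=False)               # top-level directory entry only
for dirpath, dirnames, filenames in os.walk('asdf'):
    for name in dirnames + filenames:
        t.add(os.path.join(dirpath, name), recursive=False)
        t.members = []                       # drop the cached TarInfo objects
t.close()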

Niklas9
  • 1
    https://web.archive.org/web/20160714075947/http://blogs.it.ox.ac.uk/inapickle/2011/06/20/high-memory-usage-when-using-pythons-tarfile-module/ In short: `tar = tarfile.open(filename, 'r:gz'); [... code ...]; tar.members = [];` – mxmlnkn Nov 16 '19 at 18:25
0

Two ways to solve this: if your VM does not have swap, add some and try again. I had 13 GB of files to be tarred into one big bundle and it was consistently failing; the OS killed the process. Adding 4 GB of swap helped.

If you are using a Kubernetes pod or a Docker container, one quick workaround could be to add swap on the host; with the SYS_ADMIN capability or privileged mode the container will use the host's swap.

If you need tarfile with streaming to avoid high memory usage, check out: https://gist.github.com/leth/6adb9d30f2fdcb8802532a87dfbeff77
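
Rough idea of what that stream interface looks like; this is my own minimal sketch, not the gist's code, and 'asdf/example.log' is just a placeholder path:

import tarfile

with open('logs.tar.gz', 'wb') as out:
    # 'w|gz' writes the archive sequentially in fixed-size blocks; fileobj can
    # also be a non-seekable target such as a pipe. The TarInfo cache
    # (tar.members) still grows, so clear it as in the other answer if needed.
    tar = tarfile.open(fileobj=out, mode='w|gz')
    tar.add('asdf/example.log')
    tar.close()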

TheFixer