Given a tarball containing multiple directories, how do I extract just a single, specific directory?

import tarfile  
tar = tarfile.open("/path/to/tarfile.tar.gz")  
tar.list()

... rootdir/subdir_1/file_1.ext
... rootdir/subdir_1/file_n.ext
... rootdir/subdir_2/file_1.ext
etc.

How would I extract just the files from subdir_2?

NOTE: The entire operation is being done in memory a la...

import tarfile, urllib2, StringIO  
data = urllib2.urlopen(url)  
tar = tarfile.open(mode = 'r|*', fileobj = StringIO.StringIO(data.read()))  

... so it's not feasible to extract all to disk and move the necessary folder.

Josh Whittington

1 Answer

You seem to be almost there - I think you can just use the contents of getnames() and combine it with extractfile() to process the files in memory, e.g.:

# Names of the archive members under the directory of interest
files = (name for name in tar.getnames() if name.startswith('rootdir/subdir_2/'))
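
To make the getnames()/extractfile() idea concrete, here is a minimal sketch in Python 3 (the question uses Python 2). It assumes the archive was opened in a seekable mode such as 'r:*' rather than the stream mode 'r|*'. The helper names read_directory and build_sample_archive are hypothetical, and the sample archive only mimics the layout from the question.

```python
import io
import tarfile


def read_directory(tar, prefix):
    """Return {member name: bytes} for the regular files stored under `prefix`."""
    contents = {}
    for member in tar.getmembers():
        # Skip directories, links and anything outside the wanted directory.
        if member.isfile() and member.name.startswith(prefix):
            contents[member.name] = tar.extractfile(member).read()
    return contents


def build_sample_archive():
    """Build a small gzip-compressed tarball in memory, for demonstration only."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in ("rootdir/subdir_1/file_1.ext", "rootdir/subdir_2/file_1.ext"):
            payload = name.encode("utf-8")  # each file just contains its own name
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


sample = build_sample_archive()
# "r:*" gives random access with transparent compression detection.
with tarfile.open(fileobj=io.BytesIO(sample), mode="r:*") as tar:
    subdir_2 = read_directory(tar, "rootdir/subdir_2/")
```

Using getmembers() instead of getnames() hands extractfile() the TarInfo objects directly and lets the filter skip directory entries, for which extractfile() returns None.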
Ari
  • How would you suggest saving the file objects returned by `tarfile.extractfile()`? I can't seem to find an appropriate method; is `pickle`/`cPickle` the right way to go? Or is there a better way? – Josh Whittington Jan 06 '14 at 00:21
  • Assuming the files are relatively small, you should be able to call .read() on the extracted file object and write the contents to a regular python file object that has been opened in write mode. – Ari Jan 06 '14 at 03:03
  • Ari - `for f in tarball.getnames():` `if f.startswith(package_name):` `open(package_name, 'w').write(tarball.extractfile(f).read())` returns an error: `tarfile.StreamError: seeking backwards is not allowed.` I run into this no matter how many variations I try. It seems like I'm running into a limitation of trying to deal with a stream of data using TarFile. – Josh Whittington Jan 06 '14 at 05:19
  • Got it. Ended up using BytesIO to construct an in-memory tempfile so I could seek back and forth. Looks like `tarfile.getnames()`/`tarfile.getmembers()` reads through the whole file due to the `header`/`data`/`header`/`data` nature of tarballs. (http://stackoverflow.com/questions/18623842/read-contents-tarfile-into-python-seeking-backwards-is-not-allowed) Thanks for your help. – Josh Whittington Jan 06 '14 at 05:46
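
For anyone landing here later, the fix described in the last comment might look roughly like the sketch below (Python 3). It wraps the downloaded bytes in io.BytesIO and opens them with 'r:*' instead of the stream mode 'r|*', since stream mode is what forbids seeking backwards once getnames()/getmembers() has read through the archive. The function extract_directory, the helper make_sample_tarball and the sample contents are all hypothetical; in the original scenario raw_bytes would be the result of data.read().

```python
import io
import os
import tarfile
import tempfile


def extract_directory(raw_bytes, prefix, dest):
    """Write the regular files stored under `prefix` into `dest`; return the paths."""
    written = []
    # "r:*" (random access) rather than "r|*" (stream) lets tarfile seek backwards.
    with tarfile.open(fileobj=io.BytesIO(raw_bytes), mode="r:*") as tar:
        for member in tar.getmembers():
            if not (member.isfile() and member.name.startswith(prefix)):
                continue
            # Keep only the base name so archive paths cannot escape `dest`.
            target = os.path.join(dest, os.path.basename(member.name))
            with open(target, "wb") as out:
                out.write(tar.extractfile(member).read())
            written.append(target)
    return written


def make_sample_tarball():
    """Stand-in for the downloaded archive: a tiny .tar.gz built in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in (
            ("rootdir/subdir_1/file_1.ext", b"one"),
            ("rootdir/subdir_2/file_1.ext", b"two"),
        ):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


sample_bytes = make_sample_tarball()
demo_dir = tempfile.mkdtemp()
written_paths = extract_directory(sample_bytes, "rootdir/subdir_2/", demo_dir)
```

Note that this flattens the directory: files are written by base name only, so nested subdirectories under the prefix would collide. That keeps the sketch short and avoids writing outside dest, but it is a simplification.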