I have a .tar.gz archive that I would like to read with Python 2.7. The archive contains multiple h5-formatted files as well as a few text files. I'm a novice with Python. Here is the code I'm trying to adapt:

      subset_path='c:\data\grant\files'
      f=gzip.open(filename,'subset_full.tar.gz')
      subset_data_path=os.path.join(subset_path,'f')

The first statement identifies the path to the folder with the data. The second statement tells Python to open a specific compressed file and the third statement (hopefully) executes a join of the prior two statements.

Several lines below this code I get an error when Python tries to use the 'subset_data_path' assignment.

What's going on?

DJohnson
  • "Several lines below this code I get an error when Python tries to use the 'subset_data_path' assignment." - which error? – Ami Tavory May 24 '15 at 12:45

1 Answer

The gzip module only handles a single compressed file, e.g. my_file.gz. You have a tar archive of multiple files that has then been gzip-compressed, so it needs to be both decompressed and untarred.

Try using the tarfile module instead, see https://docs.python.org/2/library/tarfile.html#examples
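As a rough illustration of what tarfile offers, here is a minimal sketch that lists the members of a .tar.gz and reads one text member without extracting anything to disk. The archive name demo.tar.gz, the member demo/readme.txt, and the helper function names are made up for the example; the script builds its own tiny archive so it can run anywhere.

```python
import io
import os
import tarfile
import tempfile


def make_demo_archive(folder):
    """Create a tiny .tar.gz with one text member (hypothetical stand-in data)."""
    archive_path = os.path.join(folder, "demo.tar.gz")
    payload = b"hello from inside the archive\n"
    info = tarfile.TarInfo(name="demo/readme.txt")
    info.size = len(payload)
    tar = tarfile.open(archive_path, "w:gz")
    try:
        tar.addfile(info, io.BytesIO(payload))
    finally:
        tar.close()
    return archive_path


def list_members(archive_path):
    """Return the member names stored in the archive."""
    tar = tarfile.open(archive_path, "r:gz")
    try:
        return tar.getnames()
    finally:
        tar.close()


def read_member(archive_path, member_name):
    """Return the bytes of one member without writing it to disk."""
    tar = tarfile.open(archive_path, "r:gz")
    try:
        handle = tar.extractfile(member_name)
        return handle.read()
    finally:
        tar.close()


work_dir = tempfile.mkdtemp()
archive = make_demo_archive(work_dir)
names = list_members(archive)
text = read_member(archive, "demo/readme.txt")
```

Listing the names first is a cheap way to see which folder, if any, the archive will create when it is extracted.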

Edit: To add a bit more information on what has happened: gzip.open gives you a gzip file object for the compressed tarball, which works almost the same as a standard file object. For instance, you could call f.readlines() as if f were a normal file object, and it would return the decompressed data. That data is the raw tar stream, though, not the individual files inside the archive.

However, this did not actually unpack the archive into new files on the filesystem. Note also that os.path.join(subset_path,'f') joins the literal string 'f', not the file object f. No subdirectory 'c:\data\grant\files\f' was ever created, so when you try to use subset_data_path you are looking for a directory that does not exist.
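The point can be checked with a small sketch. It builds a throwaway archive (demo.tar.gz and demo/readme.txt are invented names), opens it with gzip, and shows that the decompressed bytes begin with a tar header rather than file contents, and that nothing new appears on disk.

```python
import gzip
import io
import os
import tarfile
import tempfile

work_dir = tempfile.mkdtemp()
archive_path = os.path.join(work_dir, "demo.tar.gz")

# Build a small stand-in archive with one text member.
payload = b"hello\n"
info = tarfile.TarInfo(name="demo/readme.txt")
info.size = len(payload)
tar = tarfile.open(archive_path, "w:gz")
tar.addfile(info, io.BytesIO(payload))
tar.close()

# gzip only removes the compression layer; what comes out is the tar stream.
f = gzip.open(archive_path, "rb")
header = f.read(512)  # a tar header block is 512 bytes long
f.close()
starts_with_member_name = header.startswith(b"demo/readme.txt")
has_tar_magic = header[257:262] == b"ustar"

# Joining the literal string 'f' just appends a path component named "f".
joined = os.path.join(work_dir, 'f')
joined_exists = os.path.exists(joined)

# Nothing was unpacked: the archive is still the only entry in the folder.
files_on_disk = sorted(os.listdir(work_dir))
```

After this runs, the folder still holds only the archive itself, which is why a path built from it points at nothing.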

The following ought to work:

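import os
import tarfile

# Raw string, so that "\f" in "\files" is not read as a form-feed escape
subset_path = r'c:\data\grant\files'
# A relative archive name is resolved against the current working directory
tar = tarfile.open("subset_full.tar.gz")
tar.extractall(subset_path)
tar.close()
# Assumes the archive unpacks into a top-level folder named subset_full
subset_data_path = os.path.join(subset_path, 'subset_full')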
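One detail worth checking with Windows paths written as ordinary string literals: "\f" is the escape sequence for a form feed, so a folder name beginning with "f" after a backslash can be silently altered. The sketch below compares an ordinary literal with a raw string; the variable names are only for illustration.

```python
# "\f" inside an ordinary string literal is the form-feed character (0x0c).
plain = 'c:\\data\\grant\files'   # only the last backslash is left unescaped
raw = r'c:\data\grant\files'      # a raw string keeps every backslash literally

has_form_feed = '\x0c' in plain       # the "\f" became a control character
raw_has_form_feed = '\x0c' in raw     # the raw string is unaffected
last_component = raw.split('\\')[-1]  # final folder name in the raw path
```

Using a raw string, doubled backslashes, or forward slashes avoids the problem.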
Eric Appelt
  • I should add that your code worked on a line by line basis as opposed to running a script...not sure why that is the case but at least it worked! – DJohnson May 25 '15 at 14:30