I need to extract .tex files from multiple .gz files that are inside a single .tar file. I wrote some code that does this successfully, but I am unzipping the .tar and every .gz file. Is there a way to avoid doing so much unzipping? I would like to navigate straight to the .tex files and only extract these.

import os
import tarfile

def extractFile(filename):
    tar = tarfile.open(filename)
    for item in tar:
        # Extract from .tar into 'temp' subfolder only if .gz
        if item.name.endswith('.gz'):
            item.name = os.path.basename(item.name) # reset path to remove parent directories like '0001'
            if not os.path.isdir('temp'):
                os.makedirs('temp')
            tar.extract(item, path='temp')
            # Extract from .gz into 'temp' subfolder only if .tex
            try: 
                gz = tarfile.open('temp/' + item.name, mode='r:gz')
                for file in gz:
                    if file.name.endswith('.tex'):
                        gz.extract(file, path='latex')
            except tarfile.ReadError:
                # Move to 'error' folder, ensuring it exists
                if not os.path.isdir('error'):
                    os.makedirs('error')
                os.rename('temp/' + item.name, 'error/' + item.name)
Ry-
brienna
  • There might not be. The tar format doesn’t have a list of files – you just have to keep reading through it to find what you’re looking for, and it looks like you’re already doing that. – Ry- Mar 04 '18 at 06:09
  • There's a getnames() function in the tarfile module. I tried playing around with this, but was unable to actually extract a file - but I could see the data within. So this made me think it might be possible to retrieve the file without unzipping everything, or at least saving everything. – brienna Mar 04 '18 at 06:11
  • You aren’t unzipping everything – `for item in tar` only loops over file info, same as `getnames`. Looping over that info still means moving through the whole file, though, because of how tar works. – Ry- Mar 04 '18 at 06:21
  • I don't want to save any individual .gz files to my computer. But I seem to have to, if I want to be able to access their contents. This is what I mean by unzipping. The actual file has to be extracted if I want to dig further into it, instead of being able to "hold" it while I dig into it. – brienna Mar 04 '18 at 06:24
  • Oh, okay. You can probably pass a `fileobj=…` into [`tarfile.open`](https://docs.python.org/3/library/tarfile.html#tarfile.open) instead of a filename, then, with [`extractfile`](https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.extractfile) instead of `extract`. – Ry- Mar 04 '18 at 06:28
  • @Ryan: yes a fileobj that comes from `Gzip` object so decompression is transparent. But it seems that OP is doing even better. I don't understand what could be improved (except from dropping this ancient tar.gz format) – Jean-François Fabre Mar 04 '18 at 06:38
  • @Jean-FrançoisFabre: The fileobj would be the replacement for `temp` as requested. The part where gzip comes in doesn’t change. I don’t understand your comment. – Ry- Mar 04 '18 at 06:56
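
The suggestion in the comments above (pass `fileobj=` to `tarfile.open` and use `extractfile` instead of `extract`) can be sketched roughly as follows. This is a minimal illustration, not the asker's code: the helper names `make_sample_tar` and `read_tex_in_memory`, the file names, and the sample contents are all made up for the demo, and the sample archive is built in memory so the snippet runs on its own.

```python
import io
import tarfile


def make_sample_tar():
    """Build a small .tar in memory holding one nested .tar.gz (illustrative data only)."""
    # Inner archive: a gzip-compressed tar with one .tex file and one other file
    inner_buf = io.BytesIO()
    with tarfile.open(fileobj=inner_buf, mode='w:gz') as inner:
        for name, data in [('paper.tex', b'\\documentclass{article}'),
                           ('fig.png', b'not a real image')]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            inner.addfile(info, io.BytesIO(data))
    inner_bytes = inner_buf.getvalue()

    # Outer archive: an uncompressed tar holding the .gz under a parent directory
    outer_buf = io.BytesIO()
    with tarfile.open(fileobj=outer_buf, mode='w') as outer:
        info = tarfile.TarInfo('0001/sample.gz')
        info.size = len(inner_bytes)
        outer.addfile(info, io.BytesIO(inner_bytes))
    outer_buf.seek(0)
    return outer_buf


def read_tex_in_memory(fileobj):
    """Return {name: bytes} for every .tex inside every .gz, without writing to disk."""
    found = {}
    with tarfile.open(fileobj=fileobj) as tar:
        for item in tar:  # iterates over member headers (TarInfo objects)
            if not item.name.endswith('.gz'):
                continue
            nested = tar.extractfile(item)  # file-like object backed by the outer tar
            with tarfile.open(fileobj=nested, mode='r:gz') as gz:
                for member in gz:
                    if member.name.endswith('.tex'):
                        found[member.name] = gz.extractfile(member).read()
    return found


tex_files = read_tex_in_memory(make_sample_tar())
print(sorted(tex_files))
```

Here the file object returned by `extractfile` stands in for the `temp` folder: the nested archive is read straight out of the outer tar, and only the `.tex` payload is ever materialized.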

1 Answer

I was able to answer my question with the help of the comments. (Thanks!) My code now extracts .tex files from multiple .gz files that are inside a single .tar file, without unzipping/saving each .gz file to the computer.

import tarfile

def extractFile(filename):
    # The with blocks close both archives, even if an error occurs
    with tarfile.open(filename) as tar:
        for subfile in tar.getmembers():
            # Open subfile only if it is a regular .gz file
            if subfile.isfile() and subfile.name.endswith('.gz'):
                try:
                    # extractfile() returns a file-like object, so the .gz is never saved to disk
                    gz_file = tar.extractfile(subfile)
                    with tarfile.open(fileobj=gz_file) as gz:
                        # Extract file from .gz into 'latex' subfolder only if .tex
                        for subsubfile in gz.getmembers():
                            if subsubfile.name.endswith('.tex'):
                                gz.extract(subsubfile, path='latex')
                except tarfile.ReadError:
                    # Add subfile name to error log
                    with open('error_log.txt', 'a') as log:
                        log.write(subfile.name + '\n')
brienna
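
The answer logs any `.gz` that raises `tarfile.ReadError` but does not say what those files are. One possibility, and it is only an assumption here, is that some of them are single gzip-compressed files rather than gzipped tar archives. If so, a fallback along these lines might recover them in memory. The helper name `classify_gz` is hypothetical.

```python
import gzip
import io
import tarfile


def classify_gz(raw):
    """Return ('tar', member_names) for a .tar.gz, else ('single', decompressed_bytes)."""
    try:
        with tarfile.open(fileobj=io.BytesIO(raw), mode='r:gz') as gz:
            return 'tar', gz.getnames()
    except tarfile.ReadError:
        # Not a tar archive; assume (unverified) a single gzip-compressed file.
        # gzip.decompress raises an OSError subclass if the bytes are not gzip at all.
        return 'single', gzip.decompress(raw)


# Usage with made-up data: a lone gzip-compressed .tex source
kind, payload = classify_gz(gzip.compress(b'\\documentclass{article}'))
print(kind)
```

In the answer's loop, `raw` would be `tar.extractfile(subfile).read()`. Reading the whole member into memory is a simplification that suits small files; it is not something the original code does.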