
As the title says, I'm downloading a .bz2 file which contains a folder with a lot of text files inside...

My first version decompressed everything in memory, but although the archive is only 90 MB compressed, it expands to 60 files of 750 MB each... the machine obviously can't handle ~40 GB of RAM, so it crashes.

So, the problem is that the files are too big to keep in memory all at the same time. I'm using the code below, which works, but it's too slow:

import os
import requests
import tarfile

response = requests.get('https://fooweb.com/barfile.bz2')

# Save the archive to disk:
compress_filepath = '{0}/files/sources/{1}'.format(zsets.BASE_DIR, check_time)
with open(compress_filepath, 'wb') as local_file:
    local_file.write(response.content)

# Extract the files into a folder:
extract_folder = compress_filepath + '_ext'
with tarfile.open(compress_filepath, "r:bz2") as tar:
    tar.extractall(extract_folder)

# Process one file at a time:
for filename in os.listdir(extract_folder):
    filepath = '{0}/{1}'.format(extract_folder, filename)
    with open(filepath, 'r') as file:
        for line in file:
            some_processing(line)

Is there a way to do this without dumping the archive to disk, decompressing and reading one file from the .bz2 at a time?

Thank you very much for your time in advance; I hope somebody knows how to help me with this.

  • What do you want to do with the decompressed files? You can extract one file at a time from the archive, although it will be pretty slow – Iain Shelvington Jun 29 '21 at 00:30
  • I need to do some processing on every line of every file. What I'm looking for is a way to access the files in memory, one by one, from `requests.content`, without downloading and extracting them to disk (if possible) – Marcos Federico Mandrille Jun 29 '21 at 01:23

2 Answers

#!/usr/bin/python3
import sys
import requests
import tarfile
got = requests.get(sys.argv[1], stream=True)
with tarfile.open(fileobj=got.raw, mode='r|*') as tar:
    for info in tar:
        if info.isreg():
            ent = tar.extractfile(info)
            # now process ent as a file, however you like
            print(info.name, len(ent.read()))
Mark Adler

I did it this way:

import io
import requests
import tarfile

response = requests.get(my_url_to_file)
memfile = io.BytesIO(response.content)
filecount = 0

# We extract the files in memory, one by one:
with tarfile.open(fileobj=memfile, mode="r:bz2") as tar:
    for member in tar.getmembers():
        if not member.isfile():
            continue
        filecount += 1
        extracted = tar.extractfile(member)

        # extractfile() returns a binary file object, so wrap it
        # in TextIOWrapper to iterate over text lines:
        with io.TextIOWrapper(extracted, encoding='utf-8') as read_file:
            for line in read_file:
                process_line(line)