How to avoid MemoryError when opening a series of large .gz files in a loop considering the fact that the files don't yield errors when opened individually?
I have a series of .gz files (each as large as 440 MB) stored on my computer (in case you want to try the code with them, they are the psc files in this directory). I want to open the first one and do some operations with it, then open the second one and do some operations, and so on.
When I execute this code
import gzip
files=['thing1.gz', 'thing2.gz']
x=list(gzip.open(files[0],"r"))
, or this code
import gzip
files=['thing1.gz', 'thing2.gz']
x=list(gzip.open(files[1],"r"))
, namely, when I open each file separately, even though they are huge, I don't encounter any problem.
But I'm a lazy man, so I want to do this for many files without having to run the script manually with a different file each time. Therefore I need a for loop, like so
import gzip
files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=list(gzip.open(current_file,"r"))
And this is where I run into a problem, a MemoryError to be precise. I just assumed that the x variable would be reassigned and any remains of the previous file would be overwritten.
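One detail about that assumption can be checked in isolation: in `x = something()`, the right-hand side is evaluated before the name `x` is rebound, so for a moment the old and the new object exist together. The sketch below uses a hypothetical `Tracked` class as a tiny stand-in for the big list and relies on CPython's reference counting. It only illustrates the ordering; it would not by itself explain why the `del x` variants further down also fail.

```python
class Tracked(object):
    """Hypothetical stand-in for a large list; counts live instances."""
    alive = 0
    peak = 0

    def __init__(self):
        Tracked.alive += 1
        Tracked.peak = max(Tracked.peak, Tracked.alive)

    def __del__(self):
        Tracked.alive -= 1


def rebind_in_loop(n):
    """Mimic `x = list(...)` in a loop without releasing x first."""
    Tracked.alive = Tracked.peak = 0
    x = None
    for _ in range(n):
        # The new object is created while x still refers to the old one.
        x = Tracked()
    return Tracked.peak


def release_before_rebind(n):
    """Same loop, but the old object is dropped before the new one is built."""
    Tracked.alive = Tracked.peak = 0
    x = None
    for _ in range(n):
        x = None
        x = Tracked()
    return Tracked.peak
```

On CPython, `rebind_in_loop(3)` reports a peak of two live objects and `release_before_rebind(3)` a peak of one.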
I've searched through many similar questions (I think this is not a duplicate since all of those similar questions were solved with one method or another, but mine couldn't work with them).
Just to save you some time these are the possible solutions I tried that have failed to solve my problem:
Failed Solution #1
import gzip
files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=list(gzip.open(current_file,"r"))
    del x
This doesn't work, and neither does waiting some time, as was suggested in another thread.
import gzip
import time
files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=list(gzip.open(current_file,"r"))
    time.sleep(120)
    del x
Neither does creating a function that deletes all unimportant variables, also suggested in another thread (as far as I understand, it is the same as del, so why should this have worked anyway?).
import gzip
def clearall():
    not_variables=[var for var in globals() if (var[:2],var[-2:])==("__","__")]
    white_list=["files","gzip","clearall"]
    black_list=[var for var in globals() if var not in white_list+not_variables]
    for var in black_list:
        del globals()[var]
files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=list(gzip.open(current_file,"r"))
    clearall()
Failed Solution #2
Closing the file is another idea that does not work
import gzip
files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=gzip.open(current_file,"r")
    y=list(x)
    x.close()
    del y
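For reference, the same open/read/close sequence can be written with a `with` block, which closes the file even if reading raises. This is only a sketch: `read_lines` and `write_sample` are hypothetical helper names, and the sample file is a tiny stand-in for the real data. Closing releases the file handle, not the list built from it, so it is not expected to lower the memory held by `y` on its own.

```python
import gzip


def write_sample(path, lines):
    """Hypothetical helper: write a small gzip file to experiment with."""
    with gzip.open(path, "wb") as handle:
        for line in lines:
            handle.write(line)


def read_lines(path):
    """Read every line of a gzip file; the file is closed on leaving `with`."""
    with gzip.open(path, "rb") as handle:
        lines = list(handle)
    # At this point handle.closed is True, but `lines` still holds all the data.
    return lines
```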
Failed Solution #3
Forcing the garbage collector, as suggested in many similar questions, also does a bad job for some reason (maybe I haven't understood how it works).
import gzip
import gc
files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=list(gzip.open(current_file,"r"))
    gc.collect()
As Jean-François Fabre pointed out, this is a bad use of the garbage collector (I have not edited the previous code because it may help some people understand, since I saw it written that way in some threads).
The new code, sadly, still doesn't work
import gzip
import gc
files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=list(gzip.open(current_file,"r"))
    x=None
    gc.collect()
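As background on what `gc.collect()` can and cannot do: on CPython an object is freed as soon as its last reference disappears, and the `gc` module only matters for objects caught in reference cycles. A plain list of lines contains no cycles, so an explicit collection is not expected to release anything that `x=None` or `del x` has not already released. The sketch below shows both cases, using a hypothetical `Blob` class and weak references to observe when an object goes away.

```python
import gc
import weakref


class Blob(object):
    """Hypothetical tiny stand-in for a big object."""


def freed_by_del_alone():
    obj = Blob()
    probe = weakref.ref(obj)
    del obj                      # last reference gone
    return probe() is None       # freed immediately by reference counting


def cycle_needs_collector():
    gc.disable()                 # keep the automatic collector out of the demo
    try:
        a, b = Blob(), Blob()
        a.other, b.other = b, a  # build a reference cycle
        probe = weakref.ref(a)
        del a, b
        before = probe() is None     # the cycle still keeps both objects alive
        gc.collect()
        after = probe() is None      # the collector has reclaimed the cycle
        return before, after
    finally:
        gc.enable()
```

On CPython, `freed_by_del_alone()` returns `True` and `cycle_needs_collector()` returns `(False, True)`.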
Failed Solution #4
Then, thinking myself a clever girl, I tried making two scripts: the first opens a specific file (whose name is given in a txt document, which obviously also has to be opened) and performs some operations on it; the second creates that txt file with the name of the current file and runs the first script for that file (in a loop). Namely, I split the script in two: one that opens the file and one that loops so that all the files are opened. This seemed logical to me because when I open each file separately there are no problems. I just had to open them sequentially and automatically with another script! But as it turns out, this doesn't work either.
This is the script that loops on the other script:
import os
files=['thing1.gz', 'thing2.gz']
for current_file in files:
    temporary_file=open("temp.txt","w")
    temporary_file.write(current_file)
    temporary_file.close()
    execfile("file_open_and_process.py")
os.remove("temp.txt")
And this is file_open_and_process.py used by the first script:
import gzip
current_file=open("temp.txt","r").read()
x=list(gzip.open(current_file,"r"))
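One point worth noting about this attempt: `execfile` runs the second script inside the same interpreter process, so it is not equivalent to launching the script by hand once per file. A sketch of what a truly separate run per file could look like is below. `run_in_fresh_interpreter` is a hypothetical helper name, and passing the file name on the command line instead of through temp.txt is an assumption; whether this cures the MemoryError here is untested.

```python
import subprocess
import sys


def run_in_fresh_interpreter(script, data_file):
    """Run `script` on `data_file` in a child Python process and wait for it.

    Returns the child's exit code. Whatever memory the child allocated is
    handed back to the operating system when the child exits.
    """
    return subprocess.call([sys.executable, script, data_file])
```

With this, the loop would call `run_in_fresh_interpreter("file_open_and_process.py", current_file)`, and the worker script would read the name from `sys.argv[1]`.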
Failed Solution #5
Another idea is to wrap all the file opening and processing in a function and then call it in a loop, so that the variables are stored in memory as local instead of global variables, as was said in yet another thread. But this doesn't work either.
import gzip
def open_and_process(path):
    # use the parameter, not the global loop variable
    return list(gzip.open(path,"r"))
files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=open_and_process(current_file)
    del x
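A variation on the function-based attempt above is to stop building the list at all and stream the file instead. The sketch below assumes the operations can be applied one line at a time, which the real (more complex) code may or may not allow; `process_file` and `handle_line` are hypothetical names.

```python
import gzip


def process_file(path, handle_line):
    """Stream a gzip file line by line, passing each line to handle_line.

    Only the current line (plus gzip's internal buffer) is held in memory,
    instead of the whole decompressed file as a list.
    Returns the number of lines seen.
    """
    count = 0
    with gzip.open(path, "rb") as handle:
        for line in handle:
            handle_line(line)
            count += 1
    return count
```

In the loop this would replace `x=open_and_process(current_file)` with a call such as `process_file(current_file, my_callback)`, where `my_callback` stands for whatever per-line work is needed.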
It is really important for me to understand why this is happening, or at least to get a solution that lets me change very little in the code (which is very complex compared to these toy examples).
Thank you in advance!