
How can I avoid a MemoryError when opening a series of large .gz files in a loop, given that the files don't yield errors when opened individually?

I have a series of .gz files (each as large as 440 MB) stored on my computer (in case you want to try the code with them, they are the psc files in this directory). I want to open the first and do some operations with it, then open the second and do some operations, and so on.

When I execute this code:

import gzip

files=['thing1.gz', 'thing2.gz']
x=list(gzip.open(files[0],"r"))

or this code:

import gzip

files=['thing1.gz', 'thing2.gz']
x=list(gzip.open(files[1],"r"))

namely, when I open each file separately, even though the files are huge, I don't encounter any problem.

But I'm a lazy man, so I want to do this for many files without having to execute the script manually for each one. Therefore I need a for loop, like so:

import gzip

files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=list(gzip.open(current_file,"r"))

And this is where I run into a problem, more precisely a MemoryError. I just assumed that the x variable would be rebound on each iteration and any remains of the previous file would be overwritten.
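What I had in mind was roughly this toy sketch (plain lists instead of gzip data, purely illustrative):

    x = list(range(10))  # pretend this is the first file's contents
    x = list(range(10))  # second file; I assumed the first list is freed once x is rebound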

I've searched through many similar questions (I think this is not a duplicate, since all of those similar questions were solved with one method or another, but none of those methods worked for mine).

Just to save you some time, these are the possible solutions I have tried that failed to solve my problem:

Failed Solution #1

import gzip

files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=list(gzip.open(current_file,"r"))
    del x

This doesn't work, nor does waiting some time, as was suggested in another thread:

import gzip
import time

files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=list(gzip.open(current_file,"r"))
    time.sleep(120)
    del x

Nor does creating a function that deletes all unimportant variables, also suggested in another thread (as far as I understand, it is the same as del, so why should this have worked anyway?):

import gzip

def clearall():
    not_variables=[var for var in globals() if (var[:2],var[-2:])==("__","__")]
    white_list=["files","gzip","clearall"]
    black_list=[var for var in globals() if var not in white_list+not_variables]
    for var in black_list:
        del globals()[var]

files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=list(gzip.open(current_file,"r"))
    clearall()

Failed Solution #2

Closing the file is another idea that does not work:

import gzip

files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=gzip.open(current_file,"r")
    y=list(x)
    x.close()
    del y

Failed Solution #3

Forcing the garbage collector, as was suggested in many similar questions, also does a bad job for some reason (maybe I haven't understood how it works):

import gzip
import gc

files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=list(gzip.open(current_file,"r"))
    gc.collect()

As Jean-François Fabre pointed out, this is a bad use of the garbage collector (I don't edit the previous code because it may help some people to see it, since I saw it written that way in some threads).

The new code, sadly, still doesn't work:

import gzip
import gc

files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=list(gzip.open(current_file,"r"))
    x=None
    gc.collect()

Failed Solution #4

Then, thinking of me as a clever girl, I tried making two scripts: the first opens a specific file (whose name is specified in a txt document that obviously also has to be opened) and does some operations with it, and the second just creates that txt file with the name of the current file to be opened by the first script, and runs that script for each file in a loop. Namely, I divided the script in two: one that opens the file and one that creates the loop so all the files are opened. This seemed logical to me, because when I open each file separately there are no problems; I just had to open them sequentially and automatically with another script! But as it turns out, this doesn't work either.

This is the script that loops on the other script:

import os

files=['thing1.gz', 'thing2.gz']
for current_file in files:
    temporary_file=open("temp.txt","w")
    temporary_file.write(current_file)
    temporary_file.close()
    execfile("file_open_and_process.py")

os.remove("temp.txt")

And this is file_open_and_process.py used by the first script:

import gzip

current_file=open("temp.txt","r").read()
x=list(gzip.open(current_file,"r"))

Failed Solution #5

Another idea is to wrap all the file opening and processing in a function and then call it in a loop, so the variables are stored in memory as local instead of global variables, as was said in yet another thread. But this doesn't work either:

import gzip

def open_and_process(current_file):
    return list(gzip.open(current_file,"r"))

files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=open_and_process(current_file)
    del x

It is really important for me to understand why this is happening, or at least to get a solution that requires changing very little in my code (a code that is very complex in comparison to these toy examples).

Thank you in advance!

Swike
    Do you need to store the entire file in the list at once, or can you just iterate over the file object? – Jesse Bakker Feb 17 '18 at 15:51
  • @JesseBakker Not entirely necessary. I would like to know the options in that case. But since I feel this is an important thing to know for the future, let's say that I really need to go over all the file at once. I can open them separately, so I hope there is a way to do it one by one, deleting the previous one from memory. – Swike Feb 17 '18 at 16:12
  • FYI, the file size of the first file, psc_aaa.gz, uncompressed, is 1,718,317,178 bytes. – Mark Tolonen Feb 17 '18 at 16:15
  • If the uncompressed size of all the files you're opening exceeds the amount of memory in your computer, you will not be able to store all the data in memory. The `list()` constructor takes an iterable of objects and stores all the objects in memory. Ergo, if your file is larger than the amount of memory, then creating a `list()` will never work. You will need to process the data in smaller chunks (e.g. one line at a time). – Daniel Pryden Feb 17 '18 at 16:54
  • @DanielPryden But as I said, I don't want to store all the data in memory. I just want to store the data from one file, do something with it, then delete it and pass to the next file. I can do it manually, no problem here. I can open the program for each file each time (first two codes in my question). The problem is doing it automatically. Even a script that tells the cursor how to move on the screen and which programs to click would work, since that's what I'm doing manually and it works. But for obvious reasons I would prefer a more elegant solution. Line by line is not an option here either. – Swike Feb 17 '18 at 17:11
  • Why is line-by-line not an option? Your data looks like it is line-oriented. – Daniel Pryden Feb 17 '18 at 17:55
  • I think this question is probably unanswerable unless you provide a [mcve] that actually runs out of memory. – Daniel Pryden Feb 17 '18 at 17:57
  • If you're truly running into process limits (which I doubt, unless this is a 32-bit machine or Python runtime you're using?), then `execfile` wouldn't help. You would need to use the `subprocess` or `multiprocessing` module to invoke the processing steps as separate OS processes. – Daniel Pryden Feb 17 '18 at 18:00
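Picking up on that last comment, here is a minimal multiprocessing sketch of running each file in its own OS process (the process_one body is a placeholder for the real work):

    import gzip
    import multiprocessing

    def process_one(current_file):
        # runs in a child process; its memory is returned to the OS when it exits
        x = list(gzip.open(current_file, "r"))
        # ... the real processing of x goes here ...

    if __name__ == "__main__":
        files = ['thing1.gz', 'thing2.gz']
        for current_file in files:
            p = multiprocessing.Process(target=process_one, args=(current_file,))
            p.start()
            p.join()  # wait, so only one file is in memory at a time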

2 Answers


Your processing must be so fast that the garbage collector doesn't get a chance to run unless you force it (or it hasn't reached its collection threshold).

I cannot test your example with your data, but the last snippet, which forces the call (the right thing to do), uses the garbage collector incorrectly:

import gzip
import gc

files=['thing1.gz', 'thing2.gz']
for current_file in files:
    x=list(gzip.open(current_file,"r"))
    gc.collect()

When you call gc.collect() you're not collecting the current x but the previous one. You have to del x before calling the garbage collector, because you cannot afford to have both files present in memory:

for current_file in files:
    x=list(gzip.open(current_file,"r"))
    # work
    x = None  # or del x
    gc.collect()  # now x will surely be collected

Now, if that still doesn't work for some (weird) reason, just use two processes and call them with an argument:

master.py contains:

import subprocess

files=['thing1.gz', 'thing2.gz']
for current_file in files:
    subprocess.check_call(["python","other_script.py",current_file])

other_script.py would contain the processing:

import sys
import gzip

with gzip.open(sys.argv[1]) as f:
    x = list(f)
    # rest of your processing

In the end, store the results of your processing (which should be much smaller) in a result file.

After all processes have run, gather the data in the master.py script and continue.
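A minimal sketch of that handoff, assuming each worker writes a small summary next to its input file (the .result suffix and the len(x) stand-in are illustrative, not a fixed convention):

    # other_script.py (sketch): process one file, write a small result
    import sys
    import gzip

    with gzip.open(sys.argv[1]) as f:
        x = list(f)
        result = len(x)  # stand-in for your real processing
    with open(sys.argv[1] + ".result", "w") as out:
        out.write(str(result))

    # master.py (sketch): run the workers, then gather the small results
    import subprocess

    files = ['thing1.gz', 'thing2.gz']
    for current_file in files:
        subprocess.check_call(["python", "other_script.py", current_file])
    results = [open(f + ".result").read() for f in files]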

Jean-François Fabre
  • You are right. I did not use the garbage collector correctly. But I still get the same MemoryError, sadly ;( – Swike Feb 17 '18 at 16:04
  • @Swike Memory usage in my test peaks at >2GB when processing one file, and that wasn't converting the file into a list...just a simple `data = current_file.read()`. – Mark Tolonen Feb 17 '18 at 16:07
  • @MarkTolonen I used psc_aaa.gz and psc_aab.gz as files. You can find them here: http://irsa.ipac.caltech.edu/2MASS/download/allsky/ If you used them then it's strange, because I totally can open each one separately without any MemoryError using the first two codes in my question. – Swike Feb 17 '18 at 16:09
  • @Swike I just downloaded `psc_aaa.gz` and read it more than once. I didn't get a MemoryError, but it peaked around 2GB for one file. I have a 64-bit system with 3GB memory, but I could hear the swap file kicking in. The content of the file is lines of text. I suggest processing them line by line if possible. – Mark Tolonen Feb 17 '18 at 16:11
  • @MarkTolonen Great advice, which I'm going to try now, but I feel that this doesn't answer the question. I can run the program many times and open the files separately each time, so why can't I automate this process? Why can't a script run this process in a loop when I can do it manually? Why can't I completely clean the memory on each iteration inside the script? I need to know if there is a way that does not involve reading line by line, since the script can totally open each file and should totally delete each file from memory each time (as it does when I run the script again). But how? – Swike Feb 17 '18 at 16:32
  • @Swike If this answer doesn't work, then it may be a limitation of Python's garbage collector that it doesn't completely clean the memory as you need, or simply memory fragmentation after processing a few files makes allocation fail, which is a limitation of the OS. This amount of data is unwieldy when processed this way. Ideally, process line by line and write to a database that can handle large datasets. – Mark Tolonen Feb 17 '18 at 16:42
  • @Swike BTW, this answer does work for me on Windows 64-bit, 64-bit Python 3.6 and 3GB of memory in my system. What are your system specs? – Mark Tolonen Feb 17 '18 at 16:45
  • @MarkTolonen Thank you, this worked for me. But as I previously said, my question calls for an answer that applies when I want to work with the entire file and not line by line. This is good for me but probably not for others, and the question is the same: I can run the script for each file one at a time, and I want to automate this process; this shouldn't be impossible. Line by line is not the general solution to my question, so if you know another one it would help a lot to those who might want to work with the entire file each time they open one of the files in the loop. – Swike Feb 17 '18 at 16:54
  • @Swike: I edited my answer to show an alternative. The memory fragmentation explanation is interesting. – Jean-François Fabre Feb 17 '18 at 17:28

The file size of psc_aaa.gz is 1,718,317,178 bytes uncompressed. If possible, process the files line by line instead of reading them into memory all at once:

import gzip

files=['psc_aaa.gz']
for current_file in files:
    with gzip.open(current_file,'rt') as f:
        for line in f:
            print(line,end='')

Output (first few lines):

1.119851|-89.91861|0.11|0.06|90|00042876-8955069 |12.467|0.018|0.021|359.4|12.131|0.025|0.026|224.7|11.963|0.023|0.025|133.7|AAA|222|111|000|666666|37.2|245|1329023254|0|0|1101364107|s|2000-09-22|64|302.951|-27.208|1.6|2451809.7124|1.07|1.18|0.81|12.481|0.014|12.112|0.028|11.98|0.012|332|251|sw|1|1|0|\N|\N|\N|\N|0|\N|59038|1357874|267
1.296576|-89.933235|0.14|0.14|73|00051117-8955596 |16.445|0.147|0.148|8.9|15.49|0.154|0.154|7.7|14.71|0.132|0.132|9.9|BBB|222|111|000|060616|13.6|290|1181038081|0|0|1085342201|s|2000-08-03|111|302.947|-27.194|2.6|2451759.8041|1.31|0.94|1.38|15.996|0.102|14.956|0.161|14.269|0.212|286|250|sw|1|1|0|\N|\N|\N|\N|0|\N|58104|1336392|267
3.373635|-89.964142|0.25|0.23|175|00132967-8957509 |16.601|0.134|0.135|8|16.005|0.185|0.185|5.7|15.512|0.212|0.212|5.3|BCC|222|111|000|060605|25.4|148|1085389169|0|0|1229087102|s|2000-09-02|55|302.939|-27.164|23.9|2451789.6258|0.85|1.1|0.92|16.909|0.316|16.458|0.573|15.476|0.335|175|229|sw|1|1|0|\N|\N|\N|\N|0|\N|66092|1520116|267
7.821089|-89.912903|0.12|0.07|0|00311706-8954464 |12.431|0.021|0.024|346.8|12.038|0.025|0.027|205.9|11.937|0.024|0.026|141.8|AAA|222|111|000|666666|41|237|1101364107|0|0|1127037907|s|2000-09-01|66|302.941|-27.215|-6.7|2451788.7241|1.02|1.11|1.41|12.419|0.008|12.03|0.032|11.912|0.034|354|245|se|1|1|U|0.3|4|15.2|13|1|\N|60459|1390557|267
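Since each record is a single pipe-delimited line, you can also reduce the data as you stream, never holding more than one line in memory (a sketch; the ra/dec names for the first two columns are an assumption, check the catalog documentation):

    import gzip

    files=['psc_aaa.gz']
    for current_file in files:
        with gzip.open(current_file,'rt') as f:
            for line in f:
                fields = line.rstrip('\n').split('|')
                # keep only what you need from each record;
                # the column meanings are assumed here
                ra, dec = float(fields[0]), float(fields[1])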
Mark Tolonen