5

I'm new here and to Python in general, so please forgive any formatting issues and whatever else. I'm a physicist and I have a parametric model where I want to iterate over one or more of the model's parameter values (possibly in an MCMC setting). But for simplicity, imagine I have just a single parameter with N possible values. In a loop, I compute the model and several scalar metrics pertaining to it.

I want to save the data [parameter value, metric1, metric2, ...] line by line to a file. I don't care about the file type: .pickle, .npz, .txt, .csv, or anything else is fine.

I do NOT want to save the array after all N models have been computed. The issue is that sometimes a parameter value is so nonphysical that the program I call to calculate the model (which is a giant, complicated thing years in development, so I'm not touching it) crashes the kernel. If I have N = 30000 models to do, and this happens at model 29000, I'll be very unhappy and have wasted a lot of time. I also probably have to be conscious of memory usage - I've figured out how to do what I propose with a text file, but it crashes around 2600 lines because I don't think it likes opening a text file that long.

So, some pseudo-code:

import numpy as np

filename = 'outFile.extension'
dataArray = np.zeros([N, 3])

for idx, p in enumerate(Parameter1):
    modelOutputVector = calculateModel(p)
    metric1, metric2 = getMetrics(modelOutputVector)
    dataArray[idx, 0] = p
    dataArray[idx, 1] = metric1
    dataArray[idx, 2] = metric2
    ### Line that saves data here

I'm partial to npz or pickle formats, but can't figure out how to do this with either. If there is a better format or a better solution, I'd appreciate any advice.

Edit: Here is what I tried for writing to a text file, placed inside the loop:

fileObject = open(filename, 'ab')
np.savetxt(fileObject, rowOfData, delimiter = ',', newline = ' ')
fileObject.write('\n')
fileObject.close()

The first time it crashed around 2600 lines I thought it was just a coincidence, but every time I try this, that's where it stops. I could hack it and make a batch of files that are each 2600 lines long, but there has to be a better solution.

  • *I've figured out how to do what I propose with a text file, but it crashes around 2600 lines because I don't think it likes opening a text file that long.* --> Do you mean **writing** to a text file that long, as opposed to opening? There shouldn't be a problem with this. Can you show your file writing code that crashes? – MFisherKDX Mar 27 '19 at 00:02
  • I don't think I/O is a problem here. Why do you think writing to the file is the problem? Please provide the error message and traceback. – fabianegli Mar 27 '19 at 00:17
  • Edit: let me put this as an edit to my question so it formats nicely. – restlessleukocyte Mar 27 '19 at 00:17
  • @fabianegli I don't recall the error message, and it takes a few hours to get to 2600 models. But it works past that point when my save solution above is not implemented, and crashes when it is. – restlessleukocyte Mar 27 '19 at 00:22
  • 2
    What is the full error that it crashes with at 2600 lines? It's usually prudent to include a full stack trace in your posts here so it can guide the answers. How else would we know if it's a `pandas` issue, a file writing issue, or something weird and OS-specific? The answer is that we can't, and anything we post to help you is nothing more than speculation, unfortunately. – Reedinationer Mar 27 '19 at 00:25
  • @Reedinationer Is the solution I implemented the way you would do it? I'm looking more for advice on best practice here; what I have set up seems a bit... inelegant. Obviously there are better ways to store numpy arrays, but is there a way to write these to their respective file types line by line? I really don't want to run it for several hours to get an error message for a method that should probably be replaced by something better. – restlessleukocyte Mar 27 '19 at 00:30
  • Consider saving the numpy arrays and metrics in separate files, with a separate file for each model. This might make it easier for you to restart the model calculations from the iteration just before the error. – fabianegli Mar 27 '19 at 00:37
  • Any news? Could you solve the problem or find the error/traceback? – fabianegli Mar 28 '19 at 10:42

3 Answers

2

TL;DR: With your current code, if we run into an error on the xth model, the first x - 1 results will not be lost; they will still be in the text file, which Python flushes and closes automatically when the process exits. However, we can also use a try/except block to prevent Python from crashing when one of the models causes an error, so we can attempt to get a result for all of the models. See the code in the Putting It All Together section.

Avoid Losing Your Progress

As others have pointed out, it's impossible to tell you how to fix your error without knowing what the error is or seeing a stack trace. However, we can prevent Python from crashing by adding some error handling with a try/except block:

for p in Parameter1:
    try:
        # perform model calculations

        fileObject = open(filename, 'ab')
        np.savetxt(fileObject, rowOfData, delimiter = ',', newline = ' ')
        fileObject.write('\n')
        fileObject.close()
    except Exception as err:
        print(f'unable to process {p}: {err}')

This won't prevent any errors, but if an error does occur, rather than crashing, Python will print a message identifying which model caused the error and will continue processing the remaining models.

Memory Usage

I also probably have to be conscious of memory usage - I've figured out how to do what I propose with a text file, but it crashes around 2600 lines because I don't think it likes opening a text file that long.

While it is true that memory usage might be a concern with datasets this large, this is not because you are opening a large text file. When Python opens a file, it does not immediately load all of the file's data into some variable. In fact, it doesn't load any of the file's data. Rather, it keeps track of where you are in the file and only loads the file's contents when you call file.read() (or some other function that reads from the file). Moreover, since you opened the file in append mode, your script cannot read from the file at all.

You can test that file size isn't an issue by creating an arbitrarily long file and attempting to write to it. I tested this by running your script with a dummy array, programmed to write 1,000,000 lines:

import numpy as np

array = np.array([1] * 100)
num_lines = 1_000_000
filename = 'myFile.txt'

for i in range(num_lines):
    fileObject = open(filename, 'a')
    np.savetxt(fileObject, array, delimiter = ',', newline = ' ')
    fileObject.write('\n')
    fileObject.close()

(If you think your issue is opening too large a file, feel free to run this script yourself and verify the output with wc -l myFile.txt, but be warned that the resulting file is 2.3 GB!)

However, memory usage may be a concern in a different part of this script. In your pseudocode, you are storing the metrics for each model in the dataArray. If your real code is storing a very large number of metrics per model, this might become an issue. Only store these metrics if you need them later (e.g., if future models depend on the metrics of previous models).
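
For example, a minimal sketch (reusing the hypothetical p, metric1 and metric2 names from the question's pseudocode) that builds only the current row instead of accumulating everything in dataArray:

import numpy as np

# Inside the loop: build just the row for the current model.
# p, metric1 and metric2 are the per-model values from the question's pseudocode.
rowOfData = np.array([[p, metric1, metric2]])   # shape (1, 3): one row, three columns

This also makes explicit what rowOfData refers to in the snippets below.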

Other Improvements

fileObject = open(filename, 'ab')
np.savetxt(fileObject, rowOfData, delimiter = ',', newline = ' ')
fileObject.write('\n')
fileObject.close()

As @Reedinationer pointed out, it's better to use a with statement here, to avoid the overhead of opening and closing the file for each row (and because it's best practice). You don't need to worry about "losing your progress," as Python will automatically flush your data to the file and close the file when the process ends, regardless of whether you use the with statement. You can test this by opening a file, writing to it, and raising an exception before closing it:

import numpy as np

array = np.array([1] * 100)
num_lines = 1_000_000
filename = 'myFile.txt'

fileObject = open(filename, 'a')
for i in range(num_lines):
    if i == 900_000:
        raise Exception
    np.savetxt(fileObject, array, delimiter = ',', newline = ' ')
    fileObject.write('\n')

The resulting file will have 900,000 lines, even though the process exited before closing the file. We also don't need to open the file in binary mode, since np.savetxt writes plain text to the file. With these changes, our loop looks like this:

with open(filename, 'a') as fileObject:
    for p in Parameter1:
        # perform model calculations

        np.savetxt(fileObject, rowOfData, delimiter = ',', newline = ' ')
        fileObject.write('\n')

Additionally, rather than calling np.savetxt with newline = ' ', and then manually writing a newline to the file, we can just allow np.savetxt to use the default \n newline character:

np.savetxt(fileObject, rowOfData, delimiter = ',')

With this modification, our loop looks like this:

with open(filename, 'a') as fileObject:
    for p in Parameter1:
        # perform model calculations

        np.savetxt(fileObject, rowOfData, delimiter = ',')

Putting It All Together

Here's what the code looks like with all of our improvements:

with open(filename, 'a') as fileObject:
    for p in Parameter1:
        try:
            # perform model calculations

            np.savetxt(fileObject, rowOfData, delimiter = ',')
        except Exception as err:
            print(f'unable to process {p}: {err}')
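
As a usage note, whatever rows made it to disk before a crash can be read back with np.loadtxt (the filename here is the placeholder from the question's pseudocode):

import numpy as np

# Load whatever rows were written before any crash.
# Columns are in the order they were saved: parameter value, metric1, metric2.
results = np.loadtxt('outFile.extension', delimiter=',', ndmin=2)
print(results.shape)   # (number of completed models, 3)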
rpm
  • 1,266
  • 14
0

It's hard to say with such limited knowledge of the error, but if you think it is a file writing error, maybe you could try something like:

with open(filename, 'a') as fileObject:  # text mode ('a'), since a str newline is written below
    # code that computes numpy array
    np.savetxt(fileObject, rowOfData, delimiter = ',', newline = ' ')
    fileObject.write('\n')
# no need to .close() because the "with open()" will handle it

However

  • I have not used np.savetxt()
  • I am not an expert on your project
  • I do not even know if it is truly a file writing error to begin with

I just prefer the with open() technique because that's how all the introductory Python books I've read structure their file reading/writing, so I assume there is wisdom in it. You could also consider doing as fabianegli commented and saving to separate files (that's what my work does); a rough sketch of that idea follows.
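
Here is a minimal sketch of that per-model idea, reusing the hypothetical names from the question's pseudocode (the file-naming scheme is just an illustration):

import numpy as np

# One .npz file per model, named by loop index, so a crash only ever
# costs the model currently being computed.
for idx, p in enumerate(Parameter1):
    modelOutputVector = calculateModel(p)              # placeholder from the question
    metric1, metric2 = getMetrics(modelOutputVector)
    np.savez(f'model_{idx:05d}.npz',
             parameter=p, metric1=metric1, metric2=metric2)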

Reedinationer
  • 5,661
  • 1
  • 12
  • 33
0

I saw the np. in your pseudocode, so you are using numpy to collect the data. To see inside your data, I would suggest using a pandas DataFrame to find which row leads to the crash of your program. You could also use pandas methods to export your DataFrame to a file.

For example:

df.to_csv(filename, index=False)
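
If you go the pandas route, here is a minimal, hypothetical sketch of appending one row per model (the column names and the 'outFile.csv' path are made up; p, metric1 and metric2 come from the question's loop):

import os
import pandas as pd

# Append one row per model; write the header only when the file doesn't exist yet.
row = pd.DataFrame([[p, metric1, metric2]],
                   columns=['parameter', 'metric1', 'metric2'])
row.to_csv('outFile.csv', mode='a', index=False,
           header=not os.path.exists('outFile.csv'))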

What is irritating me is the fact that you set newline to a blank space, which results in all data being written on one line. Could you just leave out this option or set it to \n and try again?

Eric Aya
  • 69,473
  • 35
  • 181
  • 253
JeyJey
  • 41
  • 4
  • They have also added `fileObject.write('\n')`, so the data will not all be written to one line – rpm Mar 21 '23 at 22:17