Here's my situation: I need to run 300 independent processes on a cluster that all add a portion of their data to the same DataFrame (which means each one also needs to read the file before writing). They may need to do this multiple times throughout their runtime.
So I tried using write-locked files with the portalocker package. However, I'm hitting a bug and I don't understand where it's coming from.
Here's the skeleton code where each process will write to the same file:
import pandas as pd
import portalocker

with portalocker.Lock('/path/to/file.pickle', 'rb+', timeout=120) as file:
    # Read the current contents of the shared DataFrame
    file.seek(0)
    df = pd.read_pickle(file)

    # ADD A ROW TO THE DATAFRAME

    # The following part might not be great: I'm trying to remove the old
    # contents of the file first so that I overwrite rather than append.
    # Not sure if this is required, or if there's a better way to do it.
    file.seek(0)
    file.truncate()
    df.to_pickle(file)
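For context, the "ADD A ROW TO THE DATAFRAME" step is just an ordinary append, something along these lines (the column names here are made up):

import pandas as pd

# Hypothetical illustration of the row-append step; in the real code
# `df` comes from pd.read_pickle above and the columns differ.
df = pd.DataFrame([{'process_id': 1, 'value': 0.50}])
new_row = pd.DataFrame([{'process_id': 17, 'value': 0.123}])
df = pd.concat([df, new_row], ignore_index=True)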
That skeleton works most of the time. However, the more simultaneous processes I have write-locking, the more often I get an EOFError at the pd.read_pickle(file) stage:
EOFError: Ran out of input
The traceback is very long and convoluted.
Anyway, my thoughts so far are that since it works sometimes, the code above must be basically fine (though it might be messy, and I wouldn't mind hearing of a better way to do the same thing).
However, when I have too many processes trying to write-lock at once, I suspect the file doesn't have time to save, or at least the next process somehow doesn't yet see the contents saved by the previous one.
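To make that suspicion concrete, here is the kind of thing I imagine might be missing. This is just a guess on my part: forcing the freshly written pickle to disk before the lock is released.

import os
import pandas as pd
import portalocker

with portalocker.Lock('/path/to/file.pickle', 'rb+', timeout=120) as file:
    file.seek(0)
    df = pd.read_pickle(file)
    # ADD A ROW TO THE DATAFRAME
    file.seek(0)
    file.truncate()
    df.to_pickle(file)
    # Just a guess: push the new pickle bytes out of Python's buffers and ask
    # the OS to write them to disk before the lock (and the file) is released.
    file.flush()
    os.fsync(file.fileno())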
Would there be a way around that? I tried adding time.sleep(0.5) statements around my code, before the read_pickle and after the to_pickle.
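In case it helps to see it, the pauses went roughly here (same skeleton as above; the two sleeps are the only change):

import time
import pandas as pd
import portalocker

with portalocker.Lock('/path/to/file.pickle', 'rb+', timeout=120) as file:
    time.sleep(0.5)   # extra pause before reading
    file.seek(0)
    df = pd.read_pickle(file)
    # ADD A ROW TO THE DATAFRAME
    file.seek(0)
    file.truncate()
    df.to_pickle(file)
    time.sleep(0.5)   # extra pause after writing, before the lock is released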
I don't think it helped, though. Does anyone understand what could be happening, or know a better way to do this?
Also note, I don't think the write-lock is timing out. I timed the process, and I also added a flag to record whether the write-lock times out. While there are 300 processes that might be trying to write at varying rates, in general I'd estimate there are about 2.5 writes per second, which doesn't seem like it should overload the system, no?
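For what it's worth, the timing and the timeout flag are essentially this pattern (a sketch rather than my exact code; the exception class portalocker raises on a timed-out lock may depend on the version):

import time
import portalocker

lock_timed_out = False
t0 = time.time()
try:
    with portalocker.Lock('/path/to/file.pickle', 'rb+', timeout=120) as file:
        pass  # read / modify / write the pickle as in the skeleton above
except portalocker.exceptions.LockException:
    # Raised when the lock can't be acquired within the timeout
    # (the exact exception class may vary between portalocker versions).
    lock_timed_out = True
print('lock held for %.3f s, timed out: %s' % (time.time() - t0, lock_timed_out))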
(The pickled DataFrame has a size of a few hundred KB.)