Are there any packages in Python that support concurrent writes on NFS using a serverless architecture?
I work in an environment where I have a supercomputer, and multiple jobs save their data in parallel. While I can save the results of these computations in separate files and combine them later, this requires me to write a reader that is aware of the specific way I split my computation across jobs, so that it knows how to stitch everything into a final data structure correctly.
Last time I checked, SQLite did not support concurrent writes on NFS. Are there any alternatives to SQLite?
Note: By serverless I mean avoiding having to explicitly start another server (on top of NFS) that handles the IO requests. I understand that NFS itself uses a client-server architecture, but that filesystem is already part of the supercomputer I use, and I do not need to maintain it myself. What I am looking for is a package or file format that supports concurrent IO without requiring me to set up any additional servers.
Example:
Here is an example of two jobs that I would run in parallel:
Job 1 populates `my_dict` from scratch with the following data, and saves it to `file`:

```
my_dict['a']['foo'] = [0.2, 0.3, 0.4]
```

Job 2 also populates `my_dict` from scratch with the following data, and saves it to `file`:

```
my_dict['a']['bar'] = [0.1, 0.2]
```

I want to later load `file`, and see the following in `my_dict`:

```
>>> my_dict['a'].items()
[('foo', [0.2, 0.3, 0.4]), ('bar', [0.1, 0.2])]
```
Note that the stitching operation is automatic. In this particular case, I chose to split the keys of `my_dict['a']` across the computations, but other splits are possible. The fundamental idea is that there are no clashes between jobs: it implicitly assumes that jobs only add/aggregate data, so the fusion of the dictionaries (or dataframes, if using Pandas) always aggregates the data, i.e. it computes an "outer join" of the data.
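For context, the kind of stitching I currently do by hand looks roughly like this. It is only a minimal sketch, assuming each job writes its own pickle file (the `job_*.pkl` naming and the `deep_merge` helper are my own illustrative choices, not an existing library's API); my goal is to replace this manual reader with a package that handles concurrent writes for me:

```python
# Sketch of the manual "stitch" step: each job dumps its own nested dict
# to a separate pickle (job_0.pkl, job_1.pkl, ...), and a reader merges
# them with a recursive outer join. Assumes leaves never clash across jobs.
import glob
import pickle

def deep_merge(dst, src):
    """Recursively merge src into dst (nested dicts form an outer join)."""
    for key, value in src.items():
        if isinstance(value, dict) and isinstance(dst.get(key), dict):
            deep_merge(dst[key], value)
        else:
            dst[key] = value
    return dst

def load_all(pattern="job_*.pkl"):
    """Load every per-job pickle matching `pattern` into one merged dict."""
    merged = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, "rb") as f:
            deep_merge(merged, pickle.load(f))
    return merged
```

With the two jobs above, `load_all()` would produce a single dict containing both `my_dict['a']['foo']` and `my_dict['a']['bar']`, but only because I wrote the reader to know the file-naming convention, which is exactly what I want to avoid.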