
Are there any packages in Python that support concurrent writes on NFS using a serverless architecture?

I work in an environment where I have a supercomputer, and multiple jobs save their data in parallel. While I can save the results of these computations in separate files and combine them later, this requires me to write a reader that is aware of the specific way in which I split my computation across jobs, so that it knows how to stitch everything into a final data structure correctly.

Last time I checked, SQLite did not support concurrent writes on NFS. Are there any alternatives to SQLite?

Note: By serverless I mean avoiding explicitly starting another server (on top of NFS) that handles the IO requests. I understand that NFS uses a client-server architecture, but this filesystem is already part of the supercomputer that I use, and I do not need to maintain it myself. What I am looking for is a package or file format that supports concurrent IO without requiring me to set up any (additional) servers.

Example:

Here is an example of two jobs that I would run in parallel:

  • Job 1 populates my_dict from scratch with the following data, and saves it to file:

    my_dict['a']['foo'] = [0.2, 0.3, 0.4]

  • Job 2 also populates my_dict from scratch with the following data, and saves it to file:

    my_dict['a']['bar'] = [0.1, 0.2]

I want to later load the file, and see the following in my_dict:

> my_dict['a'].items()
[('foo', [0.2, 0.3, 0.4]), ('bar', [0.1, 0.2])]

Note that the stitching operation is automatic. In this particular case, I chose to split the keys in my_dict['a'] across the computations, but other splits are possible. The fundamental idea is that there are no clashes between jobs. It implicitly assumes that jobs add/aggregate data, so the fusion of dictionaries (dataframes if using Pandas) always results in aggregating the data, i.e. computing an "outer join" of the data.
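If you do end up with one file per job, the stitching itself does not have to know how the computation was split: a generic recursive merge of nested dictionaries gives the "outer join" described above. A minimal sketch (the `merge_dicts` helper is hypothetical, not from any existing package):

```python
def merge_dicts(a, b):
    """Recursively merge b into a copy of a (an "outer join" of nested dicts).

    Assumes jobs never clash: a non-dict leaf present in both inputs
    raises an error instead of silently overwriting data.
    """
    merged = dict(a)
    for key, value in b.items():
        if key in merged:
            if isinstance(merged[key], dict) and isinstance(value, dict):
                merged[key] = merge_dicts(merged[key], value)
            else:
                raise ValueError(f"clashing key: {key!r}")
        else:
            merged[key] = value
    return merged

# The two jobs from the example above:
job1 = {'a': {'foo': [0.2, 0.3, 0.4]}}
job2 = {'a': {'bar': [0.1, 0.2]}}
my_dict = merge_dicts(job1, job2)
# my_dict['a'] now holds both 'foo' and 'bar'
```

The reader then only needs to glob the per-job files and fold them together with this merge, regardless of how the keys were partitioned.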

Josh
  • Your question is very confusing because you talk about NFS but then use the term 'serverless'. Since NFS always has a server, it doesn't make sense. Can you rephrase it? – Gabe Nov 25 '13 at 21:37
  • Thank you @Gabe - I have updated my OP to address your question. – Josh Nov 25 '13 at 21:42
  • So the Python script you want to write will write to a mounted NFS volume? In that arrangement your python script is the client (no extra servers needed :) – Jason Sperske Nov 25 '13 at 21:46
  • @JasonSperske correct. – Josh Nov 25 '13 at 21:48
  • 1
    This is an *extremely* difficult problem. You're not likely to find anything already written. – Gabe Nov 25 '13 at 23:46
  • @Josh I agree with @Gabe, your requirements are too hard. If you would relax your "serverless" requirement, I would recommend trying e.g. Redis key-value store - it is fast, available for Linux as well Windows and is very easy to use with `redis` package. But it is definitely a server. – Jan Vlcinsky Apr 18 '14 at 01:36

1 Answer


Simple DIY, potentially flaky

Hierarchical locking -- i.e. you lock / first, then lock /foo and unlock /, then lock /foo/bar and unlock /foo. Make changes to /foo/bar and unlock it.

This allows other processes access to other paths. Lock contention on / is relatively small.
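One way to sketch this is with lock files created via `O_CREAT | O_EXCL`, which is atomic on NFSv3 and later (older NFS clients would need the `link(2)` trick instead). This is an illustrative sketch under those assumptions, not a hardened implementation (no crash recovery, no stale-lock cleanup):

```python
import errno
import os
import time

def acquire(lock_path, timeout=30.0, poll=0.1):
    """Take a lock by atomically creating a lock file with O_EXCL."""
    deadline = time.time() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except OSError as e:
            if e.errno != errno.EEXIST or time.time() > deadline:
                raise
        time.sleep(poll)

def release(lock_path):
    os.remove(lock_path)

def lock_hierarchy(root, *parts):
    """Descend the tree, holding each ancestor's lock only long enough
    to take the child's lock, then releasing the ancestor.
    Returns the deepest lock path; the caller must release() it."""
    current = os.path.join(root, '.lock')
    acquire(current)
    path = root
    for part in parts:
        path = os.path.join(path, part)
        child = os.path.join(path, '.lock')
        acquire(child)
        release(current)
        current = child
    return current
```

With this, two jobs writing under `/foo/bar` and `/foo/baz` only contend briefly on the `/` and `/foo` locks, never on each other's leaf.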

Complicated DIY

Adapt a lock-free or wait-free algorithm, e.g. RCU (read-copy-update). Pointers become symlinks or files containing lists of other paths.

http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html
https://dank.qemfd.net/dankwiki/index.php/Lock-free_algorithms
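The "pointers as symlinks" idea can be sketched as an RCU-style publish: write a whole new version of the data to a fresh path, then atomically swap a symlink to it with `rename(2)` (atomic on POSIX filesystems; NFS close-to-open cache semantics add caveats for readers on other clients). The `publish` helper below is a hypothetical illustration:

```python
import os

def publish(data_path, pointer):
    """Point 'pointer' (a symlink) at a new data version, atomically.

    Readers that already resolved the old symlink keep a consistent
    snapshot; readers that resolve it afterwards see the new version.
    """
    tmp = pointer + '.tmp'
    if os.path.lexists(tmp):      # clear a stale temp link if present
        os.remove(tmp)
    os.symlink(data_path, tmp)    # create the new pointer off to the side
    os.rename(tmp, pointer)       # atomically replace the live pointer
```

Writers never modify a published file in place; they write `data.v2`, call `publish('data.v2', 'current')`, and garbage-collect old versions once no reader can still hold them.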

Dima Tisnek