
I'm using Joblib to cache the results of a computationally expensive function in my Python script. The function's input arguments and return values are NumPy arrays. The cache works fine for a single run of my script. Now I want to spawn multiple runs of the script in parallel, sweeping some parameter in an experiment. (The definition of the function remains the same across all runs.)

Is there a way to share the Joblib cache among multiple Python scripts running in parallel? This would save a lot of function evaluations that are repeated across different runs but do not repeat within a single run. I couldn't find out whether this is possible from Joblib's documentation.

Neha Karanjkar
  • If you're already parallelizing within a single run of your script, I don't think there's much to be gained by trying to parallelize across multiple runs as well. I suppose you could potentially do better by re-using the cache from a previous run. I've never tried this, but I would guess that you could do it by using the same `joblib.Memory` object across consecutive runs. – ali_m Jul 30 '14 at 11:14
  • @ali_m: A single run is parallelized, but I need to run multiple runs in parallel as well, because each run takes several days and I have a lot of cores (I'm running these on a cluster). If joblib's cache is a file, then it seems it should be possible for multiple processes to share it...I don't know how. – Neha Karanjkar Jul 30 '14 at 11:27
  • What does your core utilization look like when you're doing a single run? If you're already using all of your cores on a single run then there's no way you'll do any better by parallelizing across runs as well - the additional worker threads will just be competing for the same set of cores, and you may well see performance degradation due to extra threading overhead and cache fighting. It might make more sense to just parallelize across runs instead of within a single run - that way you will spend proportionally less time spawning and terminating threads rather than doing your computation. – ali_m Jul 30 '14 at 12:27
  • If you `mem.cache` the functionality that repeats itself then this should work out of the box. At least on one machine with multiprocessing. On a cluster of several machines that don't share disk space it is an entirely different matter. If they do share disk space and you put the cache there, I don't see why it shouldn't work. – eickenberg Jul 30 '14 at 19:17
  • @eickenberg...Thanks!! :) I guess I was using cachedir = mkdtemp() and that's why it wasn't working before. It works as long as the same directory is used by both processes to hold the cache. – Neha Karanjkar Jul 31 '14 at 06:07
  • @eickenberg please write your comment as answer and I will accept – Neha Karanjkar Jul 31 '14 at 06:15
  • Glad that helped, it is very useful for recurrent functions that take significantly less time to load from cache than to calculate. – eickenberg Jul 31 '14 at 07:28

1 Answer


Specify a common, fixed cachedir and decorate the function that you want to cache using

from joblib import Memory

cachedir = './my_cache_dir'      # any fixed path shared by all runs
mem = Memory(cachedir=cachedir)  # note: joblib >= 0.12 renames this argument to `location`

@mem.cache
def f(arguments):
    """do things"""
    pass

or simply

def g(arguments):
    pass

cached_g = mem.cache(g)

Then, whether you are working across processes or even across machines, as long as all instances of your program have access to cachedir, common function calls will be cached there transparently.
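A minimal end-to-end sketch of this (the cache directory and function are illustrative, not from the question):

```python
import numpy as np
from joblib import Memory

# Any fixed directory visible to all runs works; "./shared_cache" is illustrative.
mem = Memory("./shared_cache", verbose=0)

@mem.cache
def expensive(x):
    # Stand-in for a costly computation on a NumPy array.
    return x ** 2

a = np.arange(5)
r1 = expensive(a)  # computed and written to ./shared_cache
r2 = expensive(a)  # served from the on-disk cache, in this or any other process
```

Any other script that constructs a `Memory` on the same directory and calls the identically defined function will hit the same cache entries.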

eickenberg
  • Indeed, we (the joblib development team) are careful to design the disk-based store in such a way that it is robust to parallel access (and mostly to parallel writes). As a side note, I tend to prefer the 2nd syntax to the first one in the above answer. – Gael Varoquaux Oct 24 '15 at 15:01
  • @GaelVaroquaux, Can you please elaborate why you prefer the latter? I have `@mem.cache` all over my Tornado web app and am wondering if there is a reason I should refactor them to the recommended alternative. Thanks! – Kevin Ghaboosi Apr 15 '16 at 18:11
  • @GaelVaroquaux Also, I wonder if it's worth decorating a function for async access, like using `@gen.coroutine` if the fetch task takes longer than usual and the function is called from an HTTP endpoint or delay sensitive client. Thanks! – Kevin Ghaboosi Apr 15 '16 at 18:18
  • I think the `@`-notation is just a shorthand version of the second. So the second makes it explicit what decoration means, and it gives you the possibility not to lose the original non-decorated function. There can be situations, especially in interactive sessions, where only the second option works due to the name change. – eickenberg Apr 15 '16 at 21:41
  • @GaelVaroquaux Replying a long time later.... When you say "mostly to parallel writes," what is meant by "mostly"? Any important gotchas? Are they documented somewhere? Thanks. – Caleb Dec 07 '20 at 15:39
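A tiny sketch of the distinction the comments above discuss (function and path names are illustrative): explicit wrapping keeps the undecorated function around, while the `@` shorthand rebinds the name to the cached version.

```python
from joblib import Memory

mem = Memory("./demo_cache", verbose=0)  # illustrative path

def g(x):
    return x + 1

cached_g = mem.cache(g)  # explicit wrapping: g itself is left untouched

plain = g(1)           # ordinary call, never touches the cache
cached = cached_g(1)   # goes through the on-disk cache
```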