I'm learning Dask and want to generate random strings, but this only works if the import statements are inside the function `f`.

This works:

```python
import dask
from dask.distributed import Client, progress

c = Client(host='scheduler')

def f():
    from random import choices
    from string import ascii_letters
    rand_str = lambda n: ''.join(choices(population=list(ascii_letters), k=n))
    return rand_str(5)

xs = []
for i in range(3):
    x = dask.delayed(f)()
    xs.append(x)

res = c.compute(xs)
print([r.result() for r in res])
```

This prints something like ['myvDi', 'rZnYO', 'MyzaG']. This is good, as the strings are random.

This, however, doesn't work:

```python
from random import choices
from string import ascii_letters
import dask
from dask.distributed import Client, progress

c = Client(host='scheduler')

def f():
    rand_str = lambda n: ''.join(choices(population=list(ascii_letters), k=n))
    return rand_str(5)

xs = []
for i in range(3):
    x = dask.delayed(f)()
    xs.append(x)

res = c.compute(xs)
print([r.result() for r in res])
```

This prints something like ['tySQP', 'tySQP', 'tySQP'], which is bad because all the random strings are the same.

So I'm curious how I should distribute large, non-trivial code. My goal is to pass arbitrary JSON to a `dask.delayed` function and have that function perform analysis using other modules, like Google's OR-Tools. Any suggestions?

offwhitelotus
    Interesting, I thought declaring your function as impure (e.g. with `dask.delayed(f, pure=False)`) would have given you the expected results (see [here](https://distributed.dask.org/en/latest/client.html#pure-functions-by-default)). Somehow that's not the case. However, your example works well if you use `dask.compute(xs)` instead of the distributed client (if you try this out, remember to comment out `c = Client(..` as this will register the client with dask)... – malbert Aug 01 '19 at 13:36
    Interestingly, when adding `return rand_str(5), time.time()` the output shows three different times, meaning that it does evaluate the function three times. – malbert Aug 01 '19 at 14:21
    Performing the random operation using `numpy` makes your code work fine: `rand_str = lambda n: ''.join(np.random.choice(list(ascii_letters), size=n))`. There must be something strange going on with the random library, which goes beyond my understanding though – malbert Aug 01 '19 at 14:29

1 Answer

Python's random module is odd.

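It creates a hidden generator, with its own state, when it is first imported, and module-level functions such as `choices` use that state when generating random numbers. Unfortunately, having this state around makes it awkward to serialize and move between processes: when `f` is shipped to a worker together with the `choices` it refers to, a copy of that state can travel with it, so every task starts from the same state and produces the same string.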
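The effect can be reproduced without Dask. The following is a minimal sketch using only the standard library: it assumes that what happens on the cluster is roughly equivalent to pickling `choices` once and unpickling it for each task. The helper name `run_task` is made up for this illustration.

```python
import pickle
import random
from string import ascii_letters

# random.choices is a bound method of the module's hidden Random instance,
# so pickling it also pickles that instance's current state.
payload = pickle.dumps(random.choices)

def run_task(data, n=5):
    # Hypothetical stand-in for one task: unpickle a private copy of the
    # generator, which starts from the state saved in the payload.
    choices_copy = pickle.loads(data)
    return ''.join(choices_copy(ascii_letters, k=n))

results = [run_task(payload) for _ in range(3)]
print(results)                  # three identical strings
print(len(set(results)) == 1)
```

Every call restores the same saved state, so the three "random" strings are identical, which matches the behaviour in the question.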

Your solution of importing random within your function is what I do.
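As a rough sketch of that pattern, the function below resolves `choices` only when it runs, so it uses the generator of whichever process executes it instead of a copy captured on the client. The name `random_string` and the parameter `n` are illustrative, and the Dask submission step is left out so the snippet runs on its own.

```python
from string import ascii_letters

def random_string(n=5):
    # Import inside the function: the name is looked up in the process
    # that runs the task, not captured when the function is serialized.
    from random import choices
    return ''.join(choices(ascii_letters, k=n))

samples = [random_string() for _ in range(3)]
print(samples)
```

With the setup from the question, wrapping this as `dask.delayed(random_string)()` should behave like the working version of `f`.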

MRocklin