
If I have a function that depends on some global or other constant like the following:

x = 123

def f(partition):
    return partition + x  # note that x is defined outside this function

df = df.map_partitions(f)

Does this work? Or do I need to include the external variable, x, explicitly somehow?

MRocklin

1 Answer


Single process

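If everything runs in a single process (for example, the default threaded scheduler on one machine, without dask.distributed), then this works as written. The variable x already lives in the process that calls f, so it doesn't need to be moved around.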

Distributed or multi-process

If we have to move the function between processes then we'll need to serialize that function into a bytestring. Dask uses the library cloudpickle to do this.

The cloudpickle library converts the Python function f into a bytes object in a way that captures the external variables in most settings. So one way to see if your function will work with Dask is to try to serialize it and then deserialize it on some other machine.

import cloudpickle
b = cloudpickle.dumps(f)

cloudpickle.loads(b)  # you may want to try this on your other machine as well
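To make that check concrete for the function in the question, here is a minimal sketch that round-trips f through cloudpickle and then calls the restored copy. It runs in a single process, so it only shows that serialization succeeds and that x is still reachable afterwards; it does not prove the function will load on another machine. Note also that this assumes f is defined in your script or interactive session. Functions that live in an importable module are pickled by reference, so the workers would need that module installed.

```python
import cloudpickle

x = 123  # module-level value that f reads but does not take as an argument


def f(partition):
    return partition + x


# Serialize the function to bytes, as would happen before sending it to a worker.
payload = cloudpickle.dumps(f)

# Rebuild the function from the bytes and call it with a plain integer
# standing in for a partition.
restored = cloudpickle.loads(payload)
result = restored(1)

print(type(payload).__name__, result)
```

If dumps or loads raises, or the restored function fails with a NameError, that is a sign the function depends on something cloudpickle could not capture.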

How cloudpickle achieves this is fairly involved. See the cloudpickle documentation if you want the details.

MRocklin