The dask API says, that map_partition can be used to "apply a Python function on each DataFrame partition." From this description and according to the usual behaviour of "map", I would expect the return value of map_partitions to be (something like) a list whose length equals the number of partitions. Each element of the list should be one of the return values of the function calls.
However, with respect to the following code, I am not sure, what the return value depends on:
#generate example dataframe
pdf = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(pdf, npartitions=3)
#define helper function for map. VAL is the return value
VAL = pd.Series({'A': 1})
#VAL = pd.DataFrame({'A': [1]}) #other return values used in this example
#VAL = None
#VAL = 1
def helper(x):
print('function called\n')
return VAL
#check result
out = ddf.map_partitions(helper).compute()
print(len(out))
VAL = pd.Series({'A': 1})
causes 4 function calls (probably one to infer the dtype and 3 for the partitions) and an output with len == 3 and the type pd.Series.pd.DataFrame({'A': [1]})
results in the same numbers, however the resulting type is pd.DataFrame.VAL = None
causes an TypeError ... why? Couldn't a possible use of map_partitions be to do something rather than to return something?VAL = 1
results in only 2 function calls. The result of map_partitions is the integer 1.
Therefore, I want to ask some questions:
- how is the return value of map_partitions determined?
- What influences the number of function calls besides the number of partitions / What criteria has a function to fulfil to be called once with each partition?
- What should be the return value of a function, that only "does" something, i.e. a procedure?
- How should a function be designed, that returns arbitrary objects?