So dask.dataframe.map_partitions() takes a func argument and the meta kwarg. How exactly does it decide its return type? As an example:
# Lots of csv's in ...\some_folder
import numpy as np
import pandas as pd
import dask.dataframe as dd

ddf = dd.read_csv(r"...\some_folder\*", usecols=['ColA', 'ColB'],
                  blocksize=None,
                  dtype={'ColA': np.float32, 'ColB': np.float32})

# per-partition reduction: last row of the partition divided by its length
example_func = lambda x: x.iloc[-1] / len(x)
metaResult = pd.Series({'ColA': .1234, 'ColB': .1234})
result = ddf.map_partitions(example_func, meta=metaResult).compute()
I'm pretty new to "distributed" computing, but I would intuitively expect this to return a collection (a list or dict, most likely) of Series objects. Instead, the result is a single Series that is effectively a concatenation of the results of example_func on each partition. Even that would suffice if the Series had a MultiIndex indicating the partition label.
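For concreteness, the kind of per-partition collection I was expecting looks something like the sketch below, built with to_delayed() (which, as I understand it, hands back one lazy object per partition). This is just my mental model of the result, not something I know to be idiomatic:

import dask

parts = ddf.to_delayed()  # one Delayed object per partition
delayed_results = [dask.delayed(example_func)(p) for p in parts]
per_partition = dask.compute(*delayed_results)  # tuple of pd.Series, one per csv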
From what I can tell from this question, the docs, and the source code itself, this is because ddf.divisions returns (None, None, ..., None) after reading csv's? Is there a dask-native way to do this, or do I need to manually break apart the returned Series (the concatenation of the Series that example_func returned on each partition) myself?
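In case it clarifies the question, the manual fallback I have in mind is roughly the following, assuming every partition contributes exactly len(metaResult) rows to the concatenated result (which should hold here, since example_func returns one ColA/ColB pair per partition); the integer partition keys are just made up for illustration:

n = len(metaResult)  # rows each partition contributes (ColA, ColB)
chunks = [result.iloc[i * n:(i + 1) * n] for i in range(ddf.npartitions)]
labeled = pd.concat(chunks, keys=range(ddf.npartitions))  # MultiIndex: (partition, column)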
Also, feel free to correct my assumptions/practices here, as I'm new to dask.