In the code snippet below, I would expect the logs to print the numbers 0 to 4. I understand that the numbers may not appear in that order, as the task may be broken up into a number of parallel operations.
Code snippet:

    from dask import dataframe as dd
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'A': np.arange(5),
                       'B': np.arange(5),
                       'C': np.arange(5)})
    ddf = dd.from_pandas(df, npartitions=1)

    def aggregate(x):
        # Log the B value of each row passed in by apply
        print('B val received: ' + str(x.B))
        return x

    ddf.apply(aggregate, axis=1).compute()

But when the above code is run, I see this instead:

    B val received: 1
    B val received: 1
    B val received: 1
    B val received: 0
    B val received: 0
    B val received: 1
    B val received: 2
    B val received: 3
    B val received: 4

Instead of 0 to 4, I see a series of 1s printed first, and an extra 0. I have noticed these "extra" rows of value 1 every time I have set up a Dask DataFrame and run an apply operation on it.
Printing the dataframe shows no additional rows with value 1 throughout:

       A  B  C
    0  0  0  0
    1  1  1  1
    2  2  2  2
    3  3  3  3
    4  4  4  4

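As a sanity check, here is a minimal sketch of the same apply run on the plain pandas frame, with no Dask involved. The `seen` list and the `aggregate_logged` name are hypothetical additions for illustration only; recording `x.name` alongside `x.B` shows which index label each call received. Some older pandas versions may pass the first row more than once, but I would not expect any label outside 0 to 4 here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(5),
                   'B': np.arange(5),
                   'C': np.arange(5)})

# Hypothetical helper list: one (index label, B value) pair per call
seen = []

def aggregate_logged(x):
    # With axis=1, x is a row Series and x.name is that row's index label
    seen.append((x.name, int(x.B)))
    return x

result = df.apply(aggregate_logged, axis=1)
print(seen)
```

Logging the index label in the same way inside the Dask version might help show whether the extra calls carry labels that actually exist in the frame.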
My question is: Where are these rows with value 1 coming from? Why do they appear to consistently occur prior to the "actual" rows in the dataframe? The 1 values seem unrelated to the values in the actual rows (that is, it is not as though it is for some reason grabbing the second row an extra few times).