Background:
The train.csv dataset has over 100M records.
I tried it on the first 1M records.
I wrote two functions:
1. func1: applied to the partitions of train, returns 1 new dataframe
2. func2: applied to the partitions of train, returns 2 new dataframes
Problems:
- Using map_partitions(func1), a single concatenated dataframe was returned. -- Worked
(Well, I thought it would return a Dask dataframe with partitions, but I googled it: if the function returns a dataframe, Dask concatenates the per-partition results. OK, no problem here.)
- Using map_partitions(func2), which returns 2 dataframes. -- Failed. I got a 'too many values to unpack (expected 2)' error.
(Apparently, I could run func1 twice, returning 1 dataframe each time, but that is time consuming, and I think there should be a way to produce both in one go; see the sketch after the sample code.)
Sample code below:
def func1(df):
    # data processing
    return new_df

def func2(df):
    # data processing
    return new_df1, new_df2

output1 = train.map_partitions(func1, meta=train)  # Worked
output1, output2 = train.map_partitions(func2, meta=train)  # Error: 'too many values to unpack (expected 2)'
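For context, the workaround I'm currently considering is to drop down to delayed objects and rebuild two dataframes from the per-partition tuples. A rough sketch (untested on the full data; meta1 and meta2 are placeholder empty pandas DataFrames describing the columns/dtypes of each output):

import dask
import dask.dataframe as dd

# One pass over the partitions, splitting func2's tuple output.
parts = train.to_delayed()                       # one Delayed object per partition
pairs = [dask.delayed(func2)(p) for p in parts]  # each is a delayed (new_df1, new_df2) tuple
output1 = dd.from_delayed([pair[0] for pair in pairs], meta=meta1)
output2 = dd.from_delayed([pair[1] for pair in pairs], meta=meta2)

This should still only read each partition once, since indexing a Delayed tuple just adds a task rather than computing it, but I don't know if it's the idiomatic way.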
Let me know if more information is needed.