
Background:

  • The train.csv dataset has over 100M records

  • Tried on the first 1M records

I wrote two functions:

1. func1: applied to each partition of train, returns 1 new dataframe

2. func2: applied to each partition of train, returns 2 new dataframes

Problems:

  1. Using map_partitions(func1), a single concatenated dataframe was returned. -- Worked

(Well, I thought it should return a dask dataframe with partitions, but I googled it and found that dask concatenates the results if the function returns a dataframe, so no problem here.)

  2. Using map_partitions(func2), which returns 2 dataframes. -- Failed with a 'too many values to unpack (expected 2)' error.

(Apparently, I can run func1 twice, returning 1 dataframe each time, but that is time-consuming and I think there should be a way to do both in one go...)

Sample code below:

def func1(df):
    # data processing on one partition
    return new_df


def func2(df):
    # data processing on one partition
    return new_df1, new_df2


output1 = train.map_partitions(func1, meta=train)  # Worked

output1, output2 = train.map_partitions(func2, meta=train)  # Error: 'too many values to unpack (expected 2)'
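For a runnable repro, here is a minimal sketch with toy data (the column names are made up and an in-memory frame stands in for train.csv) that shows both calls:

import pandas as pd
import dask.dataframe as dd

# toy stand-in for train.csv (hypothetical columns a, b, c)
pdf = pd.DataFrame({"a": range(10), "b": range(10), "c": range(10)})
train = dd.from_pandas(pdf, npartitions=2)

def func1(df):
    return df[df["a"] % 2 == 0]                        # 1 new dataframe per partition

def func2(df):
    return df[df["a"] % 2 == 0], df[df["a"] % 2 == 1]  # 2 new dataframes per partition

output1 = train.map_partitions(func1, meta=train)      # works; partition results come back concatenated
print(output1.compute())

output1, output2 = train.map_partitions(func2, meta=train)  # 'too many values to unpack (expected 2)'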

Let me know if more information is needed.

map_partitions can't return multiple outputs. One alternative may be to convert your Dask DataFrame to [Delayed objects](https://docs.dask.org/en/latest/delayed.html) and work on that. – pavithraes Oct 07 '21 at 14:04
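For later readers, here is a rough (untested) sketch of what that comment suggests, assuming func2 stays as above: convert the partitions to Delayed objects, call func2 on each with nout=2, and rebuild two dask DataFrames from the pieces.

import dask
import dask.dataframe as dd

parts = train.to_delayed()                               # one Delayed object per partition
pairs = [dask.delayed(func2, nout=2)(p) for p in parts]  # each item is a 2-tuple of Delayed results
output1 = dd.from_delayed([a for a, b in pairs])         # dask DataFrame built from the first outputs
output2 = dd.from_delayed([b for a, b in pairs])         # dask DataFrame built from the second outputs

Passing an explicit meta to from_delayed should avoid dask having to compute a partition up front just to infer the schema.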

0 Answers