I have a dataframe of parameters and apply a function to each row. This function is essentially a couple of SQL queries and some simple calculations on the result.

I am trying to leverage Dask's multiprocessing while keeping the structure and roughly the same interface. The example below works and indeed gives a significant speedup:

def _metric(row):
    # Collect the descriptive fields for this area.
    record = {'areaName': row['name'],
              'areaType': row.area_type,
              'borough': row.Borough,
              'fullDate': row['start'],
              'yearMonth': row['start'],
              }

    # Build the SQL query for this row's area and date range.
    Q = Qsi.format(unittypes=At,
                   start_date=row['start'],
                   end_date=row['end'],
                   freq='Q',
                   area_ids=row['descendent_ids'])

    sales = _get_DF(Q)
    record['salesInventory'] = len(sales)
    record['medianAskingPrice'] = sales.price.median()
    R.append(record)

R = []
x = ddf.map_partitions(lambda part: part.apply(_metric, axis=1), meta={'result': None})
x.compute()

result2 = pd.DataFrame(R)

However, when I try to use the .apply method directly instead (see below), it raises: 'DataFrame' object has no attribute 'name'

R = list()
y = ddf.apply(_metric, axis=1, meta={'result': None})

Yet ddf.head() shows that the dataframe does have a name column.

Jean-François Corbett
Philipp_Kats
  • You write `dask_DF.apply()` but say that `ddf` has a name column. Try `ddf.apply()`. – Mike Müller Oct 13 '17 at 20:04
  • Thanks, but that was just a misspelling (now resolved) from simplifying the code here. It has nothing to do with the issue. – Philipp_Kats Oct 13 '17 at 20:06
  • The accepted answer also works for me. But the code sample in the question is too complex, and most of the code is not related to the problem. – Gary Wang Jun 22 '21 at 15:34

1 Answer

If applying your _metric function produces a Series (one value per row), try passing meta as a tuple: meta=('name of the output series', 'dtype of the output').

This worked for me.

Jean-François Corbett
Cherrymelon
  • Could you please explain why using a tuple makes a difference here? This is not apparent from the documentation. – ta8 May 13 '20 at 10:37
  • I'm sorry, I haven't used Dask for nearly 2 years. I guess the meta parameter tells Dask which part you want to use and the output type; maybe if you don't set the dtype, Dask infers the wrong one. – Cherrymelon May 14 '20 at 13:16