
I have code that works in pandas, but I'm having trouble converting it to use dask. There is a partial solution here, but it does not allow me to use a variable as the name of the column I am assigning to.

Here's the working pandas code:

percent_cols = ['num_unique_words', 'num_words_over_6']

def find_fraction(row, col):
    return row[col] / row['num_words']

for c in percent_cols:
    df[c] = df.apply(find_fraction, col=c, axis = 1)

Here's the broken dask code:

data = dd.from_pandas(df, npartitions=8)

for c in percent_cols:
    data = data.assign(c = data[c] / data.num_words)

This assigns the result to a new column called c rather than modifying the value of data[c] (what I want). Creating a new column would be fine if I could have the column name be a variable. E.g., if this worked:

for c in percent_cols:
    name = c + "new"
    data = data.assign(name = data[c] / data.num_words)

Python treats the left side of an = inside a function call as a literal keyword name rather than an expression, so assign creates a column literally called name and ignores the value stored in the variable name.
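To make the keyword behaviour concrete, here is a minimal pure-Python sketch. The assign function below is a hypothetical stand-in, not the pandas or dask method; it only reports which keyword names it received. It also shows that unpacking a dict with ** is one way to pass a keyword whose name is computed at runtime.

```python
def assign(**columns):
    # Hypothetical stand-in for DataFrame.assign:
    # it just returns the keyword names it was called with.
    return sorted(columns)

name = "num_unique_words" + "new"

# The keyword here is the identifier "name", not the variable's value.
literal = assign(name=1.0)

# Unpacking a dict uses the string stored in the variable as the keyword.
dynamic = assign(**{name: 1.0})

print(literal)
print(dynamic)
```

The first call sees a keyword called name; the second sees the string held in the variable.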

How can I use a variable for the name of the column? The for loop iterates far more times than I'm willing to copy/paste.
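One approach that may work is to pass the column name through dict unpacking, so the loop variable supplies the keyword. The sketch below uses plain pandas with made-up sample values so that it runs on its own; only the column names come from the code above. dask's assign also accepts keyword arguments, so the same line should carry over to a frame built with dd.from_pandas, though that is an assumption not exercised here.

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "num_words": [10, 20, 40],
    "num_unique_words": [5, 10, 10],
    "num_words_over_6": [1, 4, 10],
})
percent_cols = ["num_unique_words", "num_words_over_6"]

# With dask this would instead be: data = dd.from_pandas(df, npartitions=8)
data = df

for c in percent_cols:
    # Unpacking a dict lets the column name come from the variable c,
    # so the existing column is replaced instead of a column named "c"
    # being created.
    data = data.assign(**{c: data[c] / data["num_words"]})

print(data)
```

Using c + "new" as the dict key instead of c would add new columns and leave the originals untouched.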

Community
kaz
  • `.apply` does work the same. It's just that you can't mutate dask DataFrames like you can in pandas, as described in the linked question. In other words, in your code, the `.apply` works fine; it's the `data[c] = ...` that doesn't work. – BrenBarn Oct 20 '15 at 18:23
