I have code that works in pandas, but I'm having trouble converting it to use dask. There is a partial solution here, but it does not allow me to use a variable as the name of the column I am assigning to.
Here's the working pandas code:
percent_cols = ['num_unique_words', 'num_words_over_6']
def find_fraction(row, col):
    return row[col] / row['num_words']
for c in percent_cols:
    df[c] = df.apply(find_fraction, col=c, axis=1)
Here's the broken dask code:
data = dd.from_pandas(df, npartitions=8)
for c in percent_cols:
    data = data.assign(c = data[c] / data.num_words)
This assigns the result to a new column called c rather than modifying the value of data[c] (what I want). Creating a new column would be fine if I could have the column name be a variable. E.g., if this worked:
for c in percent_cols:
    name = c + "new"
    data = data.assign(name = data[c] / data.num_words)
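For obvious reasons, Python doesn't allow an expression to the left of the = in a keyword argument: the keyword is taken as a literal identifier, so the call creates a column literally called name and ignores the value stored in the variable name.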
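To make the keyword behaviour concrete, here is a minimal pure-Python sketch. The `assign` function below is a hypothetical stand-in that only reports which keyword names it received; it is not the pandas or dask method. It contrasts a literal keyword with a dict unpacked via `**`, whose keys can be computed at run time.

```python
# Hypothetical stand-in for a keyword-based API such as DataFrame.assign:
# it returns the keyword names it was called with.
def assign(**kwargs):
    return sorted(kwargs)

c = "num_unique_words"
name = c + "new"

# The keyword here is the literal identifier `name`;
# the value stored in the variable is never consulted.
literal = assign(name=1.0)

# Unpacking a dict with ** lets the key be any string built at run time.
dynamic = assign(**{name: 1.0})

print(literal)  # ['name']
print(dynamic)  # ['num_unique_wordsnew']
```

Since assign takes its columns as keyword arguments, the same unpacking should carry over to the loop above, e.g. data = data.assign(**{name: data[c] / data.num_words}), though that line is an untested assumption here.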
How can I use a variable for the name of the column? The for loop iterates far more times than I'm willing to copy/paste.