
I'm trying to apply a dask-ml QuantileTransformer to a percentage field and store the result as a new field, percentage_qt, in the same dataframe. But I get the error `Array assignment only supports 1-D arrays`. How can I make this work?

import pandas as pd
import dask.dataframe as dd
from dask_ml.preprocessing import QuantileTransformer

mydict = [{'percentage': 12.1, 'b': 2, 'c': 3, 'd': 4},
          {'percentage': 10.2, 'b': 200, 'c': 300, 'd': 400},
          {'percentage': 11.3, 'b': 2000, 'c': 3000, 'd': 4000}]
df = pd.DataFrame(mydict)
ddf = dd.from_pandas(df, npartitions=10)

qt = QuantileTransformer(n_quantiles=100)
x = ddf[['percentage']]
y = qt.fit_transform(x)
ddf['percentage_qt'] = y # <-- error happens here

1 Answer


The error you get is the following

ValueError: Array assignment only supports 1-D arrays

Here y is a 2-D dask array with a single column, not the 1-D array that column assignment expects. One workaround is the following.

Convert y to a dask dataframe that uses the same index as ddf:

dfy = y.to_dask_dataframe(
    columns=['percentage_qt'],
    index=ddf.index)

For some reason concat doesn't work here (along axis=1 it raises an error about unknown divisions; maybe we should open an issue on GH), so we can join the two dataframes instead:

ddf_out = ddf.join(dfy)

This returns the expected output:

print(ddf_out.compute())
   percentage     b     c     d  percentage_qt
0        12.1     2     3     4       1.000000
1        10.2   200   300   400       0.000000
2        11.3  2000  3000  4000       0.656772
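
To see why the original assignment was rejected, it may help to look at the shapes involved. The sketch below uses a plain NumPy array as a stand-in for the dask array (an assumption made only so the example runs without a cluster or dask-ml); the values are copied from the output above. Transformers in the scikit-learn style return a 2-D result of shape (n_samples, n_features), even when there is a single feature, whereas a column is 1-D.

```python
import numpy as np

# Stand-in for the transformer output: a 2-D array with one column.
# (Values copied from the output above; NumPy replaces dask here.)
y = np.array([[1.0], [0.0], [0.656772]])
print(y.shape, y.ndim)  # (3, 1) 2

# A single column is 1-D; selecting the only feature drops the extra axis.
col = y[:, 0]
print(col.shape, col.ndim)  # (3,) 1
```

Whether such a 1-D slice can be assigned straight to ddf depends on dask being able to line up the array's chunks with the dataframe's partitions, which this sketch does not exercise. The join above avoids that question.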
  • I have problems, I think axis should be 1 instead of 0, look at [this](https://docs.dask.org/en/latest/generated/dask.dataframe.multi.concat.html) – ps0604 Feb 02 '22 at 03:34
  • If I set axis=1, I get the following error: `Unable to concatenate DataFrame with unknown division specifying axis=1` where `dfy.known_divisions = False` – ps0604 Feb 02 '22 at 04:08
  • You are right, there are some problems with the indices. – rpanai Feb 02 '22 at 09:51
  • I changed my answer. – rpanai Feb 02 '22 at 09:59
  • rpanai, why did you use a join instead of `ddf['percentage_qt'] = dfy['percentage_qt']` ? – ps0604 Feb 02 '22 at 22:15