I'm confused by the result of dask_ml.preprocessing.OrdinalEncoder.transform:
import string
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask_ml.preprocessing import OrdinalEncoder as DaskOrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
N = 10
np.random.seed(1234)
df = pd.DataFrame({
    "cat1": np.random.choice(list(string.ascii_uppercase)[0:3], size=N),
    "cat2": np.random.choice(list(string.ascii_uppercase)[0:3], size=N),
})
df_dd = dd.from_pandas(df, npartitions=3)
The original sklearn OrdinalEncoder.fit_transform returns a numpy.ndarray of numeric codes:
>>> OrdinalEncoder().fit_transform(df)
array([[2., 2.],
       [1., 0.],
       [0., 0.],
       [0., 2.],
       [0., 2.],
       [1., 2.],
       [1., 0.],
       [1., 0.],
       [2., 0.],
       [2., 1.]])
The dask-ml counterpart not only breaks the interface by returning a DataFrame instead of an array, it also returns the input DataFrame unchanged:
>>> DaskOrdinalEncoder().fit_transform(df_dd).compute().equals(df)
True
What I would expect is either a (pandas or Dask) DataFrame or a (NumPy or Dask) array holding numeric values analogous to what the sklearn OrdinalEncoder produces.
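To make that expectation concrete, here is a minimal pandas-only sketch of the kind of numeric output I mean. It does not use dask-ml at all. The helper name ordinal_codes and the three-row example frame are made up for illustration, and it assumes each column's labels are coded in sorted order, as in the sklearn output above.

```python
import pandas as pd


def ordinal_codes(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: replace each column's labels with integer codes.

    Each column is converted to the pandas categorical dtype, whose
    categories are inferred in sorted order, and then replaced by its codes.
    """
    return frame.apply(lambda col: col.astype("category").cat.codes)


# Small made-up frame (not the random data from the question).
example = pd.DataFrame({"cat1": ["C", "B", "A"], "cat2": ["C", "A", "A"]})
encoded = ordinal_codes(example)
print(encoded)
```

Codes are assigned per column, so "C" maps to 2 in cat1 (labels A, B, C) but to 1 in cat2 (labels A, C), which is also how sklearn's OrdinalEncoder treats each feature independently.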