
Right now I'm performing multi-hot encoding in vanilla numpy, but I'd like to port the code to dask.

import numpy as np

data = np.array([
    [1, 4, 77, 87, 100, 101, 102, 121],
    [12, 41, 58, 67, 81, 84, 96, 111],
    [31, 33, 35, 50, 60, 70, 92, 99],
])

multihot = np.eye(128, dtype=bool)[data]           # (3, 8, 128): one-hot vector per element
multihot = np.logical_or.reduce(multihot, axis=1)  # OR over each row's one-hots -> (3, 128)

print(multihot.astype(int))

The previous block of code produces this output:

[[0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
  0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
  1. Is there a more efficient, dask-compatible way to perform this type of conversion? (The number of categories is 128, and each row of data is sorted in ascending order.)
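For context, one alternative formulation I've considered (a sketch, not benchmarked) avoids materializing the intermediate (3, 8, 128) array that the eye-indexing approach creates, by scattering directly into a preallocated boolean array:

```python
import numpy as np

data = np.array([
    [1, 4, 77, 87, 100, 101, 102, 121],
    [12, 41, 58, 67, 81, 84, 96, 111],
    [31, 33, 35, 50, 60, 70, 92, 99],
])

# Preallocate the output and set the listed categories per row.
multihot = np.zeros((data.shape[0], 128), dtype=bool)
# Row indices (shape (3, 1)) broadcast against the category indices (shape (3, 8)).
multihot[np.arange(data.shape[0])[:, None], data] = True
```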

  2. Can I port this procedure to dask? The problem with this block of code is that dask does not support slicing with lists along multiple axes. Moreover, dask.array.Array.vindex does not support indexing with dask objects; you first have to call compute (e.g. da.eye(128, dtype=bool).vindex[data.compute()]), but data is a huge dask array and does not fit in memory.
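One workaround I've been experimenting with (a sketch, assuming data is chunked so that rows are never split across chunks) is to sidestep the fancy-indexing limitation entirely by applying the numpy routine block-wise with da.map_blocks:

```python
import numpy as np
import dask.array as da

NUM_CATEGORIES = 128  # number of categories, as in the question

def block_multihot(block):
    """Multi-hot encode one NumPy block of category indices."""
    out = np.zeros((block.shape[0], NUM_CATEGORIES), dtype=bool)
    out[np.arange(block.shape[0])[:, None], block] = True
    return out

# Toy stand-in for the real (huge) dask array; chunks=(2, -1) keeps rows whole.
arr = np.array([
    [1, 4, 77, 87, 100, 101, 102, 121],
    [12, 41, 58, 67, 81, 84, 96, 111],
    [31, 33, 35, 50, 60, 70, 92, 99],
])
data = da.from_array(arr, chunks=(2, -1))

# Each input block of shape (r, 8) becomes an output block of shape (r, 128),
# so the new chunk structure along axis 1 must be declared explicitly.
multihot = da.map_blocks(
    block_multihot,
    data,
    dtype=bool,
    chunks=(data.chunks[0], (NUM_CATEGORIES,)),
)
```

Since each block is encoded independently, the one-hot lookup table never has to be indexed with a dask object and nothing beyond a single block is held in memory at once.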

S1M0N38
  • Check out [`dask_ml.preprocessing.DummyEncoder`](https://ml.dask.org/modules/generated/dask_ml.preprocessing.DummyEncoder.html) - it does exactly what you're looking for! – Michael Delgado Dec 02 '22 at 08:22

0 Answers