I have a 2-D numpy array with categorical data in every column.

I am trying to encode each column separately, while also handling values in the test data that were not seen in the training data.

I have this code:

from sklearn.preprocessing import LabelEncoder

for column in range(X_train.shape[1]):

    label_encoder = LabelEncoder()

    X_train[:, column] = label_encoder.fit_transform(X_train[:, column])

    mappings = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

    map_function = lambda x: mappings.get(x, -1)

    X_test[:, column] = map_function(X_test[:, column])

and I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-***********> in <module>
     39         mappings = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
     40         map_function = lambda x: mappings.get(x, -1)
---> 41         X_test[:, column] = map_function(X_test[:, column])
     42 
     43 

<ipython-input-***********> in <lambda>(x)
     38         X_train[:, column] = label_encoder.fit_transform(X_train[:, column])
     39         mappings = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
---> 40         map_function = lambda x: mappings.get(x, -1)
     41         X_test[:, column] = map_function(X_test[:, column])
     42 

TypeError: unhashable type: 'numpy.ndarray'

How can I fix this?

In general, would you suggest a better way to do what I want to do?

P.S.

I tried to do this to see what is happening:

for column in range(X_train.shape[1]):
    label_encoder = LabelEncoder()
    X_train[:, column] = label_encoder.fit_transform(X_train[:, column])
    mappings = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

    try:
        map_function = lambda x: mappings.get(x, -1)
        X_test[:, column] = map_function(X_test[:, column])
    except:
        print(X_test[:, column])
        for i in range(X_test[:, column].shape[0]):
            if isinstance(X_test[i, column],np.ndarray):
                print(X_test[i, column])
        print()

but nothing was printed by print(X_test[i, column]), so I am not sure there is any numpy array within X_test[:, column].

I also checked with if not isinstance(X_test[i, column],str) and again nothing was printed, so every element of X_test[:, column], for each column, must be a string.

P.S.2

When I do this:

 for i in range(X_test[:, column].shape[0]):
     X_test[i, column] = mappings.get(X_test[i, column], -1)

it works with no error. So for some reason, the way I defined the lambda function sends the whole numpy array to it rather than its elements one at a time.

Outcast
  • It seems that one or more of the values in the `X_test[:, column]` is a `np.ndarray` - I suggest you surround the last line with `try/except TypeError` and examine the value that throws it – bluesummers Sep 05 '19 at 09:43
  • @bluesummers, thank you for comment. Check my edited post at the bottom. I am not sure if there is any numpy array within `X_test[:, column]`. Let me know your thoughts on this. – Outcast Sep 05 '19 at 09:59
  • Can you try to swap `map_function = lambda x: mappings.get(x, -1); X_test[:, column] = map_function(X_test[:, column])` with `map_function = np.vectorize(lambda x: mappings.get(x, -1)); X_test[:, column] = map_function(X_test[:, column])` – bluesummers Sep 05 '19 at 10:41
  • @bluesummers, yes good point, it actually works with what you said. (You can write it as a proper answer if you want and I will upvote it etc). See also my PS2 at my edited post. Something was happening with the `lambda function` only. But I did that because I followed this answer: https://stackoverflow.com/a/35216364/9024698 which was suggesting to follow the 'direct' method and directly apply the lambda function without `vectorizer`. – Outcast Sep 05 '19 at 10:46

1 Answer

What happens here is that map_function is called once with the whole vector X_test[:, column], not once per element. mappings.get then tries to use that numpy array as a dictionary key, and because an array is not hashable, you get the TypeError.

Replace the line

map_function = lambda x: mappings.get(x, -1)

with

map_function = np.vectorize(lambda x: mappings.get(x, -1))

np.vectorize calls the lambda once per element, so each element (rather than the whole array) is used as the key in the mapping. This works as long as every element is hashable.
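As a minimal sketch of the difference, assuming a small made-up column of string categories (the arrays `train_col` and `test_col` and the way `mappings` is built here are illustrative, not the asker's data or the LabelEncoder itself):

```python
import numpy as np

# Hypothetical toy data standing in for one column of X_train / X_test.
train_col = np.array(["red", "green", "blue", "green"])
test_col = np.array(["green", "purple", "red"])

# Same kind of mapping the question builds from label_encoder.classes_:
# sorted unique training values -> 0..n-1.
classes = np.unique(train_col)
mappings = {label: code for code, label in enumerate(classes)}

plain = lambda x: mappings.get(x, -1)

# Calling the plain lambda on the array passes the whole array as the key.
try:
    plain(test_col)
    raised = False
except TypeError:
    raised = True

# np.vectorize calls the lambda once per element instead.
# otypes=[int] fixes the output dtype up front.
encoded = np.vectorize(plain, otypes=[int])(test_col)

print(raised)   # the plain call failed with a TypeError
print(encoded)  # unseen value "purple" is encoded as -1
```

Note that np.vectorize is a convenience wrapper around a Python-level loop, so it should be about as fast as the explicit loop in P.S.2, not faster.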

bluesummers
  • Hey, thanks (upvote) ;) but just out of curiosity why the whole vector is sent? In many other examples which I have seen the individual elements are sent in this way. For example try `import numpy as np; x = np.array([1, 2, 3, 4, 5]); f = lambda x: x ** 2; squares = f(x); print(squares)`. It works fine without any `vectorizer` etc. – Outcast Sep 05 '19 at 10:51
  • This is because `x ** 2` is an operator supported by `np.ndarray`s (like other numpy functions), but dictionary lookup is not. So actually in your examples the whole vector is sent too, it just knows that you want to apply that function to each element. – bluesummers Sep 05 '19 at 10:56
  • Hm ok but I am not sure in what way `x ** 2` is really supported by `np.ndarrays`. This is just an operation/function as the dictionary lookup so I do not see the difference really. Also as you see at my `P.S.2` the lookup works just fine for `np.ndarrays`. So it is more about the `lambda` function I think than the `np.ndarray`. – Outcast Sep 05 '19 at 11:00
  • In P.S.2 you are not operating on the whole array; you pass the elements one at a time. NumPy arrays implement operators such as `**`, `+` and `-` themselves, so using one of them on an array with a scalar applies it to each element. The lambda is not the problem, and you can see this if you try `np.array(list(map(lambda x: mappings.get(x, -1), X_test[:, column])))`, which will work because `map` explicitly calls the function on each element (in Python 3 the `map` object has to be wrapped in `list` before it is passed to `np.array`). – bluesummers Sep 05 '19 at 11:06
  • Hm maybe; not sure – Outcast Sep 05 '19 at 11:14
  • In the first sentence of the previous comment I meant _In PS 2 it does not work on the whole array, numpy knows to pass elements to it by itself._ – bluesummers Sep 05 '19 at 11:29
  • Hm, I see the point of your comment. You may be right. But I assumed that when `numpy` sees a `lambda` function it passes the elements anyway, no matter what exactly this `lambda` function does. This would make more sense to me. – Outcast Sep 05 '19 at 11:51
  • It does not require a lambda to do so. You can take a numpy array and use `** 2` directly on it, as in `my_np_array ** 2`, and it will work on each element separately even though there is no lambda involved (assuming `my_np_array` is a numpy vector). – bluesummers Sep 05 '19 at 12:09
  • Hm ok, I see your point. I thought that the role of a lambda function was, by definition, to take the elements of a list, numpy array, pandas column etc. But now that I think about it, it is basically what you say. If a function can be applied directly to a whole array (e.g. `my_np_array ** 2`), then the lambda will take the elements of the array, whereas since `.get()` cannot be applied directly to the whole array, the lambda won't automatically take the elements of the array but the array itself. – Outcast Sep 05 '19 at 12:53
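
To make the distinction discussed in these comments concrete, here is a small sketch (the array `values` and the dictionary `lookup` are made up for illustration). In both calls the lambda receives the whole array. `**` succeeds because the array itself implements that operator elementwise, while `dict.get` needs a single hashable key.

```python
import numpy as np

values = np.array([1, 2, 3])
lookup = {1: "a", 2: "b"}  # hypothetical mapping, for illustration only

# What the lambda actually receives: the array, not its elements.
arg_type = (lambda x: type(x).__name__)(values)

# Works: the array's own ** operator applies the power to each element.
squares = (lambda x: x ** 2)(values)

# Fails: dict.get tries to hash the array to use it as a single key.
error = None
try:
    (lambda x: lookup.get(x, -1))(values)
except TypeError as exc:
    error = str(exc)

# Works: the lookup is applied explicitly to one element at a time.
per_element = [lookup.get(v, -1) for v in values.tolist()]

print(arg_type)
print(squares)
print(error)
print(per_element)
```

So the lambda behaves identically in both cases; what differs is whether the operation inside it knows how to handle an array.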