One hot encoding from numpy

Question

I am trying to understand the values output by an example from a Python tutorial. The output doesn't seem to be in any order that I can understand. These particular Python lines are causing me trouble:

vocab_size = 13   #just to provide all variable values
m = 84 #just to provide all variable values
Y_one_hot = np.zeros((vocab_size, m))
Y_one_hot[Y.flatten(), np.arange(m)] = 1

The input Y.flatten() evaluates to the following NumPy array:

  [ 8  9  7  4  9  7  8  4  8  7  8 12  4  8  9  8 12  7  8  9  7 12  7  2
  9  7  8  7  2  0  7  8 12  2  0  8  8 12  7  0  8  6 12  7  2  8  6  5
  7  2  0  6  5 10  2  0  8  5 10  1  0  8  6 10  1  3  8  6  5  1  3 11
  6  5 10  3 11  5 10  1 11 10  1  3]

np.arange(m) is an array of the integers from 0 to 83:

np.arange(m)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83]

The output I am having trouble understanding is the new Y_one_hot. I receive a NumPy array with 13 rows (as expected), but I do not understand why the ones are located where they are, given the Y.flatten() input. For example, here is the first of the 13 rows:

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0]

Could someone please explain how I got from that input value to that output array from that single line? It seems like the ones are in random positions and in some other arrays of the 13 the number of ones also seems to be random. Is this the intended behavior?

Here is a full runnable example:

import numpy as np
import sys

# turn Y into one hot encoding
Y =  np.array([ 8,  9,  7,  4 , 9,  7,  8,  4,  8,  7,  8, 12,  4,  8,  9,  8, 12,  7,  8,  9,  7, 12,  7,  2,
  9,  7,  8,  7,  2,  0,  7,  8, 12,  2,  0,  8,  8, 12,  7,  0,  8,  6, 12,  7,  2,  8,  6,  5,
  7,  2,  0,  6,  5, 10,  2,  0,  8,  5, 10,  1,  0,  8,  6, 10,  1,  3,  8,  6,  5,  1,  3, 11,
  6,  5, 10,  3, 11,  5, 10,  1, 11, 10,  1,  3])
m = 84
vocab_size = 13
Y_one_hot = np.zeros((vocab_size, m))
Y_one_hot[Y.flatten(), np.arange(m)] = 1
np.set_printoptions(threshold=sys.maxsize)
print(Y_one_hot.astype(int))

`Y.flatten()` is selecting indices in the first dimension. `np.arange(m)` is selecting indices in the second dimension. - Using the first item from each - `Y_one_hot[8,0] = 1`. — wwii, Jan 09 '21 at 14:13
`Is this the intended behavior?` - are you asking why your assignment expression worked that way or are you asking if that is the correct way to make the encoding? — wwii, Jan 09 '21 at 14:15
Both in a way. I'm reading the answers now to try to understand the behaviour as it applies exactly to the values of the example I posted (at least the answers explain the behaviour with minimal examples). It's the concept of the columns that seems to be confusing: in the answers posted I can follow why there is a 1 in, say, the 4th column of the first array, but the dimensionality of my 13 by 84 numpy array confuses me as to how the first 1 ends up in the 30th column of the first array, so I'm trying to understand the system there... — D3181, Jan 09 '21 at 15:15
`np.vstack((Y,np.arange(m))).T` will show you how the indices are being *paired up*. You can see that the 30th entry (`np.vstack((Y,np.arange(m))).T[29]`) is `[0,29]`. So your expression is assigning a one to `Y_one_hot[0,29]` - if that still is not making sense to you, you need to spend more time with the [Numpy documentation](https://numpy.org/doc/stable/user/tutorials_index.html) and playing around with the examples - SO isn't a Tutorial. The doc reference linked to in jakevdp's answer is relevant to your question. — wwii, Jan 09 '21 at 15:43

Ivan · Answer 1 · 2021-01-10T15:04:39.173

The code you showed is a quick way to convert multiple label indices to one-hot-encodings.

Let's do it with a single index, and convert it to a one-hot-encoding vector. To keep it simple, we will stick with an encoding size of 10 (i.e. nine 0s and one 1):

>>> y = 4
>>> y_ohe = np.zeros(10)
>>> y_ohe[y] = 1
>>> y_ohe
array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0.])

Now, let's try with more than one index: 4 labels at the same time. The starting array would be two-dimensional: (4, 10), i.e. a one-hot-encoding vector of size 10 per label.

>>> y = np.array([4, 2, 1, 7])
>>> y_ohe = np.zeros((4, 10))
>>> y_ohe
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

The desired result is:

array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]])

To do so we will index by row and by column: np.arange(len(y)) gives us all the row indices, while y gives us the columns where the 1s are supposed to be. Since np.arange(len(y)) and y have the same length, NumPy pairs them up element by element, much as zip would:

>>> for i, j in zip(np.arange(len(y)), y):
...     print(i, j)
...
0 4
1 2
2 1
3 7

These are the [i, j] coordinates in the 2D tensor y_ohe where we want 1s to be.

Assign 1 to the indexed positions:

>>> y_ohe[np.arange(len(y)), y] = 1
>>> y_ohe
array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]])

Similarly, by indexing the other way around:

>>> y = np.array([4, 2, 1, 7])
>>> y_ohe = np.zeros((10, 4))
>>> y_ohe[y, np.arange(len(y))] = 1
>>> y_ohe
array([[0., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
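
The two layouts are simply transposes of each other, which can be checked directly. This is a minimal sketch reusing the y from above; the names rows_ohe and cols_ohe are only illustrative.

```python
import numpy as np

y = np.array([4, 2, 1, 7])
idx = np.arange(len(y))

# One label per row: shape (4, 10)
rows_ohe = np.zeros((len(y), 10))
rows_ohe[idx, y] = 1

# One label per column: shape (10, 4)
cols_ohe = np.zeros((10, len(y)))
cols_ohe[y, idx] = 1

# Swapping the order of the index arrays swaps the axes of the result
print(np.array_equal(rows_ohe.T, cols_ohe))  # True
```

The question's Y_one_hot uses the second layout: one column per sample, one row per vocabulary entry.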

In your case Y had an extra dimension, something like Y = np.array([[4], [2], [1], [7]]) in terms of the example above. Flattening it gives the 1-D y used here.
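
The flattening matters because index arrays of different shapes are broadcast against each other rather than paired element by element. The sketch below assumes a column-shaped array, here called Y_col (a hypothetical stand-in, not the question's actual data), and compares the two cases.

```python
import numpy as np

Y_col = np.array([[4], [2], [1], [7]])  # shape (4, 1), hypothetical column-shaped labels
cols = np.arange(4)                     # shape (4,)

# Flattened: index shapes (4,) and (4,) are paired element-wise
flat_ohe = np.zeros((10, 4))
flat_ohe[Y_col.flatten(), cols] = 1
print(int(flat_ohe.sum()))   # 4

# Not flattened: index shapes (4, 1) and (4,) broadcast to (4, 4),
# so every listed row gets a 1 in every column
bcast_ohe = np.zeros((10, 4))
bcast_ohe[Y_col, cols] = 1
print(int(bcast_ohe.sum()))  # 16
```

Only the flattened version places exactly one 1 per column.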

jakevdp · Accepted Answer · 2021-01-09T17:26:41.400

The line Y_one_hot[Y.flatten(), np.arange(m)] = 1 is setting values of the array using arrays of integer indices (documented under Integer Array Indexing).

The arrays of indices are broadcast together, and the result for 1D arrays is essentially an efficient way to do this:

for i, j in zip(Y.flatten(), np.arange(m)):
    Y_one_hot[i, j] = 1

In words, each column of Y_one_hot corresponds to an entry of Y.flatten(), and has a single nonzero value in the row given by the entry.

It may be easier to see with a smaller array:

Y_onehot = np.zeros((2, 3), dtype=int)
Y = np.array([0, 1, 0])

Y_onehot[Y.flatten(), np.arange(3)] = 1

print(Y_onehot)
# [[1 0 1]
#  [0 1 0]]

Three entries map to three columns, and each column has a single nonzero entry in the row corresponding to the value.
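
Applied to the data from the question, the same reasoning accounts for the first row. The sketch below rebuilds Y_one_hot both ways and lists the columns where row 0 is set; the name Y_loop is only illustrative.

```python
import numpy as np

Y = np.array([ 8,  9,  7,  4,  9,  7,  8,  4,  8,  7,  8, 12,  4,  8,  9,  8, 12,  7,  8,  9,  7, 12,  7,  2,
               9,  7,  8,  7,  2,  0,  7,  8, 12,  2,  0,  8,  8, 12,  7,  0,  8,  6, 12,  7,  2,  8,  6,  5,
               7,  2,  0,  6,  5, 10,  2,  0,  8,  5, 10,  1,  0,  8,  6, 10,  1,  3,  8,  6,  5,  1,  3, 11,
               6,  5, 10,  3, 11,  5, 10,  1, 11, 10,  1,  3])
m, vocab_size = 84, 13

# Vectorised assignment, as in the question
Y_one_hot = np.zeros((vocab_size, m), dtype=int)
Y_one_hot[Y.flatten(), np.arange(m)] = 1

# Explicit loop over the paired indices
Y_loop = np.zeros((vocab_size, m), dtype=int)
for i, j in zip(Y.flatten(), np.arange(m)):
    Y_loop[i, j] = 1

print(np.array_equal(Y_one_hot, Y_loop))  # True

# Row 0 is set exactly at the positions where Y equals 0
print(np.flatnonzero(Y == 0))             # [29 34 39 50 55 60]
print(np.flatnonzero(Y_one_hot[0]))       # [29 34 39 50 55 60]

# The number of ones in each row is how often that label occurs in Y
print(np.array_equal(Y_one_hot.sum(axis=1), np.bincount(Y, minlength=vocab_size)))  # True
```

So the ones are not random: row k marks the positions in Y holding the value k, and rows differ in their number of ones because the labels occur with different frequencies.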

edited Jan 09 '21 at 17:26

answered Jan 09 '21 at 13:52

You might want to *show* how the indices get paired up - all the `[i, j]`'s from your example. ... `np.vstack((Y,np.arange(m))).T`. OP still isn't seeing it. – wwii Jan 09 '21 at 15:47
Or add, `for i, j in zip(Y.flatten(), np.arange(m)): print(f'Y_one_hot[{i}, {j}] = 1') ` – wwii Jan 09 '21 at 15:52
Both answers gave a good breakdown of the problem and helped me to understand how the values were assigned. With the addition of wwii's comments I was able to understand the logic of what was happening in this example more easily, so it was hard to choose a "correct answer" for my question. I recommend anyone reading this to review both answers: both are valid, and now that I understand what is happening, both make sense. – D3181 Jan 09 '21 at 17:10
