How to one-hot-encode sentences at the character level?

Question

I would like to convert a sentence to an array of one-hot vectors. Each vector would be the one-hot representation of a letter of the alphabet. It would look like the following:

"hello" # h=7, e=4, l=11, o=14

would become

[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

Unfortunately, OneHotEncoder from sklearn does not accept strings as input.

Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation. [on topic](http://stackoverflow.com/help/on-topic) and [how to ask](http://stackoverflow.com/help/how-to-ask) apply here. StackOverflow is not a design, coding, research, or tutorial service. — Prune, Apr 25 '17 at 18:32
That said, consult the documentation on the built-in **chr** and **ord** functions. — Prune, Apr 25 '17 at 18:33
What I have tried so far is the following (applied to each sentence in a corpus), but I was wondering if a simpler solution exists: `sentence_chars = [c for c in sentence.lower() if c in alphabet]`, then `ohv = label_binarize(sentence_chars, classes=list(alphabet))`, then `ohv = ohv.astype(bool)` — user6903745, Apr 25 '17 at 18:44
OneHotEncoder from sklearn has now merged with CategoricalEncoder so this should be possible with sklearn.preprocessing.OneHotEncoder(categories="auto") now. (This is the default representation for sequential models like LSTMs) https://github.com/scikit-learn/scikit-learn/blob/e27242a62d18425886e540c213da044f209d43a8/sklearn/preprocessing/_encoders.py#L106 — devssh, Jul 02 '18 at 09:57

blacksite · Accepted Answer · 2017-04-25T18:48:31.287

Just compare the letters in your passed string to a given alphabet:

import string

def string_vectorizer(strng, alphabet=string.ascii_lowercase):
    # One row per character of strng; a 1 marks the matching alphabet position.
    vector = [[0 if char != letter else 1 for char in alphabet]
              for letter in strng]
    return vector

Note that with a custom alphabet (e.g. "defbcazk"), the columns will be ordered as the characters appear in that alphabet.

The output of string_vectorizer('hello'):

[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
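
For whole sentences, the same comparison can be driven by an alphabet that also contains a space. The following is only a sketch of one way to extend the idea, not part of the answer above: `sentence_vectorizer` and `ALPHABET_WITH_SPACE` are hypothetical names, and the choice to map characters outside the alphabet to an all-zero row is an assumption.

```python
import string

# Hypothetical alphabet: 26 lowercase letters plus a space in the last column.
ALPHABET_WITH_SPACE = string.ascii_lowercase + " "

def sentence_vectorizer(sentence, alphabet=ALPHABET_WITH_SPACE):
    """One-hot encode each character of `sentence` against `alphabet`."""
    # Precompute character -> column index so each lookup is a dict access.
    index = {ch: i for i, ch in enumerate(alphabet)}
    rows = []
    for ch in sentence.lower():
        row = [0] * len(alphabet)
        col = index.get(ch)
        if col is not None:  # assumption: unknown characters stay all-zero
            row[col] = 1
        rows.append(row)
    return rows

encoded = sentence_vectorizer("Hi you!")
print(len(encoded), len(encoded[0]))  # 7 rows, 27 columns
```

Whether punctuation should be dropped, kept as a zero row, or given its own column depends on the downstream model.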

kmario23 · Answer 2 · 2020-04-17T15:30:29.400

This is a common task with recurrent neural networks, and TensorFlow has a function specifically for this purpose, if you'd like to use it.

import numpy as np
import tensorflow as tf

alphabets = {'a' : 0, 'b': 1, 'c':2, 'd':3, 'e':4, 'f':5, 'g':6, 'h':7, 'i':8, 'j':9, 'k':10, 'l':11, 'm':12, 'n':13, 'o':14}

idxs = [alphabets[ch] for ch in 'hello']
print(idxs)
# [7, 4, 11, 11, 14]

# @divakar's approach (np.fromstring is deprecated for binary data, so np.frombuffer is used here)
idxs = np.frombuffer(b"hello", dtype=np.uint8) - 97

# or, to make the offset explicit, use:
idxs = np.frombuffer(b"hello", dtype=np.uint8) - ord('a')

one_hot = tf.one_hot(idxs, 26, dtype=tf.uint8)
# TensorFlow 1.x API; with eager execution in 2.x, one_hot.numpy() gives the array directly
sess = tf.InteractiveSession()

In [15]: one_hot.eval()
Out[15]: 
array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)

And why do we have to subtract 97? I figured out one_hot does not work properly without that... — gregoruar, Apr 17 '20 at 13:14
@gregoruar Because 97 is the ASCII code of `a`, the first lowercase letter, so subtracting it maps `a`..`z` to the indices 0..25. See this page for more details: https://theasciicode.com.ar/ascii-printable-characters/minus-sign-hyphen-ascii-code-45.html — kmario23, Apr 17 '20 at 15:16
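
The offset can be checked directly in plain Python. This small sketch uses only the built-ins `ord` and `chr`; the variable names are illustrative. Without the subtraction, the raw codes (104, 101, ...) would fall outside the 0..25 range that a 26-column encoding expects.

```python
# ord() gives the code point of a character; lowercase ASCII letters are contiguous.
print(ord("a"))  # 97

# Subtracting ord("a") maps 'a'..'z' to the column indices 0..25.
idxs = [ord(ch) - ord("a") for ch in "hello"]
print(idxs)  # [7, 4, 11, 11, 14]

# chr() inverts the mapping.
decoded = "".join(chr(i + ord("a")) for i in idxs)
print(decoded)  # hello
```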

ayhan · Answer 3 · 2017-04-25T18:45:41.210

With pandas, you can use pd.get_dummies by passing a categorical Series:

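import pandas as pd
import string
low = string.ascii_lowercase
s = 'hello'

# astype('category', categories=...) was removed from pandas; pass a CategoricalDtype instead.
# dtype=int keeps the 0/1 output (recent pandas versions default to booleans).
pd.get_dummies(pd.Series(list(s)).astype(pd.CategoricalDtype(categories=list(low))), dtype=int)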
Out: 
   a  b  c  d  e  f  g  h  i  j ...  q  r  s  t  u  v  w  x  y  z
0  0  0  0  0  0  0  0  1  0  0 ...  0  0  0  0  0  0  0  0  0  0
1  0  0  0  0  1  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
2  0  0  0  0  0  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
3  0  0  0  0  0  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
4  0  0  0  0  0  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0

[5 rows x 26 columns]

score 3 · Answer 4 · answered Apr 25 '17 at 18:52

Here's a vectorized approach using NumPy broadcasting to give us a (N,26) shaped array -

import numpy as np
ints = np.frombuffer(b"hello", dtype=np.uint8) - 97  # np.fromstring is deprecated for binary data
out = (ints[:,None] == np.arange(26)).astype(int)

If you are looking for performance, I would suggest initializing an array of zeros and then assigning the ones -

out = np.zeros((len(ints),26),dtype=int)
out[np.arange(len(ints)), ints] = 1

Sample run -

In [153]: ints = np.frombuffer(b"hello", dtype=np.uint8) - 97

In [154]: ints
Out[154]: array([ 7,  4, 11, 11, 14], dtype=uint8)

In [155]: out = (ints[:,None] == np.arange(26)).astype(int)

In [156]: print(out)
[[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]]
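
Another common NumPy idiom, not mentioned in the answer above, is to index the rows of an identity matrix. The sketch below assumes lowercase ASCII input (anything else would produce out-of-range indices), and `one_hot_eye` is a hypothetical helper name.

```python
import numpy as np

def one_hot_eye(word, depth=26):
    """One-hot encode a lowercase ASCII word by indexing an identity matrix."""
    # Byte values of the word, shifted so that 'a' -> 0, ..., 'z' -> 25.
    ints = np.frombuffer(word.encode("ascii"), dtype=np.uint8) - ord("a")
    # Row i of the identity matrix is the one-hot vector for index i.
    return np.eye(depth, dtype=int)[ints]

out = one_hot_eye("hello")
print(out.shape)                    # (5, 26)
print(out.argmax(axis=1).tolist())  # [7, 4, 11, 11, 14]
```

This trades a little memory (the 26x26 identity matrix) for a one-line lookup.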

score 2 · Answer 5 · answered Apr 25 '17 at 18:36

You asked about "sentences" but your example provided only a single word, so I'm not sure what you wanted to do about spaces. But as far as single words are concerned, your example could be implemented with:

def onehot(ltr):
    return [1 if i==ord(ltr) else 0 for i in range(97,123)]

def onehotvec(s):
    return [onehot(c) for c in list(s.lower())]

onehotvec("hello")
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
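
If spaces are meant to separate words, one possible reading (an assumption, since the question does not say) is to encode each word on its own. The sketch below repeats the two functions so that it runs standalone; `onehot_sentence` is a hypothetical helper.

```python
def onehot(ltr):
    # 1 in the column matching the letter's code point ('a' is 97), 0 elsewhere.
    return [1 if i == ord(ltr) else 0 for i in range(97, 123)]

def onehotvec(s):
    return [onehot(c) for c in s.lower()]

def onehot_sentence(sentence):
    """Hypothetical helper: one one-hot matrix per whitespace-separated word."""
    return [onehotvec(word) for word in sentence.split()]

matrices = onehot_sentence("hello world")
print([len(m) for m in matrices])  # [5, 5]
```

Punctuation and digits would still produce all-zero rows here, because only the codes 97 to 122 have a column.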
