3

I would like to convert a sentence to an array of one-hot vectors. Each vector would be the one-hot representation of one letter of the alphabet. It would look like the following:

"hello" # h=7, e=4 l=11 o=14

would become

[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

Unfortunately, OneHotEncoder from sklearn does not accept strings as input.

kmario23
user6903745
  • What have you tried so far? Show us some code! – Klaus D. Apr 25 '17 at 18:28
  • Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation. [on topic](http://stackoverflow.com/help/on-topic) and [how to ask](http://stackoverflow.com/help/how-to-ask) apply here. StackOverflow is not a design, coding, research, or tutorial service. – Prune Apr 25 '17 at 18:32
  • That said, consult the documentation on the **chr** and **ord** methods. – Prune Apr 25 '17 at 18:33
  • What I have tried so far is the following (to be applied to each sentence in a corpus), but I was wondering if a simpler solution exists: `sentence_chars = [c for c in sentence.lower() if c in alphabet]`, then `ohv = label_binarize(sentence_chars, classes=list(alphabet))`, then `ohv = ohv.astype(bool)` – user6903745 Apr 25 '17 at 18:44
  • OneHotEncoder from sklearn has now merged with CategoricalEncoder, so this should be possible with sklearn.preprocessing.OneHotEncoder(categories="auto") now. (This is the default representation for sequential models like LSTMs.) https://github.com/scikit-learn/scikit-learn/blob/e27242a62d18425886e540c213da044f209d43a8/sklearn/preprocessing/_encoders.py#L106 – devssh Jul 02 '18 at 09:57
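
To illustrate the comment above, here is a minimal sketch of what the scikit-learn route might look like. It assumes a scikit-learn release whose OneHotEncoder accepts the `categories` parameter (0.20 or later), and it passes the alphabet explicitly instead of `"auto"` so that all 26 columns appear even when a letter is missing from the input. The variable names are only illustrative.

```python
import string

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Assumption: fix the column order to 'a'..'z' rather than inferring it from the data.
alphabet = list(string.ascii_lowercase)
encoder = OneHotEncoder(categories=[alphabet], handle_unknown="ignore")

# The encoder expects a 2D array: one row per character, one feature column.
chars = np.array(list("hello")).reshape(-1, 1)

# fit_transform returns a sparse matrix by default; densify it for display.
one_hot = encoder.fit_transform(chars).toarray().astype(int)
print(one_hot.shape)  # (5, 26)
```

With `handle_unknown="ignore"`, characters outside the alphabet (spaces, punctuation) would come out as all-zero rows rather than raising an error.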

5 Answers

10

Just compare the letters in your passed string to a given alphabet:

import string

def string_vectorizer(strng, alphabet=string.ascii_lowercase):
    # One row per letter of strng; each row has a single 1 in the column of that letter
    vector = [[0 if char != letter else 1 for char in alphabet]
                  for letter in strng]
    return vector

Note that with a custom alphabet (e.g. "defbcazk"), the columns will be ordered as the characters appear in that alphabet.
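
As a quick illustration of that column ordering, here is a small sketch that repeats the function and encodes a made-up word ("bad") against the custom alphabet; the word and variable names are just for demonstration.

```python
import string

def string_vectorizer(strng, alphabet=string.ascii_lowercase):
    # One row per letter of strng, one column per character of alphabet.
    return [[1 if char == letter else 0 for char in alphabet]
            for letter in strng]

custom = "defbcazk"  # columns follow this order: d, e, f, b, c, a, z, k
vectors = string_vectorizer("bad", alphabet=custom)

for letter, row in zip("bad", vectors):
    print(letter, row)
# b [0, 0, 0, 1, 0, 0, 0, 0]
# a [0, 0, 0, 0, 0, 1, 0, 0]
# d [1, 0, 0, 0, 0, 0, 0, 0]
```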

The output of string_vectorizer('hello'):

[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
blacksite
9

This is a common task in Recurrent Neural Networks and there's a specific function just for this purpose in tensorflow, if you'd like to use it.

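import numpy as np
import tensorflow as tf

# Map each lowercase letter to its index: {'a': 0, 'b': 1, ..., 'z': 25}
alphabets = {ch: i for i, ch in enumerate('abcdefghijklmnopqrstuvwxyz')}

idxs = [alphabets[ch] for ch in 'hello']
print(idxs)
# [7, 4, 11, 11, 14]

# @divakar's approach (np.frombuffer replaces the deprecated binary mode of np.fromstring)
idxs = np.frombuffer(b"hello", dtype=np.uint8) - 97

# or for more clear understanding, use:
idxs = np.frombuffer(b'hello', dtype=np.uint8) - ord('a')

one_hot = tf.one_hot(idxs, 26, dtype=tf.uint8)
# TensorFlow 1.x session API; with TensorFlow 2.x eager execution, one_hot.numpy() gives the array directly
sess = tf.InteractiveSession()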

In [15]: one_hot.eval()
Out[15]: 
array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)
kmario23
  • And why do we have to subtract 97? I figured out one_hot does not work properly without that... – gregoruar Apr 17 '20 at 13:14
  • @gregoruar Because 97 is the ASCII code of the first lowercase letter (i.e. `a`), so subtracting it maps `a`..`z` to the indices 0..25. See this page for more details: https://theasciicode.com.ar/ascii-printable-characters/minus-sign-hyphen-ascii-code-45.html – kmario23 Apr 17 '20 at 15:16
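
The offset discussed in these comments can be checked with plain Python; this is only a small sketch with illustrative variable names.

```python
word = "hello"

# Raw ASCII codes of the characters.
codes = [ord(ch) for ch in word]
print(codes)  # [104, 101, 108, 108, 111]

# Subtracting ord('a') == 97 shifts the codes into the range 0..25,
# which is what a 26-column one-hot encoding expects as indices.
idxs = [code - ord("a") for code in codes]
print(idxs)  # [7, 4, 11, 11, 14]
```

Without the subtraction, the indices would be 104, 101 and so on, which fall outside a depth of 26, so no column in the row would be set.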
3

With pandas, you can use pd.get_dummies by passing a categorical Series:

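import pandas as pd
import string
low = string.ascii_lowercase
s = "hello"

# CategoricalDtype fixes the columns to all 26 letters, in alphabetical order
letters = pd.Series(list(s)).astype(pd.CategoricalDtype(categories=list(low)))
pd.get_dummies(letters, dtype=int)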
Out: 
   a  b  c  d  e  f  g  h  i  j ...  q  r  s  t  u  v  w  x  y  z
0  0  0  0  0  0  0  0  1  0  0 ...  0  0  0  0  0  0  0  0  0  0
1  0  0  0  0  1  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
2  0  0  0  0  0  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
3  0  0  0  0  0  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0
4  0  0  0  0  0  0  0  0  0  0 ...  0  0  0  0  0  0  0  0  0  0

[5 rows x 26 columns]
ayhan
3

Here's a vectorized approach using NumPy broadcasting to give us a (N,26) shaped array -

import numpy as np

# np.frombuffer replaces the deprecated binary mode of np.fromstring
ints = np.frombuffer(b"hello", dtype=np.uint8) - 97
out = (ints[:,None] == np.arange(26)).astype(int)

If you are looking for performance, I would suggest using a zero-initialized array and then assigning the ones -

out = np.zeros((len(ints),26),dtype=int)
out[np.arange(len(ints)), ints] = 1
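
For reference, here is a self-contained sketch that wraps the assignment approach in a function and checks it against the broadcasting version. The function name `one_hot_word` is made up for illustration, and the sketch assumes the input contains only lowercase ASCII letters.

```python
import numpy as np

def one_hot_word(word):
    """Return an (N, 26) int array of one-hot rows for a lowercase ASCII word."""
    # Byte values of the characters, shifted so that 'a' -> 0, ..., 'z' -> 25.
    ints = np.frombuffer(word.encode("ascii"), dtype=np.uint8) - ord("a")
    out = np.zeros((len(ints), 26), dtype=int)
    # Set one element per row: row i gets a 1 in column ints[i].
    out[np.arange(len(ints)), ints] = 1
    return out

assigned = one_hot_word("hello")

# Broadcasting version of the same computation, for comparison.
ints = np.frombuffer(b"hello", dtype=np.uint8) - ord("a")
broadcast = (ints[:, None] == np.arange(26)).astype(int)

print(assigned.shape)                       # (5, 26)
print(np.array_equal(assigned, broadcast))  # True
```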

Sample run -

In [153]: ints = np.frombuffer(b"hello", dtype=np.uint8) - 97

In [154]: ints
Out[154]: array([ 7,  4, 11, 11, 14], dtype=uint8)

In [155]: out = (ints[:,None] == np.arange(26)).astype(int)

In [156]: print(out)
[[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]]
Divakar
2

You asked about "sentences" but your example provided only a single word, so I'm not sure what you wanted to do about spaces. But as far as single words are concerned, your example could be implemented with:

def onehot(ltr):
    return [1 if i == ord(ltr) else 0 for i in range(97, 123)]

def onehotvec(s):
    return [onehot(c) for c in s.lower()]

onehotvec("hello")
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
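
For whole sentences, one possible way to deal with spaces is sketched below. It rests on two assumptions that are not in the question: the space gets its own 27th column, and any other character outside the alphabet (punctuation, digits) is simply skipped. The names `ALPHABET` and `onehot_sentence` are hypothetical.

```python
import string

# Assumption: the space is appended as an extra column after the 26 letters.
ALPHABET = string.ascii_lowercase + " "

def onehot_sentence(sentence, alphabet=ALPHABET):
    index = {ch: i for i, ch in enumerate(alphabet)}
    rows = []
    for ch in sentence.lower():
        if ch not in index:
            continue  # assumption: drop characters that are not in the alphabet
        row = [0] * len(alphabet)
        row[index[ch]] = 1
        rows.append(row)
    return rows

vectors = onehot_sentence("Hi, you")
print([row.index(1) for row in vectors])  # [7, 8, 26, 24, 14, 20]
```

If spaces should be dropped instead, passing `alphabet=string.ascii_lowercase` gives plain 26-column rows.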
MassPikeMike