One Hot Encoding for representing corpus sentences in python

Question

I am new to Python and the scikit-learn library. I am working on an NLP project that first needs to represent a large corpus with one-hot encoding. I have read scikit-learn's documentation for preprocessing.OneHotEncoder, but it does not seem to match what I understand by the term.

Basically, the idea is similar to the following:

1000000 Sunday; 0100000 Monday; 0010000 Tuesday; ... 0000001 Saturday;

If the corpus has only 7 different words, then I only need a 7-digit vector to represent each word. A complete sentence can then be represented by combining the vectors of all its words into a sentence matrix. However, when I tried this in Python, it did not seem to work.

How can I work this out? My corpus has a very large number of different words.

Also, since the vectors are mostly filled with zeros, it seems we could use scipy.sparse to keep the storage small, for example with the CSR format.

Hence, my full question is:

How can the sentences in a corpus be represented with OneHotEncoder and stored in a sparse matrix?

Thank you guys.

score 6 · Accepted Answer · answered May 21 '15 at 00:21

In order to use the OneHotEncoder, you can split your documents into tokens and then map every token to an id (that is always the same for the same string). Then apply the OneHotEncoder to that list. The result is by default a sparse matrix.

Example code for two simple documents A B and B B:

from sklearn.preprocessing import OneHotEncoder
import itertools

# two example documents
docs = ["A B", "B B"]

# split documents to tokens
tokens_docs = [doc.split(" ") for doc in docs]

# convert list of token-lists to one flat list of tokens
# and then create a dictionary that maps word to id of word,
# like {A: 0, B: 1} here (sorted so the ids are reproducible)
all_tokens = itertools.chain.from_iterable(tokens_docs)
word_to_id = {token: idx for idx, token in enumerate(sorted(set(all_tokens)))}

# convert token lists to token-id lists, e.g. [[0, 1], [1, 1]] here
token_ids = [[word_to_id[token] for token in tokens_doc] for tokens_doc in tokens_docs]

# convert list of token-id lists to one-hot representation;
# recent scikit-learn versions take `categories` (one list of ids per
# token position) in place of the removed `n_values` argument
vocab_ids = list(range(len(word_to_id)))
vec = OneHotEncoder(categories=[vocab_ids] * len(token_ids[0]))
X = vec.fit_transform(token_ids)

print(X.toarray())

Prints (one hot vectors in concatenated form per document):

[[ 1.  0.  0.  1.]
 [ 0.  1.  0.  1.]]
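
If you would rather have the "sentence matrix" described in the question (one row per token, one column per vocabulary word) than the concatenated form, here is a minimal sketch that builds it directly with scipy.sparse.csr_matrix. It reuses the two toy documents from above. The helper name sentence_to_onehot is made up for illustration and is not part of scikit-learn or SciPy.

```python
import numpy as np
from scipy.sparse import csr_matrix

# same two example documents as above
docs = ["A B", "B B"]
tokens_docs = [doc.split(" ") for doc in docs]

# sorted() keeps the word -> id mapping reproducible between runs
vocab = sorted({token for tokens in tokens_docs for token in tokens})
word_to_id = {token: idx for idx, token in enumerate(vocab)}


def sentence_to_onehot(tokens, word_to_id):
    """Hypothetical helper: one row per token, one column per vocabulary word."""
    rows = np.arange(len(tokens))
    cols = np.array([word_to_id[t] for t in tokens], dtype=np.intp)
    data = np.ones(len(tokens), dtype=np.int8)
    # each (row, col) pair marks the single 1 in that token's one-hot vector
    return csr_matrix((data, (rows, cols)), shape=(len(tokens), len(word_to_id)))


sentence_matrices = [sentence_to_onehot(tokens, word_to_id) for tokens in tokens_docs]
print(sentence_matrices[0].toarray())
```

Each matrix stores only one nonzero entry per token, so this should stay compact with a large vocabulary. Sentences of different lengths simply produce matrices with different numbers of rows.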

How do you handle the situation where docs = ["A B", "B B C"]? For example, tweets don't always have the same length and contain different words. — bmc, Apr 06 '17 at 15:05
One common approach is to set a maximum length, pad shorter texts with a padding token, and cut off longer texts. — fotis j, May 04 '19 at 12:31
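
The pad-and-truncate idea from the comment above could look roughly like this. It is only a sketch under a few assumptions: the maximum length of 3 and the "<pad>" token are arbitrary choices for illustration, and the padding string is assumed not to occur in the real documents.

```python
from sklearn.preprocessing import OneHotEncoder

docs = ["A B", "B B C"]
max_len = 3      # assumed maximum document length, chosen for this example
PAD = "<pad>"    # assumed padding token, must not occur in the documents

tokens_docs = [doc.split(" ") for doc in docs]

# pad short documents and cut off long ones so every row has max_len tokens
padded = [(tokens + [PAD] * max_len)[:max_len] for tokens in tokens_docs]

# the padding token gets id 0, real words follow in sorted order
vocab = [PAD] + sorted({t for tokens in tokens_docs for t in tokens})
word_to_id = {token: idx for idx, token in enumerate(vocab)}
token_ids = [[word_to_id[t] for t in tokens] for tokens in padded]

# one block of len(vocab) columns per token position
encoder = OneHotEncoder(categories=[list(range(len(vocab)))] * max_len)
X = encoder.fit_transform(token_ids)

print(X.shape)
print(X.toarray())
```

With a vocabulary of four entries (the padding token plus A, B, C) and three positions, every document becomes a row of 12 columns, and the padded position is encoded like any other token.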
