I am a beginner with Python and the Scikit-learn library. I am currently working on an NLP project that first needs to represent a large corpus with one-hot encoding. I have read Scikit-learn's documentation on preprocessing.OneHotEncoder, but it does not seem to match my understanding of the term.
Basically, the idea is as follows:
- 1000000 Sunday; 0100000 Monday; 0010000 Tuesday; ... 0000001 Saturday;
If the corpus has only 7 distinct words, then I only need a 7-digit vector to represent each word. A complete sentence can then be represented by stacking the vectors of its words, one row per word, which gives a sentence matrix. However, when I tried this in Python, it did not seem to work...
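To make the idea concrete, here is a minimal sketch in plain Python of the representation I have in mind. The 7-word vocabulary is just the weekday example above, and the names `word_to_index` and `one_hot_sentence` are illustrative helpers of my own, not anything from Scikit-learn:

```python
# Illustrative 7-word vocabulary; a real corpus would have far more words.
vocab = ["Sunday", "Monday", "Tuesday", "Wednesday",
         "Thursday", "Friday", "Saturday"]

# Map each word to its column position in the one-hot vector.
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot_sentence(words):
    """Return one row per word, each row containing a single 1."""
    matrix = []
    for word in words:
        row = [0] * len(vocab)
        row[word_to_index[word]] = 1
        matrix.append(row)
    return matrix


sentence_matrix = one_hot_sentence(["Monday", "Sunday", "Saturday"])
for row in sentence_matrix:
    print("".join(str(bit) for bit in row))
```

This dense version is fine for 7 words, but every row is as long as the whole vocabulary, so it presumably will not scale to a large corpus.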
How can I work this out? My corpus has a very large number of distinct words.
By the way, since the vectors are mostly filled with zeros, it seems we can use scipy.sparse to keep the storage small, for example with the CSR format.
Hence, my full question is:
How can the sentences in a corpus be represented with OneHotEncoder and stored in a sparse matrix?
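As a rough sketch of the storage I am thinking of, assuming the same illustrative weekday vocabulary and a hypothetical `word_to_index` lookup, a one-hot sentence matrix could be built directly in CSR form with `scipy.sparse.csr_matrix`, so that only the positions of the ones are stored:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative vocabulary; the real one would come from the corpus.
vocab = ["Sunday", "Monday", "Tuesday", "Wednesday",
         "Thursday", "Friday", "Saturday"]
word_to_index = {word: i for i, word in enumerate(vocab)}

sentence = ["Monday", "Sunday", "Saturday"]

# One nonzero entry per word: row = position in sentence, column = word index.
rows = np.arange(len(sentence))
cols = np.array([word_to_index[word] for word in sentence])
data = np.ones(len(sentence), dtype=np.int8)

sparse_sentence = csr_matrix(
    (data, (rows, cols)), shape=(len(sentence), len(vocab))
)

print(sparse_sentence.nnz)        # number of stored (nonzero) entries
print(sparse_sentence.toarray())  # dense view, only sensible for small examples
```

With this layout the number of stored values grows with the sentence length rather than with the vocabulary size.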
Thank you guys.