I'm learning classification. I read about using vectors. But I can't find an algorithm to translate a text with words to a vector. Is it about generating a hash of the words and adding a 1 to the hash location in the vector?
2 Answers
When most people talk about turning text into a feature vector, all they mean is recording the presence, or the count, of each word (token).
There are two main ways to encode such a vector. One is explicit, where you store a 0 for each word that is in your vocabulary but not present in the text. The other is implicit, like a sparse matrix (but just a single vector), where you only encode terms with a frequency value >= 1.
Bag of words model
The article that probably explains this best is the one on the bag of words model, which is used extensively in natural language processing applications.
Explicit BoW vector example:
Suppose you have the vocabulary:
{brown, dog, fox, jumped, lazy, over, quick, the, zebra}
The sentence "the quick brown fox jumped over the lazy dog" could be encoded as:
<1, 1, 1, 1, 1, 1, 1, 2, 0>
Remember, position is important: each slot corresponds to one vocabulary term, in the order the vocabulary lists them.
The sentence "the zebra jumped", even though it is shorter in length, would then be encoded as:
<0, 0, 0, 1, 0, 0, 0, 1, 1>
The problem with the explicit approach is that if you have hundreds of thousands of vocabulary terms, each document vector will also have hundreds of thousands of entries (mostly zeros).
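The explicit encoding above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `explicit_bow` is made up for this example, and the tokenizer is assumed to be a plain lowercase whitespace split, which a real pipeline would likely replace.

```python
from collections import Counter

def explicit_bow(text, vocabulary):
    """Return one count per vocabulary term, in vocabulary order.

    Hypothetical helper for illustration; tokenization is a naive
    lowercase whitespace split.
    """
    counts = Counter(text.lower().split())
    # Counter returns 0 for missing keys, so absent words become 0 entries.
    return [counts[term] for term in vocabulary]

vocabulary = ["brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the", "zebra"]

print(explicit_bow("the quick brown fox jumped over the lazy dog", vocabulary))
# [1, 1, 1, 1, 1, 1, 1, 2, 0]
print(explicit_bow("the zebra jumped", vocabulary))
# [0, 0, 0, 1, 0, 0, 0, 1, 1]
```

Both vectors have the same length as the vocabulary, regardless of how long the sentence is.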
Implicit BoW vector example:
In this case, the sentence "the zebra jumped" could be encoded as:
<'jumped': 1, 'the': 1, 'zebra': 1>
where the order is arbitrary.
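As a rough sketch, the implicit form maps naturally onto a Python dictionary that stores only the words that actually occur. The name `sparse_bow` is hypothetical, and the tokenizer is again assumed to be a naive lowercase whitespace split.

```python
from collections import Counter

def sparse_bow(text):
    """Return a {word: count} mapping containing only words that occur.

    Hypothetical helper for illustration; no vocabulary is needed up front.
    """
    return dict(Counter(text.lower().split()))

encoded = sparse_bow("the zebra jumped")
print(encoded)

# Words that never occur simply have no entry; treat them as 0 on lookup.
print(encoded.get("fox", 0))
```

The storage cost grows with the number of distinct words in the document, not with the size of the vocabulary.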

If you are learning classification, I would start with the easier and more intuitive bag of words representation of your text.
If, however, you are interested in a feature hashing method, particularly if you have a large set of data, I would suggest this article, which describes the use of hashing in text representation and classification.
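For reference, feature hashing is roughly what the question describes: hash each word to an index in a fixed-size vector and add 1 at that position. The sketch below is only an illustration under stated assumptions, not the method of any particular library: the function name `hashed_bow`, the choice of MD5 as a stable hash, and the bucket count of 100 are all picked for the example.

```python
import hashlib

def hashed_bow(text, n_buckets=100):
    """Map a text to a fixed-length count vector via the hashing trick.

    Hypothetical helper for illustration. MD5 is used only because it is
    stable across runs (Python's built-in hash() of str is salted per process).
    """
    vector = [0] * n_buckets
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Interpret the first 8 bytes as an integer and fold into a bucket.
        index = int.from_bytes(digest[:8], "big") % n_buckets
        vector[index] += 1
    return vector

vec = hashed_bow("the quick brown fox jumped over the lazy dog")
print(len(vec), sum(vec))  # 100 9
```

No vocabulary has to be built or stored, and the vector length stays fixed no matter how many distinct words appear. The trade-off is that different words can collide in the same bucket, so the counts become approximate as the number of buckets shrinks.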

- I'm reading the book Mahout in Action. There they use the TextValueEncoder. They show that it returns a vector, but it wasn't clear how the function arrives at that vector. Reading the above answer made it clear. The only thing left is how they fit the above vector into a vector with 100 terms even when you have thousands of features. – broersa Jun 13 '13 at 07:10
- There are different methods to narrow it down to 100 terms. That's called feature selection. You first need to score the features given a labeled training set. One way to score them is to use a chi-squared test. It's talked about [here](http://stackoverflow.com/questions/14573030/perform-chi-2-feature-selection-on-tf-and-tfidf-vectors) and [here](http://scikit-learn.org/0.13/modules/feature_selection.html). I am trying to find better documentation but no luck. I am sure your book has a section on that...? – junkaholik Jun 13 '13 at 20:46
- The link is dead – Zoe Nov 02 '17 at 17:52