Compare documents by sequence vector

Question

I'm trying to classify documents by sequence vector. Basically, I have a vocabulary (more than 5000 words). Each document is converted to a vector of integers so that each element in the vector corresponds to the position of the word in the vocabulary.

For example, if the vocab is [hello, how, are, you, today] and the document is "hello you" then I'll have the vector: [1 4].
Another document of "how are you" will result in [2 3 4].
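
For illustration, the encoding described above could be sketched in Python as follows. This assumes whitespace tokenization, 1-based positions as in the example, and that every word is in the vocabulary; the helper name `encode` is hypothetical.

```python
def encode(document, vocab):
    """Map each word of `document` to its 1-based position in `vocab`."""
    # Build the word -> position lookup once; start=1 matches the example's numbering.
    position = {word: i for i, word in enumerate(vocab, start=1)}
    # Assumption: words are whitespace-separated and all present in the vocabulary.
    return [position[word] for word in document.split()]


vocab = ["hello", "how", "are", "you", "today"]
print(encode("hello you", vocab))    # [1, 4]
print(encode("how are you", vocab))  # [2, 3, 4]
```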

Now I want to assess the similarity between the first and the second vector. As you can see, these vectors don't have the same length. Furthermore, comparing them directly may not make sense because they represent sequences of words. This case is different from a binary (bag-of-words) vector, which records whether a word appears in the document (1 if it appears, otherwise 0), and from a frequency (word count) vector, which records how often each vocabulary word occurs in the document.
Can you give me a suggestion?

probably some recipe involving [containers.Map](http://www.mathworks.com/help/matlab/map-containers.html), [union](http://www.mathworks.com/help/matlab/ref/union.html), and possibly [unique](http://www.mathworks.com/help/matlab/ref/unique.html?searchHighlight=unique) — brown.2179, Dec 09 '15 at 17:21
If it's about the method/recipe then probably better to migrate the question to [CrossValidated](http://stats.stackexchange.com/) — brown.2179, Dec 10 '15 at 15:03

score 1 · Answer 1 · answered Dec 09 '15 at 16:37

The Jaccard similarity is normally used to compare sets (in your case, the words or word sequences of each text). The text is n-grammed (shingled), and then locality-sensitive hashing (for example MinHash) is used to estimate the Jaccard similarity between the resulting shingle sets.
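
As a rough illustration of the shingling idea, here is a minimal Python sketch that computes the exact Jaccard similarity between the two index vectors from the question. It skips locality-sensitive hashing, which mainly matters as an approximation for large collections. The function names `shingles` and `jaccard` and the choice of n are assumptions made for the example.

```python
def shingles(seq, n=2):
    """Return the set of contiguous n-grams (as tuples) found in a sequence."""
    # A sequence shorter than n yields an empty set.
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


doc1 = [1, 4]     # "hello you"
doc2 = [2, 3, 4]  # "how are you"

# Unigrams ignore order: {1, 4} and {2, 3, 4} share one of four distinct words.
print(jaccard(shingles(doc1, 1), shingles(doc2, 1)))  # 0.25
# Bigrams keep local order: {(1, 4)} and {(2, 3), (3, 4)} share nothing.
print(jaccard(shingles(doc1, 2), shingles(doc2, 2)))  # 0.0
```

Because the shingle sets have no fixed size, the differing vector lengths are not a problem, and a larger n makes the score more sensitive to word order.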

There is a whole field dedicated to this - Google is your friend!
