I'm trying to classify documents using sequence vectors. Basically, I have a vocabulary (more than 5000 words). Each document is converted to a vector of integers, so that each element in the vector corresponds to the position of a word in the vocabulary.

For example, if the vocab is [hello, how, are, you, today] and the document is "hello you" then I'll have the vector: [1 4].
Another document of "how are you" will result in [2 3 4].
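
The encoding described above could be sketched in Python roughly like this. The `encode` helper is a hypothetical name, indices are 1-based to match the example, and skipping out-of-vocabulary words is an assumption:

```python
def encode(document, vocab):
    """Map each word of `document` to its 1-based position in `vocab`."""
    index = {word: i for i, word in enumerate(vocab, start=1)}
    # Assumption: words that are not in the vocabulary are simply skipped.
    return [index[w] for w in document.lower().split() if w in index]


vocab = ["hello", "how", "are", "you", "today"]
print(encode("hello you", vocab))    # [1, 4]
print(encode("how are you", vocab))  # [2, 3, 4]
```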

Now what I want is to assess the similarity between the first and the second vector. As you can see, these vectors don't have the same length. Furthermore, comparing them directly element by element may not make sense, because they represent sequences of words. This case is different from a binary (bag-of-words) vector, which records whether a word appears in the document (1 if it appears, 0 otherwise), and also from a frequency (word count) vector, which records how often each vocabulary word occurs in the document.
Can you give me a suggestion?

lenhhoxung
  • probably some recipe involving [containers.Map](http://www.mathworks.com/help/matlab/map-containers.html), [union](http://www.mathworks.com/help/matlab/ref/union.html), and possibly [unique](http://www.mathworks.com/help/matlab/ref/unique.html?searchHighlight=unique) – brown.2179 Dec 09 '15 at 17:21
  • Well, I think it's about the method we use – lenhhoxung Dec 09 '15 at 22:07
  • If it's about the method/recipe then probably better to migrate the question to [CrossValidated](http://stats.stackexchange.com/) – brown.2179 Dec 10 '15 at 15:03
  • You're right, I'll move it to that site – lenhhoxung Dec 10 '15 at 16:24

1 Answer

Jaccard similarity is commonly used to compare sets; in your case, the sets are built from the text. Each text is split into n-grams (shingled), and locality-sensitive hashing (for example, MinHash) can then be used to estimate the Jaccard similarity between documents efficiently.
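
As a rough illustration, here is a minimal Python sketch of the shingling step applied to the index vectors from the question. It computes the exact Jaccard similarity directly and does not implement locality-sensitive hashing, which only matters once you have many documents to compare. The `shingles` and `jaccard` helpers are illustrative names, and the shingle sizes are assumptions:

```python
def shingles(seq, n=2):
    """Return the set of contiguous n-grams (as tuples) in a sequence."""
    if len(seq) < n:
        # Assumption: a sequence shorter than n counts as one short shingle.
        return {tuple(seq)} if seq else set()
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


doc1 = [1, 4]     # "hello you"
doc2 = [2, 3, 4]  # "how are you"

# n=1 ignores word order; n=2 compares adjacent word pairs.
print(jaccard(shingles(doc1, 1), shingles(doc2, 1)))  # 0.25
print(jaccard(shingles(doc1, 2), shingles(doc2, 2)))  # 0.0
```

Because each document becomes a set of shingles, the two vectors no longer need to have the same length, and a larger `n` makes the comparison more sensitive to word order.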

There is a whole field dedicated to this - Google is your friend!

RPM