I have the following two texts:

text0 = "AAAAAAAAAAAA";

text1 = "AAAAABAAAAAA";

I use 4-shingles, so as sets text0 = {AAAA} and text1 = {AAAA, AAAB, AABA, ABAA, BAAA}.

Then, the Jaccard similarity is sim = 1/5 = 0.2.
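For reference, the set-based computation can be reproduced with a minimal Python sketch. The helper names `shingle_set` and `jaccard` are made up for this illustration and do not come from any library.

```python
def shingle_set(text, k=4):
    """Return the set of distinct k-character shingles in text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(a | b)


text0 = "AAAAAAAAAAAA"
text1 = "AAAAABAAAAAA"

s0 = shingle_set(text0)
s1 = shingle_set(text1)
print(sorted(s0))       # the single distinct shingle of text0
print(sorted(s1))       # the five distinct shingles of text1
print(jaccard(s0, s1))  # 1 shared shingle out of 5 distinct ones
```

Because a set drops repeated shingles, the nine occurrences of AAAA in text0 count only once, which is why the score is so low.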


I do not want this result, because the two texts are obviously very similar.

I want to use bag similarity instead, as follows:

text0 = {AAAA, AAAA, AAAA, AAAA, AAAA, AAAA, AAAA, AAAA, AAAA},

text1 = {AAAA, AAAA, AAAB, AABA, ABAA, BAAA, AAAA, AAAA, AAAA}.

With these two bags, the similarity is sim = 5/9, which is much higher than 0.2.
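A small Python sketch of the bag comparison, using `collections.Counter` as the multiset. The helper names are made up for this illustration. Note that the result depends on the normalization: 5 shared shingles out of 9 per bag gives 5/9, while the generalized Jaccard (sum of minimum counts over sum of maximum counts) gives 5/13.

```python
from collections import Counter


def shingle_bag(text, k=4):
    """Return the multiset (bag) of k-character shingles in text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def bag_jaccard(a, b):
    """Generalized Jaccard: sum of min counts / sum of max counts."""
    intersection = sum((a & b).values())  # Counter & keeps the minimum count
    union = sum((a | b).values())         # Counter | keeps the maximum count
    return intersection / union if union else 1.0


b0 = shingle_bag("AAAAAAAAAAAA")
b1 = shingle_bag("AAAAABAAAAAA")

shared = sum((b0 & b1).values())
bag_size = sum(b0.values())       # both bags hold 9 shingles here

print(shared, bag_size)           # 5 9
print(shared / bag_size)          # 5/9, the figure quoted above
print(bag_jaccard(b0, b1))        # 5/13, roughly 0.385
```

Either way the bag-based score is well above the set-based 0.2.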

Can MinHash do this?

Yuansheng liu

2 Answers

1

For bags you can use weighted minwise hashing, see

S. Ioffe, Improved consistent sampling, weighted minhash and l1 sketching, 2010

or

A. Shrivastava, Simple and Efficient Weighted Minwise Hashing, 2016.
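To give a rough feel for what weighted minwise hashing does, here is a Python sketch modeled on the consistent weighted sampling scheme from the Ioffe paper. It is an illustration written for this example, not code from either paper, so check the papers for the authoritative algorithm. The function names, the per-element seeding through `random.Random`, and the choice of 512 hashes are all assumptions.

```python
import math
import random
from collections import Counter


def icws_sample(weights, seed):
    """Draw one consistent weighted sample (element, t) from a non-empty bag."""
    best = None
    for element, weight in weights.items():
        if weight <= 0:
            continue
        # Seed on (hash index, element) so that two documents draw the same
        # random numbers for the same element.
        rng = random.Random(f"{seed}:{element}")
        r = rng.gammavariate(2.0, 1.0)
        c = rng.gammavariate(2.0, 1.0)
        beta = rng.random()
        t = math.floor(math.log(weight) / r + beta)
        y = math.exp(r * (t - beta))
        a = c / (y * math.exp(r))
        if best is None or a < best[0]:
            best = (a, element, t)
    return best[1], best[2]


def weighted_minhash(weights, num_hashes=512):
    """Signature made of one consistent weighted sample per hash index."""
    return [icws_sample(weights, seed) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where both signatures picked the same sample."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def shingle_bag(text, k=4):
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


bag0 = shingle_bag("AAAAAAAAAAAA")
bag1 = shingle_bag("AAAAABAAAAAA")
estimate = estimate_similarity(weighted_minhash(bag0), weighted_minhash(bag1))
print(estimate)  # an estimate of the weighted Jaccard similarity (5/13 here)
```

The estimate is noisy; more hashes give a tighter estimate at the cost of a longer signature.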

If the multiplicities are always small integers, you could also use unweighted minwise hashing by making the entries unique, e.g. by numbering the repeated occurrences of each shingle:

text0 = {AAAA1, AAAA2, AAAA3, AAAA4, AAAA5, AAAA6, AAAA7, AAAA8, AAAA9},

text1 = {AAAA1, AAAA2, AAAB1, AABA1, ABAA1, BAAA1, AAAA3, AAAA4, AAAA5}.
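The numbering trick can be sketched in Python as follows, feeding the numbered shingles into a plain MinHash. The helper names, the use of SHA-1 with a seed prefix as the hash family, and the 256 hashes are assumptions for this example. With ordinary Jaccard, the numbered sets share 5 of 13 distinct entries.

```python
import hashlib
from collections import Counter


def numbered_shingles(text, k=4):
    """Turn the bag of k-shingles into a set by appending an occurrence number."""
    seen = Counter()
    numbered = set()
    for i in range(len(text) - k + 1):
        shingle = text[i:i + k]
        seen[shingle] += 1
        numbered.add(f"{shingle}{seen[shingle]}")
    return numbered


def minhash_signature(items, num_hashes=256):
    """Plain MinHash: for each seed keep the smallest hash over all items."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where both sets have the same minimum."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


n0 = numbered_shingles("AAAAAAAAAAAA")
n1 = numbered_shingles("AAAAABAAAAAA")

exact = len(n0 & n1) / len(n0 | n1)
approx = estimate_jaccard(minhash_signature(n0), minhash_signature(n1))
print(exact)   # 5/13, roughly 0.385
print(approx)  # MinHash estimate of the same quantity
```

Since each entry is now unique, the ordinary MinHash machinery works unchanged; only the shingling step differs.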

otmar
  • Thank you so much. I will have a look at these two papers. – Yuansheng liu Sep 12 '17 at 07:49
  • Making entries unique by numbering is a bad idea. That would mean no similarity is detected between "ABCDEFGHIJKLMNOPQRSTUVWXYZ" and "BCDEFGHIJKLMNOPQRSTUVWXYZ". – Ben Whitmore Sep 26 '17 at 01:01
  • For your example we would have text0 = {ABCD1, BCDE1, CDEF1,...} text1 = {BCDE1, CDEF1, DEFG1,...} which clearly have common elements. – otmar Sep 26 '17 at 04:39
  • Ah, I see what you're doing. That's sensible. Note that by this approach, when comparing two similar documents, one of which duplicates some of the shared text and the other of which doesn't, the similarity score will be more severely penalized by that duplication than it would in the traditional approach. That may be desirable though; it really depends on what you consider "similarity" to mean. – Ben Whitmore Jul 11 '19 at 02:38
0

Another simple way to improve the similarity score for very short texts is to also generate shorter shingles at the beginning and end of the document, using a special character to mark the boundary.

In this case, your shingles generated from text0 are: {@A, @AA, @AAA, AAAA, AAA@, AA@, A@}

and those from text1 are: {@A, @AA, @AAA, AAAA, AAAB, AABA, ABAA, BAAA, AAA@, AA@, A@}.

The Jaccard similarity is now 7/11 ≈ 0.64.
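A minimal Python sketch of this boundary-shingle idea. The function name `boundary_shingles` is made up for the example, and it assumes the marker character `@` never occurs in the text and that the text has at least k - 1 characters.

```python
def boundary_shingles(text, k=4, marker="@"):
    """k-shingles plus shorter, marker-tagged shingles at both ends."""
    shingles = {text[i:i + k] for i in range(len(text) - k + 1)}
    for n in range(1, k):
        shingles.add(marker + text[:n])   # e.g. "@A", "@AA", "@AAA"
        shingles.add(text[-n:] + marker)  # e.g. "A@", "AA@", "AAA@"
    return shingles


s0 = boundary_shingles("AAAAAAAAAAAA")
s1 = boundary_shingles("AAAAABAAAAAA")

similarity = len(s0 & s1) / len(s0 | s1)
print(len(s0), len(s1))  # 7 11
print(similarity)        # 7/11, roughly 0.636
```

The boundary shingles add six shared entries, which matters a lot for short texts and very little for long ones.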

This really comes down to a philosophical question about what "similarity" means to you: which features do you or don't you consider important to include?

Ben Whitmore