I have the following two texts:
text0 = "AAAAAAAAAAAA";
text1 = "AAAAABAAAAAA";
I use 4-shingles. As sets of shingles, text0 = {AAAA} and text1 = {AAAA, AAAB, AABA, ABAA, BAAA}.
Then, the Jaccard similarity is sim = 1/5 = 0.2.
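For reference, here is a minimal sketch of how I get that number. The helper names `shingles` and `jaccard` are just mine for illustration, not from any library.

```python
def shingles(text, k=4):
    """Return every k-character shingle of text, in order (duplicates kept)."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def jaccard(a, b):
    """Plain set Jaccard: duplicates are discarded before comparing."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


text0 = "AAAAAAAAAAAA"
text1 = "AAAAABAAAAAA"

set_sim = jaccard(shingles(text0), shingles(text1))
print(set_sim)  # 0.2
```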
I do not want this result, because the two texts are clearly very similar.
I want to use bag (multiset) similarity instead, as follows:
text0 = {AAAA, AAAA, AAAA, AAAA, AAAA, AAAA, AAAA, AAAA, AAAA},
text1 = {AAAA, AAAA, AAAB, AABA, ABAA, BAAA, AAAA, AAAA, AAAA}.
With these two bags, the similarity is sim = 5/9, which is much higher than 0.2.
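To make the bag computation concrete, here is a rough sketch using `collections.Counter` as the multiset. I am assuming 5/9 means "size of the multiset intersection divided by the bag size" (both bags hold 9 shingles). For comparison I also compute the min/max form of bag Jaccard, which gives 5/13 on the same data. The function names are my own.

```python
from collections import Counter


def shingle_bag(text, k=4):
    """Multiset of k-character shingles: shingle -> number of occurrences."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def bag_overlap(a, b):
    """Multiset intersection size divided by the larger bag size (assumed reading of 5/9)."""
    inter = sum((a & b).values())  # Counter & keeps the minimum count per shingle
    return inter / max(sum(a.values()), sum(b.values()))


def weighted_jaccard(a, b):
    """Min/max bag Jaccard: sum of min counts over sum of max counts."""
    inter = sum((a & b).values())
    union = sum((a | b).values())  # Counter | keeps the maximum count per shingle
    return inter / union


bag0 = shingle_bag("AAAAAAAAAAAA")
bag1 = shingle_bag("AAAAABAAAAAA")

overlap = bag_overlap(bag0, bag1)      # 5/9
weighted = weighted_jaccard(bag0, bag1)  # 5/13
print(overlap, weighted)
```

Either way the score is well above the 0.2 from the set version, because the repeated AAAA shingles are no longer collapsed into one element.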
Can MinHash do this?
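One idea I am aware of is to tag every shingle with its occurrence number, so the k-th AAAA becomes the distinct element (AAAA, k), and then run ordinary MinHash on the tagged sets. The set Jaccard of the tagged sets equals the min/max bag similarity (5/13 here, not 5/9). Below is a minimal sketch of that idea. The helper names, the blake2b-based hash family, and the choice of 256 hash functions are all my own assumptions, not a standard API.

```python
import hashlib
from collections import Counter


def indexed_shingles(text, k=4):
    """Set of (shingle, occurrence_index) pairs, so repeated shingles stay distinct."""
    seen = Counter()
    out = set()
    for i in range(len(text) - k + 1):
        s = text[i:i + k]
        out.add((s, seen[s]))
        seen[s] += 1
    return out


def seeded_hash(seed, item):
    """One member of a hypothetical hash family: 64-bit blake2b digest keyed by seed."""
    data = f"{seed}:{item[0]}:{item[1]}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """For each hash function, keep the minimum hash value over all items."""
    return [min(seeded_hash(seed, it) for it in items) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates set Jaccard."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


set0 = indexed_shingles("AAAAAAAAAAAA")
set1 = indexed_shingles("AAAAABAAAAAA")

exact = len(set0 & set1) / len(set0 | set1)  # 5/13
approx = estimate_similarity(minhash_signature(set0), minhash_signature(set1))
print(exact, approx)
```

The estimate should land near 5/13 and tighten as `num_hashes` grows. I am not sure whether there is a MinHash variant that targets the 5/9 form directly.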