What does the bucket_count setting correspond to? Does this mean that the minhashes are further hashed to values between 1 and bucket_count-1?
Would generating minhashes in the following scenario result in any speedup?
Case: Index 10 million documents, where each document is just a set of feature indices. The total number of possible indices is 10000, so a document could look like A={1,5,7,500,750...9800}. Moreover, all the documents/sets have the same fixed length (let's say 196). In this case, retrieving the documents most similar to a document A would mean scanning all 10 million documents to find those with the largest overlap of indices.
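To make the baseline concrete, here is a minimal sketch of the brute-force scan described above. The names (`top_overlaps`, `NUM_DOCS`, and so on) are made up for illustration, the data is random, and the collection is scaled down from 10 million to 1,000 documents so it runs quickly.

```python
import heapq
import random

NUM_FEATURES = 10_000   # total number of possible feature indices
SET_SIZE = 196          # every document has exactly this many features
NUM_DOCS = 1_000        # scaled down from 10 million for illustration


def top_overlaps(query, docs, k=5):
    """Return the k (overlap, doc_id) pairs with the largest overlap.

    This compares the query against every document, so the cost grows
    linearly with the number of documents.
    """
    scored = ((len(query & doc), doc_id) for doc_id, doc in enumerate(docs))
    return heapq.nlargest(k, scored)


rng = random.Random(0)
docs = [
    frozenset(rng.sample(range(1, NUM_FEATURES + 1), SET_SIZE))
    for _ in range(NUM_DOCS)
]

query = docs[0]
best = top_overlaps(query, docs, k=3)
print(best)  # the first entry is the query document matching itself
```

Because every set has the same size, ranking by raw overlap gives the same order as ranking by Jaccard similarity.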
Will using minhashes speed up the above similarity retrieval? The reason this is confusing is that the original documents/sets are all fairly small -- 196 features.
Minhash tokenization with the default bucket count of 512 would generate a token set that is 512 long, which is longer than the original document (196, as described above).
In such a scenario, would minhash actually help speed up retrieval in any way?
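For reference, this is a minimal sketch of textbook MinHash using a fixed number of independent hash functions. It is an assumption-laden illustration, not the scheme behind bucket_count: the hash family, the choice of 128 hashes, and all function names are hypothetical. It only illustrates that the signature length is set by the number of hashes, not by the size of the document.

```python
import random

PRIME = (1 << 61) - 1  # large prime modulus for the illustrative hash family


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, PRIME), rng.randrange(0, PRIME))
        for _ in range(num_hashes)
    ]


def minhash_signature(features, params):
    """One minimum per hash function; the length equals len(params)."""
    return [min((a * x + b) % PRIME for x in features) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


params = make_hash_params(128)
rng = random.Random(1)
universe = range(1, 10_001)

# doc_b keeps 150 of doc_a's 196 features and adds 46 new ones.
doc_a = set(rng.sample(universe, 196))
shared = set(rng.sample(sorted(doc_a), 150))
extra = set(rng.sample([x for x in universe if x not in doc_a], 46))
doc_b = shared | extra

true_jaccard = len(doc_a & doc_b) / len(doc_a | doc_b)  # 150 / 242
sig_a = minhash_signature(doc_a, params)
sig_b = minhash_signature(doc_b, params)
estimate = estimate_jaccard(sig_a, sig_b)

print(len(doc_a), len(sig_a))   # set size vs. signature length
print(true_jaccard, estimate)   # exact vs. estimated similarity
```

In this sketch a 196-element set turns into a 128-value signature, so the representation is not much smaller than the original; with more hashes than features it would be larger, which is the situation the question is about.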