0

Given a query log of about 10 million queries I have to write a program that will ask query from the user and display most similar 10 queries to the input query as a output. Also in case of spelling mistakes it may suggest the correct spellings.

In this context I have studied a few tutorials on Locality Sensitive Hashing but can not understand how can I apply it in this problem. First I was thinking of sorting the log lexicographically. But I don't think it will be good idea to sort the log as far as size of the log is concerned as it may not be efficient to load the whole log into memory.

So can please anyone suggest me any idea to approach the problem. Thank you.

Joy
  • 4,197
  • 14
  • 61
  • 131

1 Answers1

0

You would definitely want to look at this if you want to parallelize the processing. Minhash Clustering in Mahout

  1. Generate shingles (n-grams with appropriate n)
  2. Generate MinHash
  3. Run LSH

Very detailed information on LSH can be found here: Mining Massive Datasets

jaguarpaw
  • 122
  • 7