Finding and suggesting most similar queries from a query log

Question

Given a query log of about 10 million queries, I have to write a program that asks the user for a query and displays the 10 queries from the log that are most similar to it. In case of spelling mistakes, it could also suggest the correct spelling.

In this context I have studied a few tutorials on Locality Sensitive Hashing, but I cannot understand how to apply it to this problem. At first I was thinking of sorting the log lexicographically, but given the size of the log I don't think that is a good idea, as it may not be efficient to load the whole log into memory.

Can anyone please suggest an approach to this problem? Thank you.

score 0 · Answer 1 · answered Feb 20 '14 at 01:51

If you want to parallelize the processing, you will definitely want to look at MinHash Clustering in Mahout. The basic steps are:

1. Generate shingles (n-grams with an appropriate n)
2. Generate MinHash signatures
3. Run LSH
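
To make those three steps concrete, here is a minimal single-machine sketch in Python. It is not Mahout's API and not a tuned implementation: the names `shingles`, `MinHashLSH` and `most_similar`, the choice of character 3-grams, and the 64 hashes split into 32 bands are all assumptions picked for illustration. Candidates that share at least one LSH bucket with the input are re-ranked by exact Jaccard similarity of their shingle sets.

```python
import random
import zlib
from collections import defaultdict

PRIME = (1 << 61) - 1  # large prime for the (a*x + b) mod p hash family


def shingles(text, n=3):
    """Step 1: character n-grams of a lowercased, whitespace-normalized query."""
    text = " ".join(text.lower().split())
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a or b else 0.0


class MinHashLSH:
    """Hypothetical in-memory index; parameter values are illustrative only."""

    def __init__(self, num_hashes=64, bands=32, seed=42):
        assert num_hashes % bands == 0
        rng = random.Random(seed)
        self.coeffs = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                       for _ in range(num_hashes)]
        self.bands = bands
        self.rows = num_hashes // bands
        self.buckets = [defaultdict(set) for _ in range(bands)]
        self.queries = []

    def signature(self, shingle_set):
        """Step 2: MinHash signature, one minimum per hash function."""
        # crc32 gives a stable integer id per shingle across runs.
        ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
        return [min((a * x + b) % PRIME for x in ids) for a, b in self.coeffs]

    def _band_keys(self, sig):
        for i in range(self.bands):
            yield i, tuple(sig[i * self.rows:(i + 1) * self.rows])

    def add(self, query):
        """Step 3a: put the query into one bucket per band."""
        idx = len(self.queries)
        self.queries.append(query)
        for i, key in self._band_keys(self.signature(shingles(query))):
            self.buckets[i][key].add(idx)

    def most_similar(self, query, k=10):
        """Step 3b: collect bucket collisions, then rank by exact Jaccard."""
        target = shingles(query)
        candidates = set()
        for i, key in self._band_keys(self.signature(target)):
            candidates |= self.buckets[i].get(key, set())
        scored = [(jaccard(target, shingles(self.queries[c])), self.queries[c])
                  for c in candidates]
        scored.sort(key=lambda t: (-t[0], t[1]))
        return scored[:k]


if __name__ == "__main__":
    index = MinHashLSH()
    for q in ["cheap flights to london", "cheap flight to london",
              "weather in paris", "python list comprehension"]:
        index.add(q)
    # A misspelled input still shares most of its 3-grams with the log entry.
    for score, q in index.most_similar("cheep flights to london"):
        print(round(score, 2), q)
```

Because the shingles are character n-grams, a misspelled query overlaps heavily with its correctly spelled version, so the same lookup can double as a rough spelling suggestion. With more bands and fewer rows per band, less similar queries become candidates, at the cost of larger candidate sets. For the full 10 million queries the signatures and buckets may not fit comfortably in one process, which is where a distributed implementation such as the Mahout one comes in.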

Very detailed information on LSH can be found here: Mining Massive Datasets
