
We're given two strings that act as search queries. We need to determine if they're the same. For example:

Query 1: stock price rate

Query 2: share cost rate

We're also given a list in which each entry is a pair of words that are synonyms. A word can appear in more than one pair, which means the synonym relation is transitive. Something like this:

[

[cost,price]

[rate,price]

[share,equity]

]

The goal is to determine whether the queries mean the same thing.

I proposed a solution in which I group words with similar meanings into lists, search those lists exhaustively until I find the word from query 1, and then search its group for the word from query 2. The interviewer wanted a more efficient approach, which I couldn't figure out. Is there a more efficient way to solve this problem?

Prune
NoobScript
  • How about union-find? – harmands Feb 06 '20 at 18:31
  • Give each meaning an identifier, then build a hash table {word -> meaning_id}, so that synonyms map to the same meaning id. You can then tell whether two queries are similar in `O(number of words in queries)`, after a precomputation in `O(total number of words)`. – m.raynal Feb 06 '20 at 18:33
  • What about repetition in the input? For example, your dictionary says that price is equivalent to rate; so would "stock price rate" match "stock price"? Or more simply, would "stock price price" match "stock price"? What about order? Is "stock price" a match for "price stock"? – erickson Feb 06 '20 at 22:06
  • @erickson We'll probably have to deal with duplicates using a hash-set-like data structure, and since (I'm assuming this) search engines use individual keywords to form results, the order does not matter. – NoobScript Feb 08 '20 at 18:00
  • My question was for clarification on the requirements. From what I can glean from your response, the query should be considered an unordered set of keywords. Real search engines vary, but many search interfaces treat keyword order as a factor in scoring results. It’s less common for them to expose the ability to boost a keyword’s weight to the user, but a naive implementation might boost a score due to repeated keywords. As an interviewer, I’d be disappointed if a candidate didn’t ask for this kind of clarification, and touch on some of these issues. – erickson Feb 08 '20 at 18:07

1 Answer


Here is a solution that tells whether two queries are similar in time linear in their length, O(size of queries), with each word lookup taking near-constant time, after a precomputation in O(number of words in the database).

Precomputing: we assume that you have a list L of lists of synonyms.

function build_hashmap(L):
    H <- new Hashmap()
    i <- 0
    for each synonyms_list in L do:
        for each word in synonyms_list do:
            H[word] <- i
        i <- i+1
    return H

Now we can test whether two words are synonyms using H:

function is_synonym(w1, w2, H):
    return H[w1] == H[w2]

From there it should be rather easy to tell if two queries have the same meaning.
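As a rough illustration of that last step, here is a minimal Python sketch. It assumes a mapping from word to group id has already been built correctly, that queries are compared word by word in order, and that a word missing from the mapping is a synonym only of itself. None of these assumptions come from the question, and the names `same_meaning` and `group_of` are hypothetical.

```python
def same_meaning(query1, query2, group_of):
    """Compare two queries word by word using a word -> group id mapping."""
    words1, words2 = query1.split(), query2.split()
    # Assumption: queries of different lengths never match.
    if len(words1) != len(words2):
        return False
    for w1, w2 in zip(words1, words2):
        # A word missing from the mapping is treated as its own group.
        if group_of.get(w1, w1) != group_of.get(w2, w2):
            return False
    return True


# Hand-written mapping for the synonym list in the question.
group_of = {"cost": 0, "price": 0, "rate": 0, "share": 1, "equity": 1}
print(same_meaning("equity price", "share cost", group_of))
print(same_meaning("stock price rate", "share cost rate", group_of))
```

With the question's synonym list, the second call reports a mismatch, because "stock" is not listed as a synonym of "share". If the queries should instead be treated as unordered sets of keywords, comparing the sets of group ids would be the natural variation.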

Edit:
A fast solution could be to implement the union-find algorithm in order to build the hashmap.

Another way would be to model the words as vertices of a graph and add an edge for each synonym relation. You can then build your hashmap by finding the connected components of the graph, which can be done by traversing it.
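A minimal Python sketch of the union-find idea might look like this. It assumes the input is a list of two-word pairs, as in the question, and uses the representative word of each set as the group id. The function name `build_groups` is hypothetical.

```python
def build_groups(pairs):
    """Map each word to a representative of its synonym group (union-find)."""
    parent = {}

    def find(word):
        # Register unseen words as their own parent.
        parent.setdefault(word, word)
        root = word
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every word on the path straight at the root.
        while parent[word] != root:
            parent[word], word = root, parent[word]
        return root

    for a, b in pairs:
        # Union: attach the root of a's set to the root of b's set.
        parent[find(a)] = find(b)

    # Flatten so that each word maps directly to its group representative.
    return {word: find(word) for word in list(parent)}


groups = build_groups([("cost", "price"), ("rate", "price"), ("share", "equity")])
print(groups["cost"] == groups["rate"])    # linked through "price"
print(groups["share"] == groups["price"])  # different groups
```

Because the union step merges whole sets, transitive links such as cost → price → rate end up under one representative. The sketch omits union by rank for brevity.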

m.raynal
  • This doesn't handle transitive relationships in synonym lists like [a,b],[c,d],[a,c]. You end up with H[b]!=H[d]. – Matt Timmermans Feb 07 '20 at 02:12
  • @NoobScript, you're free to assume that if you want, but it certainly doesn't say so, and it wouldn't really make sense to do this hash mapping if you already implemented union-find. – Matt Timmermans Feb 08 '20 at 18:52
  • I can confirm what @MattTimmermans said: the algorithm I proposed does not handle transitive relationships, and would lead to incorrect results (I did not pay enough attention to the problem's description). Implementing union-find, or modeling the words as a graph and finding its connected components would give correct results. – m.raynal Feb 08 '20 at 21:12