0

Let's say an internet user searches for "trouble with gmail".

How can I return entries with "problem|problems|issues|issue|trouble|troubles with gmail|googlemail|google mail"?

I don't want to add the links between different keywords manually, so the relations "issue <> problem <> trouble" and "gmail <> googlemail <> google mail" are completely unknown up front. They should be found by an automated process.

Approach to solve the problem
I could run a synonym/thesaurus platform like thesaurus.com, synonym.com, etc., or use a synonym database/API, and use this user-generated input for my queries on a third website.

But this won't cover all synonyms like the "gmail"-example.

Which other options do I have? Maybe something based on the given data and logged search phrases of the past?

mgutt

5 Answers

1

This is a bit long for a comment.

What you are looking for is called a "thesaurus" or "synonyms" list in the world of text searching. Apparently, there is a proposal for such functionality in MySQL. It is not yet implemented. (Here is a related question on Stack Overflow, although the link in the question doesn't seem to work.)

The work-around would be to modify queries before sending them to the database. That is, parse the query into words, then look up all the synonyms for those words, and reconstruct the query. This works better for the natural language searches than the boolean searches (which require more careful reconstruction).
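The application-side part of this work-around can be sketched in Python. This is a minimal sketch, not a definitive implementation: `SYNONYM_GROUPS` and `expand_query` are hypothetical names, and the hand-filled groups stand in for whatever synonym source is actually available. It produces a flat word list suitable for a natural language search, and it only matches single-word input tokens.

```python
import re

# Hypothetical synonym groups (assumption): in practice these would come
# from a thesaurus table, an API, or a learned list.
SYNONYM_GROUPS = [
    {"problem", "problems", "issue", "issues", "trouble", "troubles"},
    {"gmail", "googlemail", "google mail"},
]


def expand_query(query):
    """Return the query with each word followed by its known synonyms."""
    expanded = []
    for word in re.findall(r"\w+", query.lower()):
        # Fall back to a one-word group when no synonyms are known.
        group = next((g for g in SYNONYM_GROUPS if word in g), {word})
        # Keep the user's own word first, then the synonyms in sorted order.
        for term in [word] + sorted(group - {word}):
            if term not in expanded:
                expanded.append(term)
    return " ".join(expanded)


print(expand_query("trouble with gmail"))
```

The resulting string could then be passed as the search expression of a natural language full-text query. A boolean search would need the terms grouped and quoted more carefully.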

Pseudo-code for getting the final word list with synonyms would be something like:

-- @words is a comma-separated list without spaces, e.g. 'trouble,gmail'
select group_concat(s.synonyms separator ' ') into @finalwords
from synonyms s
where find_in_set(s.baseword, @words) > 0;
Gordon Linoff
  • The query itself is not the problem. My problem is how I'm able to automatically generate a "synonyms"-list. I'll update my question. – mgutt Feb 27 '15 at 13:56
  • @mgutt . . . You don't. Synonyms are usually manual and domain specific, although undoubtedly there are lists available online. – Gordon Linoff Feb 28 '15 at 00:15
  • I've updated my question. I want synonyms that are not part of a database, too, like brands, product names, etc. – mgutt Mar 02 '15 at 13:48
1

Seems to me that you have two problems on your hands:

  1. Lemmatisation, which reduces words to their lemma, sometimes called the headword or root word. This is more difficult than stemming, as it doesn't just chop suffixes off words but tries to find a true root, e.g. "are" => "be". This is often done programmatically, although it appears to be a complex task. Here is an online example of text being lemmatised: http://lemmatise.ijs.si/Services

  2. Searching for synonymous lemmas. This is a very complex problem. One approach I have heard of is modifying the lemmatisation engine to return more than one lemma for a given word, i.e. "problems" => "problem" and "issue", thereby allowing a more flexible set of results. However, this means that the synonymous lemmas must be provided to the lemmatisation engine from elsewhere. I truly have no idea how you would build a list of synonyms programmatically.

So, you may consider a strategy whereby you lemmatise the text to be searched for, then pass each lemma out to your synonym finder (however that works) to get a final list of lemmas to perform your search with.
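A toy version of that two-stage strategy, assuming hand-filled lookup tables in place of a real lemmatisation engine and a real synonym finder (`LEMMAS`, `LEMMA_SYNONYMS`, `lemmatise` and `search_lemmas` are all hypothetical names for illustration):

```python
# Hypothetical surface-form -> lemma table (assumption); a real system
# would use a lemmatisation engine instead.
LEMMAS = {
    "problems": "problem",
    "issues": "issue",
    "troubles": "trouble",
    "are": "be",
}

# Hypothetical lemma -> synonymous lemmas table (assumption); this is the
# part that has to be supplied "from elsewhere".
LEMMA_SYNONYMS = {
    "problem": {"issue", "trouble"},
    "issue": {"problem", "trouble"},
    "trouble": {"problem", "issue"},
}


def lemmatise(word):
    """Map a word to its lemma, or return it lower-cased if unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)


def search_lemmas(text):
    """Lemmatise each word, then add the synonymous lemmas of each lemma."""
    result = set()
    for word in text.split():
        lemma = lemmatise(word)
        result.add(lemma)
        result |= LEMMA_SYNONYMS.get(lemma, set())
    return result


print(sorted(search_lemmas("problems with gmail")))
```

The indexed text would have to be lemmatised the same way for the final lemma set to match it.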

I think you have bitten off a very large problem for yourself.

Paul Griffin
1

You have to think about it independently of the language.

When you show a baby the same thing using two different words, he understands that those words are synonyms. He might not understand it perfectly the first time, but he will learn it through repetition.

You type "problem with gmail".

Two choices:

  1. Your search gives results and you click on one item.

The system identifies that this item was already clicked before by someone searching for "google mail bug". That's a match, and we will call it a "relative search".

  2. Your search gives poor results.

We search our history for a matching search and propose: "Did you mean trouble with yahoo mail? yes/no". You click no; that's a "no match". We might then propose other suggestions, such as a list of known "relative searches" or a list of possibly related searches found by combining full-text search over our history with Levenshtein distance.
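The first case can be sketched as a small co-click counter. This is only an illustration under stated assumptions: the log format (pairs of query string and clicked item id), the function name `relative_searches` and the `min_shared_clicks` threshold are all hypothetical, and the "yes/no" feedback from the second case is not modelled.

```python
from collections import defaultdict
from itertools import combinations


def relative_searches(click_log, min_shared_clicks=1):
    """Score pairs of queries by how many clicked items they share.

    click_log is an iterable of (query, clicked_item_id) pairs
    (an assumed log format).
    """
    queries_by_item = defaultdict(set)
    for query, item in click_log:
        queries_by_item[item].add(query.lower())

    score = defaultdict(int)
    for queries in queries_by_item.values():
        # Every pair of queries that led to the same item counts as a match.
        for a, b in combinations(sorted(queries), 2):
            score[(a, b)] += 1

    return {pair: n for pair, n in score.items() if n >= min_shared_clicks}


log = [
    ("problem with gmail", 1),
    ("google mail bug", 1),
    ("problem with gmail", 2),
    ("google mail bug", 2),
    ("trouble with yahoo mail", 3),
]
print(relative_searches(log, min_shared_clicks=2))
```

Raising `min_shared_clicks` trades coverage for confidence: a single shared click can easily be noise.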

When a term has scored high enough to be considered a "synonym", you can treat it as one. The algorithm might be wrong, but it really depends on what you expect.

If I search "sending a message is difficult with google" and "gmail issue", none of the words are synonyms, but the searches are effectively the same. This is more important to me than true synonyms.

And if you really want to get the synonyms, I would do it in a second phase by comparing words inside "relative searches", and would include a manual check.

I think Google's algorithm uses synonyms mainly to highlight search terms in the result page, not to perform the actual search, for which it uses relative search terms except in known situations; the results for "gmail" and "google mail" are not the same.

But if you identify 10 relative searches for "gmail" that all contain "google mail", that is a good starting point for guessing they are synonyms.
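One possible way to do that second phase, sketched under the assumption that pairs of relative searches are already available: drop the words both searches share and count what is left on each side. The function name `synonym_candidates` and the `min_count` threshold are hypothetical, and the output is only a list of candidates for the manual check.

```python
from collections import Counter


def synonym_candidates(related_pairs, min_count=2):
    """Guess synonym pairs from pairs of related search phrases."""
    counts = Counter()
    for a, b in related_pairs:
        words_a, words_b = a.lower().split(), b.lower().split()
        shared = set(words_a) & set(words_b)
        # What remains after removing the shared words is the part of each
        # search that differs, e.g. "gmail" versus "google mail".
        rest_a = " ".join(w for w in words_a if w not in shared)
        rest_b = " ".join(w for w in words_b if w not in shared)
        if rest_a and rest_b:
            counts[tuple(sorted((rest_a, rest_b)))] += 1
    return [pair for pair, n in counts.most_common() if n >= min_count]


pairs = [
    ("gmail issue", "google mail issue"),
    ("gmail login", "google mail login"),
    ("gmail slow", "yahoo mail slow"),
]
print(synonym_candidates(pairs))
```

Here "gmail" and "yahoo mail" show up together only once and are filtered out, while "gmail" and "google mail" pass the threshold.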

Adam
  • Ok, so you think a search engine like Google learns from users, through the "similar search" proposals, which terms are "relative" synonyms and which are not. This is interesting and could be a possible solution. I think you have given the best answer to solve my problem. Thank you! – mgutt Mar 15 '15 at 19:52
  • Google learns a lot from users' experience. And developers like us can learn a lot about search engines from Google's experience. Of course they have a lot of servers, but they have to make pragmatic choices to extract the best result in a short time from billions of records, and in many languages. – Adam Mar 15 '15 at 21:37
0

If the system in question is a publicly accessible website, one 'out there' option is to ensure all content can be crawled by Google and then use a Google search on your own site, which should give you the synonym capability 'for free'. There would obviously be some vagaries in the results though and lag in getting match results for newly created content, depending upon how regularly the crawlers hit the site. Probably not suitable in your use case, but for some people, this may be sufficient.

John Rix
  • Google is not free: https://www.google.com/work/search/products/ Maybe you mean the Adsense search results, but this is no option for me. – mgutt Feb 27 '15 at 12:48
  • I was referring to the Site Search option rather than the appliance... I haven't followed through to the details of it bar the 'Sign Up' link, but there is no mention of a cost associated with that. Incidentally, you can search a specific site from any Google search box simply by prefixing the search with 'site:' - you could potentially do this from your own page... not sure. – John Rix Feb 27 '15 at 13:18
  • By the way, the 'for free' in my answer was not a reference specifically to Google's services having no cost (though I was not considering that either way), but rather the fact that Google does synonym searches for you automatically. – John Rix Feb 27 '15 at 13:21
  • Ok, but ultimately I don't want to share all my content or be dependent on a third-party service. – mgutt Feb 27 '15 at 13:52
  • Yep, agreed. It isn't for everyone by any means, but it does at least address the core question you were asking, which is why I posted it for general reference. – John Rix Feb 27 '15 at 14:17
0

Seeing your revised question, what about using a public API?

http://www.programmableweb.com/category/reference/apis?category=20066&keyword=synonym

John Rix
  • I've updated my question. I want synonyms that are not part of a database, too, like brands, product names, etc. – mgutt Mar 02 '15 at 13:48
  • That's a tough ask IMO. I don't know if Google or other search engines provide an API for fetching 'similar search strings' or not, but I would say trying to roll your own could be challenging, especially without a sufficiently huge sample set to work from if nothing else. Perhaps you have some means to crowd-source such relationships though? – John Rix Mar 02 '15 at 14:48