
Python provides the NLTK library, a vast resource of corpora along with a slew of text mining and processing methods. Is there any way to compare two sentences based on the meaning they convey and decide whether they match? That is, an intelligent sentence matcher?

For example, take a sentence like giggling at bad jokes and another like I like to laugh myself silly at poor jokes. Both convey the same meaning, but the sentences don't remotely match textually (the words are different, so Levenshtein distance would fail badly!).
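To see just how badly plain edit distance fares on this pair, here is a quick self-contained sketch (pure Python; the helper name `levenshtein` is only for illustration):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

s1 = "giggling at bad jokes"
s2 = "I like to laugh myself silly at poor jokes"
# The distance is a large fraction of the longer string's length,
# even though the two sentences mean roughly the same thing.
print(levenshtein(s1, s2), max(len(s1), len(s2)))
```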

Now imagine we have an API which exposes functionality such as found here. So based on that, we have mechanisms to find out that the words giggle and laugh do match in the meaning they convey. But bad won't match up to poor, so we may need to add further layers (like matching words in the context of neighbouring words: a bad joke is generally the same as a poor joke, although a bad person is not the same as a poor person!).
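To illustrate the layering idea, here is a minimal sketch. The `SYNONYMS` and `CONTEXT_SYNONYMS` tables are hand-built, hypothetical stand-ins for a real thesaurus API like the one linked above; only the shape of the logic matters:

```python
# Hypothetical stand-ins for a real thesaurus/WordNet-style lookup:
SYNONYMS = {
    "giggle": {"laugh"},
    "laugh": {"giggle"},
}
# "bad" only lines up with "poor" in certain contexts (e.g. next to "joke"):
CONTEXT_SYNONYMS = {
    ("bad", "joke"): {"poor"},
}

def words_match(w1, w2, context=None):
    """True if the words are identical, plain synonyms,
    or synonyms within the given context word."""
    if w1 == w2:
        return True
    if w2 in SYNONYMS.get(w1, set()):
        return True
    if context and w2 in CONTEXT_SYNONYMS.get((w1, context), set()):
        return True
    return False

print(words_match("giggle", "laugh"))              # True
print(words_match("bad", "poor"))                  # False on its own...
print(words_match("bad", "poor", context="joke"))  # ...True next to "joke"
```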

A major challenge would be to discard stuff that doesn't much alter the meaning of the sentence. So the algorithm should return the same degree of match between the first sentence and this one: I like to laugh myself silly at poor jokes, even though they are completely senseless, full of crap and serious chances of heart-attack!
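One way to make the score immune to tacked-on clauses is to measure how much of the shorter sentence is covered, rather than a symmetric overlap. A minimal sketch, assuming hand-built stopword and synonym tables (a real system would use NLTK's stopword corpus and WordNet lookups instead):

```python
# Illustrative only: hand-built tables standing in for NLTK resources.
STOPWORDS = {"i", "like", "to", "at", "myself", "they", "are",
             "even", "though", "and", "of", "a", "the"}
SYNONYMS = {"giggling": "laugh", "poor": "bad"}  # map to a canonical form

def content_words(sentence):
    """Lowercase, strip punctuation, drop stopwords, normalize synonyms."""
    out = set()
    for w in sentence.lower().replace(",", "").replace("!", "").split():
        if w not in STOPWORDS:
            out.add(SYNONYMS.get(w, w))
    return out

def match_score(reference, candidate):
    """Fraction of the reference's content words covered by the candidate.
    Extra clauses in the candidate cannot lower the score."""
    ref = content_words(reference)
    return len(ref & content_words(candidate)) / len(ref) if ref else 0.0

a = "giggling at bad jokes"
b = "I like to laugh myself silly at poor jokes"
c = b + ", even though they are completely senseless"
print(match_score(a, b))  # 1.0: every content word of a is covered
print(match_score(a, c))  # 1.0: the tacked-on clause changes nothing
```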

So with that available, has any algorithm like this been conceived yet? Or do I have to reinvent the wheel?

SexyBeast
  • I looked into this a couple weeks ago. I'm no NLTK expert, but I think you're going to have to reinvent the wheel or find some fuzzy matching that's been built on top of NLTK. I couldn't find a solution, but I suspect there's something out there. I was looking to enable automatic grading of free-form text responses to review questions for test prep. Do post an update if you find anything. – jimhark Feb 13 '13 at 11:35
  • Sure thing. Will do. I was thinking of some graph-based algo, would it be equal to doing this? – SexyBeast Feb 13 '13 at 11:59
  • Guys, before enthusiastically downvoting or voting for closing, at least give an explanation. Just because it is anonymous and anybody can do anything, don't get high-handed.. – SexyBeast Feb 13 '13 at 13:19
  • NLTK would help you make the graph. But I couldn't find anything ready-made to do fuzzy matching between two graphs, and creating a robust implementation would be nontrivial. – jimhark Feb 13 '13 at 19:16
  • Hmmm, I guessed as much. You two mention `fuzzy`, does that have anything to do with the field of `fuzzy logic`? – SexyBeast Feb 13 '13 at 19:33
  • They both have to do with inexact matching. "Fuzzy logic" came first (1965), so "fuzzy matching" borrowed the word "fuzzy", but I'm not sure how much it borrowed from the formal field of fuzzy logic. By "fuzzy match" is simply meant an inexact match. Specifically, I was thinking about matching word stems (plus maybe synonyms) and parts of speech. – jimhark Feb 13 '13 at 19:56

1 Answer


You will need a more advanced topic modeling algorithm, and of course some corpora to train your model, so that you can easily handle synonyms like giggle and laugh!

In Python, you can try this package: http://radimrehurek.com/gensim/. I have never used it myself, but it includes classic semantic vector space methods like LSA/LSI, random projection and even LDA.

My personal favourite is random projection, because it is faster and still very efficient (though I do it in Java with another library).
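For a feel of what random projection does, here is a tiny NumPy sketch (not gensim's actual API; the corpus and dimensions are made up for the example). Bag-of-words count vectors are multiplied by a random Gaussian matrix, which drops the dimensionality while roughly preserving pairwise distances (the Johnson-Lindenstrauss lemma), which is why the method is fast yet still effective:

```python
import numpy as np

rng = np.random.default_rng(0)

docs = ["giggling at bad jokes",
        "i like to laugh myself silly at poor jokes",
        "stock prices fell sharply today"]

# Bag-of-words count vectors over the corpus vocabulary.
vocab = sorted({w for d in docs for w in d.split()})

def bow(doc):
    v = np.zeros(len(vocab))
    for w in doc.split():
        v[vocab.index(w)] += 1
    return v

X = np.stack([bow(d) for d in docs])   # shape: (n_docs, |vocab|)

# Random projection: a random Gaussian matrix maps |vocab| dims down to k.
k = 5
R = rng.standard_normal((len(vocab), k)) / np.sqrt(k)
Z = X @ R                              # shape: (n_docs, k)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Compare similarities in the low-dimensional projected space.
print(cosine(Z[0], Z[1]))
print(cosine(Z[0], Z[2]))
```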

bendaizer