0

How might I go about finding the relationship between two phrases that are worded completely differently but are related? For example: 1) "Social media websites of today..." 2) "Facebook is extremely popular social networking site..."

While these two phrases do not really have many words in common, they are related (Facebook being a social media website of today). How can I quantify this relation (if it's even possible)?

oneCoderToRuleThemAll
  • Still not sure about the problem. Am I right to assume that you are looking to find similarities or connections between entities? Or are the phrases even meant to be treated as equivalent? – rishi Nov 26 '13 at 13:57
  • @rishi Sorry for the lack of clarity. I am trying to find a relationship between the two phrases that is not necessarily a similarity or a literal connection based on exactly matching terms. Rather, the idea is to find a link between the phrases as a person might: based on external information and inference... – oneCoderToRuleThemAll Nov 26 '13 at 14:33

2 Answers

4

Simple, ineffective way: compute the number of words in common (and/or the words themselves), or the edit-distance between the two sentences but using words rather than characters. In this case, it would pick up that the word "social" appears in both sentences. You could also find a way to detect synonyms, such as "websites" and "site", using some thesaurus data. This might take some work. Common words ("and", "the", ...) could be disregarded, to reduce the chance of coincidental matches.
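A minimal sketch of the simple approach, assuming a tiny hand-written stop-word list and a plain regex tokenizer (both are illustrative choices, not something the answer prescribes). It computes the shared words, a Jaccard overlap score, and a Levenshtein distance over words rather than characters:

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "in"}

def tokenize(sentence):
    """Lowercase the sentence and return its words, minus stop words."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]

def jaccard_overlap(a, b):
    """Shared distinct words divided by total distinct words."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def word_edit_distance(a, b):
    """Levenshtein distance computed over words instead of characters."""
    x, y = tokenize(a), tokenize(b)
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

s1 = "Social media websites of today"
s2 = "Facebook is extremely popular social networking site"
shared = set(tokenize(s1)) & set(tokenize(s2))
print(shared, jaccard_overlap(s1, s2), word_edit_distance(s1, s2))
```

On the two example sentences only "social" is shared, so the overlap is 1 word out of 9 distinct words. That low score illustrates why this method alone is ineffective for phrases that are related by meaning rather than by wording.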

Refinement: Maintain some kind of graph of links between words (e.g. "Facebook" and "networking"), base the weight of the links between words on how often they occur together, and base your metric of relatedness on that. Maintain a list of words which occur too often, and disregard them. Obviously this depends on having some representative "training data" for your algorithm.
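The refinement could be sketched roughly as follows. The four-sentence corpus, the stop-word list, and the scoring rule (average link weight over all cross-sentence word pairs) are all assumptions made for illustration; a real system would need far more training data and probably a normalised weight such as pointwise mutual information.

```python
from collections import Counter
from itertools import combinations
import re

# Words that occur too often to be informative (illustrative list only).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "in"}

def tokenize(text):
    """Return the set of distinct non-stop words in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

def build_cooccurrence(corpus):
    """Count how often each unordered word pair appears in the same document."""
    links = Counter()
    for doc in corpus:
        for pair in combinations(sorted(tokenize(doc)), 2):
            links[pair] += 1
    return links

def relatedness(a, b, links):
    """Average link weight over all pairs (word from a, word from b)."""
    pairs = [(wa, wb) for wa in tokenize(a) for wb in tokenize(b) if wa != wb]
    if not pairs:
        return 0.0
    total = sum(links[tuple(sorted(p))] for p in pairs)
    return total / len(pairs)

# Hypothetical training data, invented for this example.
corpus = [
    "Facebook is a social networking site",
    "Facebook and Twitter are social media websites",
    "Social media websites rely on advertising",
    "The recipe needs flour and butter",
]
links = build_cooccurrence(corpus)

s1 = "Social media websites of today"
s2 = "Facebook is extremely popular social networking site"
print(relatedness(s1, s2, links))
print(relatedness(s1, "flour and butter", links))
```

With this toy corpus the first score is positive, because "Facebook" co-occurs with "social", "media" and "websites" in the training sentences, while the second is zero.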

Complicated, effective way: read up on machine learning.

user234461
3

This is a very generic problem, and you will have to combine multiple approaches to get any respectable results. In fact, what you are describing is the ultimate goal of NLP. I suggest you break the problem down into pieces and address them one by one.

The first piece of the puzzle is to understand whether two sentences are talking about the same or similar entities. This can be done by identifying subjects, objects, verbs, location references, instrumental references, dative references, etc. in each sentence. These references can then be compared to each other. One way that comes to mind is to look at the WordNet distance between them. You will have to build up your vocabulary over a period of time.
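To illustrate the idea of a taxonomy distance without requiring the WordNet data, here is a toy stand-in: a hand-made "is-a" hierarchy (entirely hypothetical) and a path-based score of 1 / (1 + number of edges on the shortest path). WordNet's path similarity is built on the same principle, but the hierarchy and function names below are assumptions for this sketch, not WordNet's API.

```python
# Hypothetical child -> parent ("is-a") links, invented for this example.
PARENT = {
    "facebook": "social_network",
    "twitter": "social_network",
    "social_network": "website",
    "blog": "website",
    "website": "entity",
    "car": "vehicle",
    "vehicle": "entity",
}

def ancestors(word):
    """Return the chain from the word up to the root of the hierarchy."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def path_similarity(a, b):
    """1 / (1 + edges on the path through the lowest common ancestor)."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    depth_in_b = {node: i for i, node in enumerate(chain_b)}
    for i, node in enumerate(chain_a):
        if node in depth_in_b:
            return 1.0 / (1 + i + depth_in_b[node])
    return 0.0  # no common ancestor, e.g. a word missing from the hierarchy

print(path_similarity("facebook", "twitter"))
print(path_similarity("facebook", "car"))
```

In this toy hierarchy "facebook" and "twitter" are two edges apart, while "facebook" and "car" are five edges apart, so the first pair scores higher. With a real lexical database the entities extracted from each sentence would be compared in the same way.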

The second piece of the puzzle is then to address the ethos of the sentence. You will have to employ a machine learning approach here, along with linguistics.

As I said, it is a VERY generic problem and thus quite difficult to solve in one go. If I were you, I would address the problem in the following manner:

Step 1: Start by restricting the solution to one domain. This makes it easier to build a better ontology/vocabulary and to train the models better.

Step 2: Resolve entity proximity, and try to understand which sentences are talking about similar subjects or pointing to similar objects, etc. This step is more of a linguistic problem.

Step 3: With the help of machine learning, try to find sentences which have a similar ethos and tonality.

Step 4: Move to the next domain and repeat the steps.

Hope this helps.

rishi