
Are there any known ways (beyond statistical analysis, though not necessarily excluding it as part of the solution) to relate sentences or concepts to one another using Natural Language Processing? So far I've only worked with NLTK and Stanford-NLP in my project, but I am open to alternative open-source solutions.

As an example, take the following George Orwell essay (http://orwell.ru/library/essays/wiw/english/e_wiw). Suppose I gave the application the sentence

"What are George Orwell's opinions on writers?"

or perhaps

"George Orwell believes writers enjoy writing to express their creativity, to make a point and for their egos."

This might yield lines from the essay such as

"The aesthetic motive is very feeble in a lot of writers, but even a pamphleteer or writer of textbooks will have pet words and phrases which appeal to him for non-utilitarian reasons; or he may feel strongly about typography, width of margins, etc."

or

"Serious writers, I should say, are on the whole more vain and self-centered than journalists, though less interested in money."

I understand that this is not easy and I may not achieve much accuracy, but I was hoping for ideas on what already exists and what I could try to start with, or at least how to get the best results possible based on what is already known and out there.

2 Answers


The simplest way of doing this might be to use a distance function (such as cosine similarity) between your query sentence and the sentence pool. It's easy to implement: create a vocabulary from the text collection and represent each sentence as a vector. You can use TF-IDF weights as the values in the vector, calculate the cosine similarity between sentences, and return the highest-scoring sentence with respect to your query sentence.
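A minimal sketch of that idea in plain Python, no external libraries needed (the function names `tfidf_vectors`, `cosine`, and `best_match` are just illustrative, and the tokenization is naive whitespace splitting):

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Build TF-IDF vectors over a shared vocabulary for a list of sentences."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(tokenized)
    # Document frequency: how many sentences contain each word.
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[w] * math.log(n / df[w]) if w in tf else 0.0
                        for w in vocab])
    return vectors

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_match(query, pool):
    """Return the sentence from the pool most similar to the query."""
    vecs = tfidf_vectors([query] + pool)
    scores = [cosine(vecs[0], v) for v in vecs[1:]]
    return pool[scores.index(max(scores))]
```

In practice you wouldn't roll this yourself: scikit-learn's `TfidfVectorizer` or an index like Lucene will do the same thing at scale, with proper tokenization and smoothing.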

Or you can build an index from your corpus using, for example, Lucene, and let it do the work for you.

You may also consider using LSA (Latent Semantic Analysis), which can give you similarity between sentences that goes beyond exact word overlap.

NLPer
  • I've tried some of those approaches, but they are mostly "bag of words" techniques. I guess I'm more interested in knowing whether there are any known approaches to understanding meaning in sentences beyond matching words. I know it's a difficult and deeply studied field; I've just been having trouble pinpointing where it stands at this point. Thanks for the response! – user2926522 Oct 30 '13 at 00:35

From what I understand of your question (and also your comment), you are more interested in understanding the meaning of individual sentences and then relating them to one another. A statistical approach, in my opinion, is more about "getting a feel" for a sentence than understanding it. I would suggest a deep parsing approach.

Deep parse the sentence, work out what roles the words play in it, identify the subject-verb-object structure (using left-to-right parsing and similar techniques), and then build a vocabulary that helps you categorise the nouns and verbs.

e.g.

"Serious writers, I should say, are on the whole more vain and self-centered than journalists, though less interested in money."

Parsing this sentence lets you identify its subject as "serious writers" ("serious" being an adjective modifying "writers"). Among the verb forms it has "are" (a current state) and "interested". Each verb then points to further vocabulary, including adjectives. If you arrange this vocabulary in the correct manner (and keep building it), I think you should get somewhere with your problem.
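As a toy illustration of that left-to-right idea (this is not a real parser — in practice you would use the Stanford Parser or NLTK; the simplified tag set and the `extract_svo` helper are invented for this sketch):

```python
def extract_svo(tagged):
    """Toy subject/verb extraction from a POS-tagged sentence, scanning
    left to right: the adjective/noun run before the first verb is taken
    as the subject, and all verb tokens are collected.
    `tagged` is a list of (word, tag) pairs using simplified tags."""
    subject, verbs = [], []
    seen_verb = False
    for word, tag in tagged:
        if tag.startswith("V"):
            verbs.append(word)
            seen_verb = True
        elif tag in ("ADJ", "NOUN") and not seen_verb:
            subject.append(word)
    return " ".join(subject), verbs
```

Run on a hand-tagged version of the Orwell sentence above, it recovers "Serious writers" as the subject and "are"/"interested" as the verbs, which is the vocabulary you would then categorise and keep building.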

rishi