Is there a way for NLP parsers to identify a list?
For example, "a tiger, a lion and a gorilla" should be identified as a list
(I don't need it to be identified as a list of animals; just a list would be sufficient).

My ultimate aim is to link a common verb or word to every item in the list. For example, consider the sentence "He found a pen, a book and a flashlight". Here, the verb "found" should be linked to all three items.

Another example: "He was tested negative for cancer, anemia and diabetes". Here, the word "negative" should be linked to the three diseases.

Is this possible with any of the open-source NLP packages like OpenNLP or Stanford CoreNLP? Any other solution?


EDIT:
As mentioned in one of the answers, my initial idea was to parse the list manually and find the items by looking at the placement of commas, etc.

But then I discovered Stanford NLP's OpenIE model. This seems to be doing a pretty good job.
For example, "He has a pen and a book" gives the 2 relations (He;has;a pen) and (He;has;a book).

The problem with the model is that it doesn't work for incomplete sentences such as "has a pen and a book".
(From what I understand, this is because OpenIE can only extract triples.)
It also fails when negation is involved, e.g. "He has no pens".

Is there a solution to these problems? What are the best solutions available currently for information extraction?

  • Since you mention parsing and Stanford NLP: Have you checked how [Stanford's online version of the parser](http://nlp.stanford.edu:8080/parser/) treats the "lists"? They're grouped together as a noun phrase, and this noun phrase is attached to the verb. (Getting this information out of the dependencies is a bit more involved, though.) – lenz May 22 '17 at 14:20

2 Answers

I'm afraid the full answer could fill the better part of a PhD thesis :)

There is no generic tool that does exactly what you need, so you will have to write it yourself. If you look at this example, you can see that you can extract the list by starting from the token "and" or a comma and then traversing the dependency graph around it to collect the items. In this particular case you can follow the conj and appos relations that link the smaller noun phrases.
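
The traversal can be sketched in a few lines of plain Python. This is only an illustration: the edges below are written by hand and are an assumption about what a dependency parser would return for the asker's sentence (label names and attachments differ between parsers), and `build_children`, `collect_list` and `link_head_to_items` are hypothetical helper names, not part of any library.

```python
from collections import defaultdict

# Hand-written dependency edges (head, relation, dependent) for
# "He found a pen, a book and a flashlight".
# Assumption: these approximate a parser's output; real labels vary.
# Tokens are plain strings here; real code should use token indices so
# that repeated words in a sentence stay distinct.
EDGES = [
    ("found", "nsubj", "He"),
    ("found", "dobj", "pen"),
    ("pen", "conj", "book"),
    ("pen", "conj", "flashlight"),
    ("pen", "cc", "and"),
]

LIST_RELATIONS = {"conj", "appos"}


def build_children(edges):
    """Index the edges as head -> [(relation, dependent), ...]."""
    children = defaultdict(list)
    for head, rel, dep in edges:
        children[head].append((rel, dep))
    return children


def collect_list(first_item, children):
    """Return first_item plus everything reachable via conj/appos edges."""
    items, stack = [], [first_item]
    while stack:
        token = stack.pop()
        items.append(token)
        # Push in reverse so items come out in their original order.
        for rel, dep in reversed(children.get(token, [])):
            if rel in LIST_RELATIONS:
                stack.append(dep)
    return items


def link_head_to_items(head, relation, edges):
    """Pair head with every list item hanging off its `relation` dependent."""
    children = build_children(edges)
    pairs = []
    for rel, dep in children.get(head, []):
        if rel == relation:
            pairs.extend((head, item) for item in collect_list(dep, children))
    return pairs


print(link_head_to_items("found", "dobj", EDGES))
# [('found', 'pen'), ('found', 'book'), ('found', 'flashlight')]
```

With a real parser, the hand-written `EDGES` would be replaced by the dependencies it produces, and the relation passed to `link_head_to_items` would be whatever label that parser uses for direct objects.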

You could also look at POS tag patterns like (N*, ,, N*, CC, N*). This is a hack, but it's probably your best approach if you want fast results and are willing to sacrifice some recall.
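
A rough sketch of such a tag pattern, again in plain Python. It assumes Penn Treebank tags and hand-tagged input (in practice the tags would come from a POS tagger), and it loosens the pattern slightly to allow a determiner and adjectives before each noun. `read_noun_phrase` and `find_lists` are hypothetical names used only for this illustration.

```python
# Hand-tagged tokens (Penn Treebank tags) for
# "He found a pen, a book and a flashlight".
# Assumption: in practice these would come from a POS tagger.
TAGGED = [
    ("He", "PRP"), ("found", "VBD"),
    ("a", "DT"), ("pen", "NN"), (",", ","),
    ("a", "DT"), ("book", "NN"),
    ("and", "CC"),
    ("a", "DT"), ("flashlight", "NN"),
]


def read_noun_phrase(tagged, i):
    """Read DT? JJ* NN+ starting at i; return (last_noun, next_index) or None."""
    if i < len(tagged) and tagged[i][1] == "DT":
        i += 1
    while i < len(tagged) and tagged[i][1].startswith("JJ"):
        i += 1
    start = i
    while i < len(tagged) and tagged[i][1].startswith("NN"):
        i += 1
    if i == start:
        return None
    return tagged[i - 1][0], i


def find_lists(tagged):
    """Find spans shaped like NP (, NP)* ,? CC NP; return their last nouns."""
    lists, i = [], 0
    while i < len(tagged):
        first = read_noun_phrase(tagged, i)
        if first is None:
            i += 1
            continue
        items, j = [first[0]], first[1]
        closed = False  # becomes True once the CC-introduced item is read
        while j < len(tagged) and not closed:
            k = j
            if tagged[k][1] == ",":
                k += 1
            if k < len(tagged) and tagged[k][1] == "CC":
                k += 1
                closed = True
            if k == j:
                break  # no separator, so the enumeration ends here
            nxt = read_noun_phrase(tagged, k)
            if nxt is None:
                closed = False  # separator without a following noun phrase
                break
            items.append(nxt[0])
            j = nxt[1]
        if closed and len(items) >= 2:
            lists.append(items)
        i = j
    return lists


print(find_lists(TAGGED))
# [['pen', 'book', 'flashlight']]
```

Because this works on tags alone and needs no full parse, it also handles fragments such as "has a pen and a book", provided the tagger labels them sensibly. It only keeps the last noun of each item and will miss lists that do not fit the pattern, which is the recall cost mentioned above.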

As for your requirement to include modifiers such as negation -- this is a separate task that should come after you've identified the list.
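
As a very rough illustration of that second pass, the sketch below flags each (verb, item) pair as negated when a negation marker hangs directly off the verb or the item. The edges are hand-written for "He has no pens", and the way negation is encoded here (a "neg" relation, or a negation word attached as det/advmod) is an assumption; parsers differ on this. `is_negated` and `annotate` are hypothetical names.

```python
# Hand-written edges (head, relation, dependent) for "He has no pens".
# Assumption: negation appears either as a "neg" relation or as a negation
# word attached as det/advmod; check what your parser actually emits.
EDGES = [
    ("has", "nsubj", "He"),
    ("has", "dobj", "pens"),
    ("pens", "det", "no"),
]

NEGATION_WORDS = {"no", "not", "n't", "never"}


def is_negated(token, edges):
    """True if token has a negation marker among its direct dependents."""
    for head, rel, dep in edges:
        if head != token:
            continue
        if rel == "neg":
            return True
        if rel in {"det", "advmod"} and dep.lower() in NEGATION_WORDS:
            return True
    return False


def annotate(verb, items, edges):
    """Return (verb, item, is_affirmative) for items found in an earlier pass."""
    verb_negated = is_negated(verb, edges)
    return [
        (verb, item, not (verb_negated or is_negated(item, edges)))
        for item in items
    ]


print(annotate("has", ["pens"], EDGES))
# [('has', 'pens', False)]
```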

Aleksandar Savkov
  • Thank you for your reply! Yes, I was also initially thinking about this approach. But I was kind of hesitant because I don't know much about parsing or the tag and dependency meanings. I found another solution which seems to kind of work for me. Please see the edit for updated info. –  May 31 '17 at 22:47

What you are trying to do is called Information Extraction.

In your case, the task is to extract basic propositions about a set of entities (given as an enumeration) instead of just one entity (which is the usual scenario). For example, you want to extract the following three propositions from the sentence "He found a pen, a book and a flashlight.":

  • find(X, pen)
  • find(X, book)
  • find(X, flashlight)
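
Once the enumeration has been identified by some other means, producing these propositions is just a matter of distributing the predicate over the items. A minimal sketch, where `Proposition` and `expand` are hypothetical names chosen for this illustration:

```python
from typing import NamedTuple


class Proposition(NamedTuple):
    predicate: str
    subject: str
    obj: str

    def __str__(self):
        return f"{self.predicate}({self.subject}, {self.obj})"


def expand(predicate, subject, items):
    """Distribute one predicate over every item of an enumeration."""
    return [Proposition(predicate, subject, item) for item in items]


props = expand("find", "X", ["pen", "book", "flashlight"])
for p in props:
    print(p)
# find(X, pen)
# find(X, book)
# find(X, flashlight)
```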

X stands for the entity referred to as "He". As Mr. Savkov already pointed out, information extraction is quite a hard problem whose solution is beyond the scope of a Stack Overflow answer.

There are many approaches to information extraction. As suggested by Mr. Savkov, a solution based on POS tags might be a good starting point. I suggest taking a look at this nice tutorial based on NLTK (especially section 2.2. "Tag Patterns") and this paper.

zepp133
  • Hi seble! It looks like Stanford CoreNLP's OpenIE model kind of does the job for me. But it has certain limitations which are relevant to me. Please see the updated question for more info. –  May 31 '17 at 22:49
  • I wanted to know more about information extraction. Is Stanford's OpenIE the same as the one offered by the University of Washington on [GitHub](https://github.com/knowitall/openie)? Are there other IE models worth checking? Thanks! –  May 31 '17 at 22:52