
I am working on named entity and attribute extraction. My objective is to extract the attributes associated with a particular entity in a sentence.

For example - "The Patient report is Positive for ABC disease"

In the sentence above, ABC is an entity and Positive is an attribute describing ABC.

I am looking for a concise approach to extract the attributes. I have already built a solution that extracts the entities with respectable accuracy, and I am now working on the second part of the problem: extracting the attributes associated with each entity.

I tried extracting attributes with a rule-based approach, which gives decent results but has the following drawbacks:

  • The source code is hard to maintain.
  • It is not generic at all, so new scenarios are difficult to handle.
  • It is time-consuming to write and extend.

To find a more generic solution, I explored different NLP techniques and found dependency tree parsing to be a promising candidate.

I am looking for suggestions on how to solve this problem with dependency tree parsing in Python or Java.

Feel free to suggest any other technique which could potentially help here.

Parvez Khan

1 Answer


I suggest using the spaCy Python library because it is easy to use and has a decent dependency parser.

A baseline solution would traverse the dependency tree in a breadth-first fashion starting from your entity of interest, until it encounters a token that looks like an attribute or until it walks too far from the entity.

Further improvements to this solution would include:

  • Some rules for handling negations such as "not positive"
  • A better classifier for attributes (here I just look for adjectives)
  • Some rules about what types of dependency and what tokens should be taken into account

Here is my baseline code:

from collections import deque

import spacy

nlp = spacy.load("en_core_web_sm")
text = "The Patient report is Positive for ABC disease"
doc = nlp(text)
# Map each surface form to its token; if a word repeats, the last occurrence wins
tokens = {token.text: token for token in doc}

def is_attribute(token):
    # todo: use a classifier to determine whether the token is an attribute
    return token.pos_ == 'ADJ'

def bfs(token, predicate, max_distance=3):
    queue = deque([(token, 0)])
    # Track visited tokens: the root is its own head in spaCy, and every
    # child links back to its parent, so the walk would otherwise revisit nodes
    visited = {token.i}
    while queue:
        t, dist = queue.popleft()
        if max_distance and dist > max_distance:
            return None
        if predicate(t):
            return t
        # todo: maybe, consider only specific types of dependencies or tokens
        neighbors = [t.head] + list(t.children)
        for n in neighbors:
            if n.text and n.i not in visited:
                visited.add(n.i)
                queue.append((n, dist + 1))
    return None

print(bfs(tokens['ABC'], is_attribute))  # Positive
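
Two of the improvements listed above (negation handling and restricting which dependency types are followed) could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested extension of the baseline. The names Tok, link, ALLOWED_DEPS, is_negated and find_attribute are hypothetical. Tok is a stand-in that exposes only the spaCy Token attributes used here (i, text, pos_, dep_, head, children), so the sketch runs without a model. The parse of the example sentence is built by hand and is an assumed tree, not real parser output, and the whitelist of dependency labels is a guess that would need tuning on real data.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(eq=False)
class Tok:
    """Hypothetical stand-in for the spaCy Token attributes used below."""
    i: int
    text: str
    pos_: str
    dep_: str
    head: Optional["Tok"] = field(default=None, repr=False)
    children: List["Tok"] = field(default_factory=list, repr=False)


def link(head: Tok, child: Tok) -> None:
    """Attach child to head (helper for building the demo tree only)."""
    child.head = head
    head.children.append(child)


# Assumed whitelist of dependency labels worth following; tune for your data.
ALLOWED_DEPS = {"compound", "amod", "pobj", "prep", "acomp", "attr", "nsubj"}


def is_negated(token) -> bool:
    """True if the token or its head has a child labelled 'neg'."""
    candidates = list(token.children)
    if token.head is not None and token.head is not token:
        candidates.extend(token.head.children)
    return any(c.dep_ == "neg" for c in candidates)


def find_attribute(start, predicate: Callable, allowed_deps=ALLOWED_DEPS,
                   max_distance: int = 3) -> Tuple[Optional[object], bool]:
    """Breadth-first search that only crosses whitelisted dependency edges.

    Returns (attribute_token, negated), or (None, False) if nothing is found.
    """
    queue = deque([(start, 0)])
    seen = {start.i}
    while queue:
        tok, dist = queue.popleft()
        if tok is not start and predicate(tok):
            return tok, is_negated(tok)
        if dist == max_distance:
            continue  # do not expand beyond the distance limit
        neighbors = []
        # The edge from tok up to its head carries tok's own label.
        if tok.head is not None and tok.head is not tok and tok.dep_ in allowed_deps:
            neighbors.append(tok.head)
        # The edge from tok down to a child carries the child's label.
        neighbors.extend(c for c in tok.children if c.dep_ in allowed_deps)
        for n in neighbors:
            if n.i not in seen:
                seen.add(n.i)
                queue.append((n, dist + 1))
    return None, False


# Hand-built (assumed) parse of: "The report is not positive for ABC disease"
the = Tok(0, "The", "DET", "det")
report = Tok(1, "report", "NOUN", "nsubj")
is_ = Tok(2, "is", "AUX", "ROOT")
not_ = Tok(3, "not", "PART", "neg")
positive = Tok(4, "positive", "ADJ", "acomp")
for_ = Tok(5, "for", "ADP", "prep")
abc = Tok(6, "ABC", "PROPN", "compound")
disease = Tok(7, "disease", "NOUN", "pobj")
for h, c in [(report, the), (is_, report), (is_, not_), (is_, positive),
             (positive, for_), (for_, disease), (disease, abc)]:
    link(h, c)

attr, negated = find_attribute(abc, lambda t: t.pos_ == "ADJ")
print(attr.text, negated)  # positive True
```

Because the functions only read i, pos_, dep_, head and children, they should also accept real spaCy tokens, for example find_attribute(tokens['ABC'], is_attribute), although which labels the parser actually assigns depends on the model. Checking the head's children for a 'neg' label is a deliberate choice: in a parse like the one assumed here, "not" attaches to the verb rather than to the adjective.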
David Dale
  • Thank you so much for providing your valuable inputs. Definitely a good start and will start adding more content and attribute classifier to extract precise attributes. Also started looking at different patterns in parse tree to have the correct conditions in BFS code. :) – Parvez Khan Oct 30 '20 at 12:28