Modifying displacy NER annotation tool to support overlapping entities

Question

I am working on a tool to visualise NER matches, and I am trying to handle overlapping entities. I am using spaCy's displacy as a template. My idea was to add a function after displacy.render() that modifies the returned HTML string to match my desired output, but I am somewhat stuck. Here's my process so far:

Let's say I have some text and some entities (identified previously by some method):

text = "This is an example sentence with some super secret (and some text in between) stuff"
entities = [{"start": 38, "end": 43, "label": "SUPER"},
            {"start": 38, "end": 50, "label": "SUPER SECRET"},
            {"start": 78, "end": 83, "label": "STUFF"}]
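As a quick sanity check (a minimal sketch, assuming the offsets are character offsets with an exclusive end, which is how displacy's manual mode reads them), slicing the text with each entity should give back the expected span. The name spans is only for illustration:

```python
text = "This is an example sentence with some super secret (and some text in between) stuff"
entities = [{"start": 38, "end": 43, "label": "SUPER"},
            {"start": 38, "end": 50, "label": "SUPER SECRET"},
            {"start": 78, "end": 83, "label": "STUFF"}]

# Slice the text with each entity's offsets to see what it actually covers.
spans = [(ent["label"], text[ent["start"]:ent["end"]]) for ent in entities]
for label, covered in spans:
    print(f"{label}: {covered!r}")
```

The first two entities start at the same offset, and the first lies entirely inside the second, which is the overlap in question.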

We can render an HTML visualisation of these manually using displacy:

from spacy import displacy

example = [{"text": text,
            "ents": entities,
            "title": None}]
html = displacy.render(example, style="ent", manual=True)

which returns

<div class="entities" style="line-height: 2.5; direction: ltr">This is an example sentence with some
<mark class="entity" style="background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    super
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">SUPER</span>
</mark>

<mark class="entity" style="background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    super secret
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">SUPER SECRET</span>
</mark>
 (and some text in between)
<mark class="entity" style="background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    stuff
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">STUFF</span>
</mark>
</div>

which looks like this:

The problem here is that "super" gets its own <mark>, separate from and outside the one for "super secret", so the word is rendered twice. Ideally, I'd want to have something like this:

<div class="entities" style="line-height: 2.5; direction: ltr">This is an example sentence with some
<mark class="entity" style="background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    <mark class="entity" style="background: #add; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
        super
        <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">SUPER</span>
    </mark>
    secret
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">SUPER SECRET</span>
</mark>
 (and some text in between)
<mark class="entity" style="background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    stuff
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">STUFF</span>
</mark>
</div>

which looks like this:

So my idea was to write an html_transform(starting_html: str) -> str function that resolves overlapping entities and returns something like the bottom image. After writing some pseudocode, though, I realised that this might be too complicated an approach, and I'm still stuck. Does anyone know of a better way of approaching this? Any pointers would be appreciated!
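One direction that might be simpler than transforming the HTML after the fact is to build the nested markup directly from the text and the entity list. The following is only a rough sketch under stated assumptions, not displacy's own API: render_nested, MARK_STYLE, LABEL_STYLE and the colors parameter are hypothetical names, the styles are copied from the output above, and it only handles entities that are fully contained in one another (entities that cross each other's boundaries cannot be expressed as nested <mark> elements, so it raises an error for those).

```python
from html import escape

# Styles copied from the displacy output shown above.
MARK_STYLE = ("background: {bg}; padding: 0.45em 0.6em; margin: 0 0.25em; "
              "line-height: 1; border-radius: 0.35em;")
LABEL_STYLE = ("font-size: 0.8em; font-weight: bold; line-height: 1; "
               "border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem")


def render_nested(text, ents, colors=("#ddd", "#add")):
    """Hypothetical helper: render contained entities as nested <mark> tags."""
    # Sort by start, longest first, so outer entities are opened before inner ones.
    ordered = sorted(ents, key=lambda e: (e["start"], -e["end"]))
    out = []    # HTML fragments, joined at the end
    stack = []  # entities whose <mark> is currently open, outermost first
    pos = 0     # index of the next character of text not yet emitted

    def close_top():
        # Emit the rest of the innermost open entity, its label, and its closing tag.
        nonlocal pos
        ent = stack.pop()
        out.append(escape(text[pos:ent["end"]]))
        out.append(f'<span style="{LABEL_STYLE}">{escape(ent["label"])}</span></mark>')
        pos = ent["end"]

    for ent in ordered:
        # Close every open entity that ends before this one starts.
        while stack and stack[-1]["end"] <= ent["start"]:
            close_top()
        if stack and ent["end"] > stack[-1]["end"]:
            raise ValueError(f"crossing entities cannot be nested: {ent!r}")
        out.append(escape(text[pos:ent["start"]]))
        pos = ent["start"]
        # Alternate the background colour by nesting depth.
        bg = colors[len(stack) % len(colors)]
        out.append(f'<mark class="entity" style="{MARK_STYLE.format(bg=bg)}">')
        stack.append(ent)

    while stack:
        close_top()
    out.append(escape(text[pos:]))
    body = "".join(out)
    return f'<div class="entities" style="line-height: 2.5; direction: ltr">{body}</div>'


text = "This is an example sentence with some super secret (and some text in between) stuff"
entities = [{"start": 38, "end": 43, "label": "SUPER"},
            {"start": 38, "end": 50, "label": "SUPER SECRET"},
            {"start": 78, "end": 83, "label": "STUFF"}]
html = render_nested(text, entities)
```

With the example above, "super" ends up in an inner <mark> inside the one for "super secret", and each word of the text is emitted only once. Whether this fits depends on how much of displacy's other behaviour (titles, per-label colours, line breaks) is needed, since none of that is reproduced here.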
