I am working on a tool to visualise NER matches and I am trying to handle overlapping entities. I am using spaCy's displacy as a template, and my idea was to add a function after displacy.render() that modifies the HTML string to match my desired output, but I am somewhat stuck. Here's my process so far:
Let's say I have some text and some entities (identified previously by some method):
text = "This is an example sentence with some super secret (and some text in between) stuff"
entities = [{"start": 38, "end": 43, "label": "SUPER"},
            {"start": 38, "end": 50, "label": "SUPER SECRET"},
            {"start": 78, "end": 83, "label": "STUFF"}]
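As a quick sanity check (a small sketch that is not part of displacy), slicing the text with these offsets shows which substrings the entities cover and confirms that the first two overlap. The helper name `find_overlaps` is my own and purely illustrative:

```python
text = "This is an example sentence with some super secret (and some text in between) stuff"
entities = [
    {"start": 38, "end": 43, "label": "SUPER"},
    {"start": 38, "end": 50, "label": "SUPER SECRET"},
    {"start": 78, "end": 83, "label": "STUFF"},
]


def find_overlaps(ents):
    """Return index pairs (i, j) of entities whose character ranges intersect.

    Hypothetical helper for illustration; offsets are treated as half-open
    [start, end) ranges, the same way Python slicing treats them.
    """
    pairs = []
    for i in range(len(ents)):
        for j in range(i + 1, len(ents)):
            a, b = ents[i], ents[j]
            if a["start"] < b["end"] and b["start"] < a["end"]:
                pairs.append((i, j))
    return pairs


# Substrings covered by each entity, in the order given above.
covered = [text[e["start"]:e["end"]] for e in entities]
overlaps = find_overlaps(entities)
```

Here `covered` holds the three matched substrings and `overlaps` lists only the SUPER / SUPER SECRET pair.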
We can render an HTML visualisation of these manually using displacy:
from spacy import displacy
example = [{"text": text,
            "ents": entities,
            "title": None}]
html = displacy.render(example, style="ent", manual=True)
which returns
<div class="entities" style="line-height: 2.5; direction: ltr">This is an example sentence with some
<mark class="entity" style="background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
super
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">SUPER</span>
</mark>
<mark class="entity" style="background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
super secret
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">SUPER SECRET</span>
</mark>
(and some text in between)
<mark class="entity" style="background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
stuff
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">STUFF</span>
</mark>
</div>
The problem here is that "super" is rendered in its own <mark>, separate from "super secret", instead of being nested inside it. Ideally, I'd want to have something like this:
<div class="entities" style="line-height: 2.5; direction: ltr">This is an example sentence with some
<mark class="entity" style="background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
<mark class="entity" style="background: #add; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
super
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">SUPER</span>
</mark>
secret
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">SUPER SECRET</span>
</mark>
(and some text in between)
<mark class="entity" style="background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
stuff
<span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">STUFF</span>
</mark>
</div>
So my idea was to write an html_transform(starting_html: str) -> str function that resolves overlapping entities and returns something like the second HTML snippet above. After writing some pseudocode, though, I realised that this approach might be too complicated, and I'm still stuck. Does anyone know of a better way to approach this? Any pointers would be appreciated!
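For reference, here is a rough sketch of the alternative I have been considering: building the nested markup directly from the character offsets instead of post-processing displacy's output. It does not call displacy at all. The names `render_nested`, `MARK` and `LABEL` are my own, the inline styles are copied from the output above, and it assumes that entities are either disjoint or fully nested and that all offsets lie within the text (partial overlaps raise an error):

```python
from html import escape

# Templates copied from the displacy output above (assumption: same styling is wanted).
MARK = ('<mark class="entity" style="background: {color}; padding: 0.45em 0.6em; '
        'margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">')
LABEL = ('<span style="font-size: 0.8em; font-weight: bold; line-height: 1; '
         'border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">'
         '{label}</span>')


def render_nested(text, ents, colors=("#ddd", "#add")):
    """Render entities as nested <mark> tags; colour depends on nesting depth."""
    # Outer spans first: earlier start, and for equal starts the longer span.
    ordered = sorted(ents, key=lambda e: (e["start"], -e["end"]))
    out = []    # HTML fragments, joined at the end
    stack = []  # currently open entities, innermost last
    pos = 0     # how much of `text` has been emitted so far

    def close_until(offset):
        # Close every open entity that ends at or before `offset`.
        nonlocal pos
        while stack and stack[-1]["end"] <= offset:
            ent = stack.pop()
            out.append(escape(text[pos:ent["end"]]))
            pos = ent["end"]
            out.append(LABEL.format(label=escape(ent["label"])) + "</mark>")

    for ent in ordered:
        close_until(ent["start"])
        if stack and ent["end"] > stack[-1]["end"]:
            raise ValueError("partially overlapping entities are not supported")
        out.append(escape(text[pos:ent["start"]]))
        pos = ent["start"]
        out.append(MARK.format(color=colors[len(stack) % len(colors)]))
        stack.append(ent)

    close_until(len(text))
    out.append(escape(text[pos:]))
    return ('<div class="entities" style="line-height: 2.5; direction: ltr">'
            + "".join(out) + "</div>")


text = "This is an example sentence with some super secret (and some text in between) stuff"
entities = [{"start": 38, "end": 43, "label": "SUPER"},
            {"start": 38, "end": 50, "label": "SUPER SECRET"},
            {"start": 78, "end": 83, "label": "STUFF"}]
html = render_nested(text, entities)
```

With the example above, the SUPER mark ends up inside the SUPER SECRET mark, much like the desired output, but I am not sure whether bypassing displacy.render() like this is a sensible direction or whether partial overlaps can be handled cleanly.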