I'm trying to find all kinds of references inside a text, such as "Appendix 2", "Section 17" or "Schedule 12.2", using Python. The problem is that some of the matches overlap, and I would like to either join them into a single string or keep only the longest one and drop the substrings.
To keep the code readable, I've split the matching across several regex patterns, put them in a list, and call finditer on each pattern in the list. For each match, I collect both the matched text and its position in the text as start and end indices:
    from re import finditer

    def get_references(text):
        # references_regex is the list of regex patterns described above
        refs = [{
            'text': match.group(),
            'span': {
                'start': match.start(),
                'end': match.end()
            }}
            for ref in references_regex for match in finditer(ref, text)]
        return refs
As a result, a reference matched by several patterns is inserted in the results several times, either as an exact duplicate or with minor variations (e.g. "Section 17.4", "Section 17.4 of the book" and "17.4 of the book").
I've tried to merge the overlapping matches with some ad hoc functions, but they still don't work properly.
Is there a way to remove the duplicates, or to merge the matches when they overlap?
For instance, I have:
[{"text": "Schedule 15.1", "span": {"start": 756, "end": 770}},
{"text": "15.1 of the Framework Agreement", "span": {"start": 765, "end": 796}},
{"text": "17.14 of the book", "span": {"start": 1883, "end": 1900}}]
I would like to get:
[{"text": "Schedule 15.1 of the Framework Agreement", "span": {"start": 756, "end": 796}},
{"text": "17.14 of the book", "span": {"start": 1883, "end": 1900}}]
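To make the expected behaviour concrete, here is a minimal sketch of the kind of merge I have in mind. It is only an illustration, not something I'm sure is the right approach. The name merge_references is made up, and the sketch assumes that the original text is still available (so the merged string can be re-sliced from it), that each span's end index is exclusive, and that spans which merely touch without overlapping should stay separate.

```python
def merge_references(refs, text):
    """Merge references whose spans overlap (hypothetical helper).

    refs: list of {'text': ..., 'span': {'start': ..., 'end': ...}} dicts
    text: the original text the spans refer to (assumed to be available)
    """
    spans = []
    # Sort by start (then end) so that overlapping spans become neighbours
    ordered = sorted(refs, key=lambda r: (r['span']['start'], r['span']['end']))
    for ref in ordered:
        start, end = ref['span']['start'], ref['span']['end']
        if spans and start < spans[-1][1]:
            # Overlaps the previous span: extend it to cover both
            spans[-1][1] = max(spans[-1][1], end)
        else:
            # No overlap: start a new merged span
            spans.append([start, end])
    # Rebuild each merged reference by slicing the original text
    return [{'text': text[s:e], 'span': {'start': s, 'end': e}}
            for s, e in spans]
```

It would be called as merge_references(get_references(text), text). Exact duplicates collapse into one entry because identical spans overlap each other, and a shorter match fully contained in a longer one is absorbed by it.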
Thank you in advance!