
I'm trying to find all kinds of references inside a text, such as "Appendix 2", "Section 17" or "Schedule 12.2", using Python. The problem is that some of the matches overlap, and I would like to either join them into a single string or keep only the longest one and drop the substrings.

To keep the code readable, I've created several regex patterns and put them in a list, then I call finditer for every pattern in the list. For each match I collect both the text and its position in the source text as a start and end index.

from re import finditer

def get_references(text):
    # references_regex is the list of regex patterns described above
    refs = [{
        'text': match.group(),
        'span': {
            'start': match.start(),
            'end': match.end()
        }}
        for ref in references_regex for match in finditer(ref, text)]
    return refs

This means that a reference matched by several patterns still ends up in the results several times, either identically or with small variations (e.g. "Section 17.4", "Section 17.4 of the book" and "17.4 of the book").

I've tried to merge the overlapping matches with some ad hoc functions, but they still don't work properly.

Do you know if there's a way to remove duplicates or merge them in case they overlap?

For instance, I have:

[{"text": "Schedule 15.1", "span": {"start": 756, "end": 770}},
 {"text": "15.1 of the Framework Agreement", "span": {"start": 765, "end": 796}},
 {"text": "17.14 of the book", "span": {"start": 1883, "end": 1900}}]

I would like to get:

[{"text": "Schedule 15.1 of the Framework Agreement", "span": {"start": 756, "end": 796}},
 {"text": "17.14 of the book", "span": {"start": 1883, "end": 1900}}]

Thank you in advance!

  • "I've tried to merge overlapping patterns with some ad hoc functions, but still don't work properly." can you elaborate? What did you try? What are your patterns? What functions are you talking about? I think merging your patterns to a single one could be the key here... – Tranbi Feb 14 '23 at 09:17
  • I have tried creating a function that tests whether a match is a substring of another, based on the indexes. If so, we then merge them and save only their merge. But I had a problem storing also the matches with no duplication and multiple versions of the merge were provided. – Aurora Arctic Feb 15 '23 at 15:45
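
A rough sketch of what the single-pattern suggestion in the comment above could look like. The real patterns are not shown in the question, so `KEYWORD`, `NUM`, `SUFFIX` and the sample sentence are hypothetical stand-ins, not the asker's actual regexes. The idea is to make the keyword and the "of the ..." tail parts of one pattern, so the regex engine returns the longest form directly and no merging is needed afterwards.

```python
import re

# Hypothetical building blocks; adapt them to the real reference formats.
KEYWORD = r"(?:Appendix|Section|Schedule)"
NUM = r"\d+(?:\.\d+)*"                          # 17, 12.2, 17.14, ...
SUFFIX = r"\s+of\s+the\s+\w+(?:\s+[A-Z]\w*)*"   # " of the book", " of the Framework Agreement"

# Either "keyword number [suffix]" or "number suffix"; a bare number is not a reference.
REFERENCE = re.compile(rf"{KEYWORD}\s+{NUM}(?:{SUFFIX})?|{NUM}{SUFFIX}")

text = ("See Schedule 15.1 of the Framework Agreement and "
        "17.14 of the book, plus Section 17.")
found = [m.group() for m in REFERENCE.finditer(text)]
print(found)
# ['Schedule 15.1 of the Framework Agreement', '17.14 of the book', 'Section 17']
```

Because finditer only returns non-overlapping matches for a single pattern, the duplicates never appear in the first place. The trade-off is that one large pattern is harder to read than a list of small ones.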

2 Answers

1

Your problem is known as "merging intervals". You can look it up on LeetCode and read through the solutions section.
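
As a minimal sketch of the interval-merging idea on bare (start, end) pairs, with the texts left out: sort by start, then extend the last merged interval whenever the next one begins at or before its end. The function name `merge_spans` is only illustrative, and the end index is assumed to be exclusive, as in the spans returned by `re`.

```python
def merge_spans(spans):
    """Merge overlapping or touching (start, end) pairs; end is exclusive."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous span: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(merge_spans([(765, 796), (756, 770), (1883, 1900)]))
# [(756, 796), (1883, 1900)]
```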

You could try my code below, which implements the solution for your specific problem. It may contain bugs, since I haven't tested it on a larger dataset.

Edit: Please note that your list must be sorted in ascending order of start position.

def process(match_list):
    if not match_list:
        return []

    new_list = []
    new_text = match_list[0]['text']
    start, end = match_list[0]['span']['start'], match_list[0]['span']['end']

    for i in range(1, len(match_list)):
        nxt_start, nxt_end = match_list[i]['span']['start'], match_list[i]['span']['end']
        # If overlap (or directly adjacent)
        if end >= nxt_start:
            # Append only the part that extends past the current end.
            # Assumes end - start == len(text) for every match.
            if nxt_end > end:
                new_text += match_list[i]['text'][end - nxt_start:]
                end = nxt_end
        else:
            # If not overlap, append the text to the result
            new_list.append({'text': new_text, 'span': {'start': start, 'end': end}})
            # Process the next text
            new_text = match_list[i]['text']
            start, end = match_list[i]['span']['start'], match_list[i]['span']['end']

    # Append the last text in the list
    new_list.append({'text': new_text, 'span': {'start': start, 'end': end}})
    return new_list
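
Stitching the match texts together is fragile when a span and its text disagree by even one character. If the original text is still available, an alternative sketch is to merge only the spans and then slice the merged text back out of the source. The helper `merge_matches`, the two patterns and the sample sentence below are hypothetical and only serve the illustration.

```python
import re


def merge_matches(text, matches):
    """Merge overlapping matches and rebuild each text by slicing `text`."""
    spans = []
    for m in sorted(matches, key=lambda m: m['span']['start']):
        start, end = m['span']['start'], m['span']['end']
        if spans and start <= spans[-1][1]:
            # Overlaps (or touches) the previous span: extend it
            spans[-1][1] = max(spans[-1][1], end)
        else:
            spans.append([start, end])
    return [{'text': text[s:e], 'span': {'start': s, 'end': e}} for s, e in spans]


# Hypothetical sample text and patterns
text = "See Schedule 15.1 of the Framework Agreement and 17.14 of the book."
patterns = [r"Schedule \d+\.\d+", r"\d+\.\d+ of the (?:Framework Agreement|book)"]
matches = [
    {'text': m.group(), 'span': {'start': m.start(), 'end': m.end()}}
    for p in patterns for m in re.finditer(p, text)
]
result = merge_matches(text, matches)
print([m['text'] for m in result])
# ['Schedule 15.1 of the Framework Agreement', '17.14 of the book']
```

This version sorts its input itself, so the list does not need to be pre-sorted.
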
Ted Nguyen
  • Thanks for your suggestion! I did try something similar, but it provides some error, since some solutions are erased for no reason or merged by adding multiple letters at the end/start. I'll try to modify it accordingly and post the solution in the comments. – Aurora Arctic Feb 15 '23 at 11:26
0
def get_s_e(x):
    s, e = map(x['span'].get, ['start', 'end'])
    return s, e


def concat_dict(a):
    a = sorted(a, key=lambda x: x['span']['start'], reverse=True)

    index = 0
    while index < len(a):
        cur = a[index]
        try:
            nxt = a[index+1]
        except IndexError:
            # No next element: nothing left to merge
            break
        cur_st, cur_end = get_s_e(cur)
        nxt_st, nxt_end = get_s_e(nxt)

        if cur_st <= nxt_end:
            join_index = cur_st-nxt_st

            if nxt_end >= cur_end:
                text = nxt['text']
                a[index]['span']['end'] = nxt_end
            else:
                text = nxt['text'][:join_index]+cur['text']

            a[index]['text'] = text
            a[index]['span']['start'] = nxt_st

            del a[index+1]
        else:
            index += 1

    return a


a = [{"text": "Book bf dj Schedule 15.1 of the", "span": {"start": 745, "end": 776}},
     {"text": "Schedule 15.1", "span": {"start": 756, "end": 770}},
     {"text": "15.1 of the Framework Agreement", "span": {"start": 765, "end": 796}},
     {"text": "17.14 of the book", "span": {"start": 1883, "end": 1900}}
    ]
print(concat_dict(a))

Output:

[{'text': '17.14 of the book', 'span': {'start': 1883, 'end': 1900}},
 {'text': 'Book bf dj Schedule 15.1 of the Framework Agreement',
  'span': {'start': 745, 'end': 796}}]
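
If only the longest match is wanted and the shorter ones should simply be dropped (the other option mentioned in the question), a small sketch could filter out every match whose span lies inside another match's span. `drop_contained` and the sample data are hypothetical. Note that this removes contained matches only; matches that partially overlap are all kept.

```python
def drop_contained(matches):
    """Keep only matches whose span is not inside another kept match's span."""
    # Longest first, so a containing match is always seen before what it contains
    ordered = sorted(matches, key=lambda m: m['span']['start'] - m['span']['end'])
    kept = []
    for m in ordered:
        s, e = m['span']['start'], m['span']['end']
        if not any(k['span']['start'] <= s and e <= k['span']['end'] for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m['span']['start'])


# Hypothetical sample matches
sample = [
    {'text': 'Section 17.4', 'span': {'start': 0, 'end': 12}},
    {'text': 'Section 17.4 of the book', 'span': {'start': 0, 'end': 24}},
    {'text': '17.4 of the book', 'span': {'start': 8, 'end': 24}},
    {'text': 'Appendix 2', 'span': {'start': 30, 'end': 40}},
]
print([m['text'] for m in drop_contained(sample)])
# ['Section 17.4 of the book', 'Appendix 2']
```
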
Mazhar