Python Textacy pos_regex_matches vs matches

Question

I'm trying to find verbs in a sentence with Python for an NLP problem. I found an old answer here on Stack Overflow that works with the deprecated pos_regex_matches. With the new matches function I have run into an annoying problem: it returns every match, not only the longest one (which is what pos_regex_matches does).

pattern = r'<VERB>*<ADV>*<VERB>+<PART>*'
verb_pattern = [{"POS": "VERB", "OP": "*"},{"POS": "ADV", "OP": "*"},{"POS": "VERB", "OP": "+"},{"POS": "PART", "OP": "*"}]

t_list_1 = textacy.extract.pos_regex_matches(text, pattern)
t_list_2 = textacy.extract.matches(text, verb_pattern)

As you can see, the pattern is the same, but the one passed to matches is written in the new format. The old pos_regex_matches returns, for example, "was celebrating", while the new matches returns both "was" and "was celebrating". Has anyone encountered the same problem? Is it a problem with the pattern or with textacy?

Thanks in advance

score 2 · Answer 1 · answered May 29 '20 at 12:29


I have had the same issue. A quick solution may be filter_spans from spaCy's utilities, which removes duplicate and overlapping spans and prefers the longest span when several overlap.

Specifically, here is an attempt to fix your example.

from spacy.util import filter_spans

t_list_2 = filter_spans(t_list_2)
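
For intuition, here is a rough pure-Python sketch of the longest-first filtering idea, working on plain (start, end) token offsets instead of spaCy Span objects. The function name keep_longest and the sample offsets are made up for illustration, and this is not spaCy's actual implementation, only an approximation of the behaviour described above.

```python
def keep_longest(spans):
    """Keep non-overlapping (start, end) pairs, preferring longer ones."""
    # Longest first; on equal length, the span that starts earlier wins.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    used = set()  # token indices already covered by a kept span
    for start, end in ordered:
        if all(i not in used for i in range(start, end)):
            kept.append((start, end))
            used.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept)


# Hypothetical token offsets: "was" = (1, 2), "was celebrating" = (1, 3),
# "celebrating" = (2, 3), plus an unrelated verb at (6, 7).
candidates = [(1, 2), (1, 3), (2, 3), (6, 7)]
longest = keep_longest(candidates)
print(longest)
```

With these sample offsets only the two maximal spans survive, which is the result the old pos_regex_matches would have given for the "was celebrating" case.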

I hope it will help you.


Ilias Tsoumas


score 0 · Answer 2 · answered Apr 01 '20 at 13:03

I have the same problem. I have not been able to find a flag that enables greedy matching so that only the longest matches are returned and not their subparts, so I have written this small piece of code that manually removes the matches that are not maximal.

import textacy

pattern = r'<VERB>*<ADV>*<VERB>+<PART>*'
verb_pattern = [{"POS": "VERB", "OP": "*"}, {"POS": "ADV", "OP": "*"}, {"POS": "VERB", "OP": "+"}, {"POS": "PART", "OP": "*"}]

t_list_1 = textacy.extract.pos_regex_matches(text, pattern)
# matches() yields spans lazily, so build a list that can be indexed below
t_list_2 = list(textacy.extract.matches(text, verb_pattern))

# take the longest when overlapping
for i, el_i in enumerate(t_list_2):
    for j in range(i):
        el_j = t_list_2[j]
        if el_j is None:
            # already discarded in an earlier iteration
            continue
        if el_j.start <= el_i.start and el_j.end >= el_i.end:
            # el_i inside el_j
            t_list_2[i] = None
            break
        elif el_i.start <= el_j.start and el_i.end >= el_j.end:
            # el_j inside el_i
            t_list_2[j] = None
        elif el_i.end > el_j.start and el_i.start < el_j.end:
            raise ValueError('partial overlap?')
# drop the discarded entries
t_list_2 = [el for el in t_list_2 if el is not None]
