9

In spaCy 2.x, I use the Matcher to find specific tokens in my text corpus. Each rule has an ID ('class-1_0' for example). During parsing, I use the on_match callback to handle each match. Is there a way to retrieve, directly in the callback, the rule that produced the match?

Here is my sample code.

import spacy
from spacy.matcher import Matcher

txt = ("Aujourd'hui, je vais me faire une tartine au beurre "
       "de cacahuète, c'est un pilier de ma nourriture "
       "quotidienne.")

nlp = spacy.load('fr')

def on_match(matcher, doc, id, matches):
    span = doc[matches[id][1]:matches[id][2]]
    print(span)
    # find a way to get the corresponding rule name without fuss

matcher = Matcher(nlp.vocab)
matcher.add('class-1_0', on_match, [{'LEMMA': 'pilier'}])
matcher.add('class-1_1', on_match, [{'LEMMA': 'beurre'}, {'LEMMA': 'de'}, {'LEMMA': 'cacahuète'}])

doc = nlp(txt)
matches = matcher(doc)

In this case, matches returns:

[(12071893341338447867, 9, 12), (4566231695725171773, 16, 17)]

12071893341338447867 is a unique ID derived from the rule name (here class-1_1, since tokens 9 to 12 are the three-token pattern). I cannot recover the original rule name, even with some introspection of matcher._patterns.

It would be great if someone can help me. Thank you very much.

user3313834
k3z

2 Answers

12

Yes – you can simply look up the ID in the StringStore of your vocabulary, available via nlp.vocab.strings or doc.vocab.strings. Going via the Doc is pretty convenient here, because you can do so within your on_match callback:

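def on_match(matcher, doc, i, matches):
    # i is the index of the current match in matches, not the match ID itself
    match_id, start, end = matches[i]
    string_id = doc.vocab.strings[match_id]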

For efficiency, spaCy encodes all strings to integers and keeps a reference to the mapping in the StringStore lookup table. In spaCy v2.0, the integers are hash values, so they'll always match across models and vocabularies. For more details on this, see this section in the docs.
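To make the idea concrete, here is a toy sketch of a hash-based string table. It is not spaCy's implementation: `ToyStringStore`, `hash_string` and the use of `hashlib` are assumptions for illustration only, and the integers it produces will not match spaCy's. It only shows why the same string always maps to the same integer, and why you need the table to get the string back.

```python
import hashlib


class ToyStringStore:
    """Toy two-way table between strings and 64-bit integer keys.

    Illustration only: spaCy's real StringStore uses a different hash
    function, so the integers produced here will not match spaCy's.
    """

    def __init__(self):
        self._strings = {}  # integer key -> original string

    @staticmethod
    def hash_string(text):
        # Take the first 8 bytes of a SHA-256 digest as a stable 64-bit key.
        digest = hashlib.sha256(text.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, text):
        # Remember the string so the integer can be resolved later.
        key = self.hash_string(text)
        self._strings[key] = text
        return key

    def __getitem__(self, item):
        # String in -> key out; integer in -> stored string out.
        if isinstance(item, str):
            return self.hash_string(item)
        return self._strings[item]


store = ToyStringStore()
rule_key = store.add("class-1_0")
print(rule_key)         # a large integer
print(store[rule_key])  # class-1_0
```

Because the key depends only on the string, two separate tables agree on the integer for 'class-1_0', which is the property the answer relies on.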

Of course, if your classes and IDs are kinda cryptic anyways, the other answer suggesting integer IDs will work fine, too. Just keep in mind that those integer IDs you choose will likely also be mapped to some random string in the StringStore (like a word, or a part-of-speech tag or something). This usually doesn't matter if you're not looking them up and resolving them to strings somewhere – but if you do, the output may be confusing. For example, if your matcher rule ID is 99 and you're calling doc.vocab.strings[99], this will return 'VERB'.
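Putting the lookup to work, a callback could group matches by rule name as they arrive. This is a minimal sketch: `make_collector` and `results` are hypothetical names, not part of spaCy. The only spaCy behaviour it assumes is what this thread describes, namely that the third callback argument is an index into `matches` and that `doc.vocab.strings` resolves the hashed ID.

```python
from collections import defaultdict


def make_collector():
    """Return (callback, results); results maps rule name -> list of (start, end)."""
    results = defaultdict(list)

    def on_match(matcher, doc, i, matches):
        # i is the position of the current match inside matches.
        match_id, start, end = matches[i]
        # Resolve the hashed ID back to the name given to matcher.add().
        rule_name = doc.vocab.strings[match_id]
        results[rule_name].append((start, end))

    return on_match, results
```

You would register the returned callback the same way as in the question, e.g. `matcher.add('class-1_0', on_match, [{'LEMMA': 'pilier'}])`, and inspect `results` after calling `matcher(doc)`.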

Ines Montani
  • Thank you. I tested your answer and it points in the right direction. But to get the string ID you need to use the integer-encoded match rule, not `match_id`: `string_id = doc.vocab.strings[matches[id][0]]`. Thanks again. – k3z Nov 29 '17 at 07:29
  • And thanks for your incredible achievement with spacy 2.0 :) – k3z Nov 29 '17 at 08:11
2

While writing my question, as often, I found the solution.

It's dead simple: instead of using a unicode rule ID like class-1_0, simply use an integer. The identifier is preserved throughout the process.

matcher.add(1, on_match, [{'LEMMA': 'pilier'}])

The matches then contain the integer as-is:

[(1, 16, 17),]
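If you go this route and still want readable names, one option is to keep your own lookup table next to the rules instead of asking the StringStore. A small sketch, where `RULE_NAMES` and `rule_name_for` are hypothetical names and the ID 2 for class-1_1 is an assumption:

```python
# Hypothetical mapping from the integer IDs passed to matcher.add()
# back to the human-readable rule names.
RULE_NAMES = {
    1: "class-1_0",
    2: "class-1_1",
}


def rule_name_for(match):
    """Return the rule name for a (match_id, start, end) tuple."""
    match_id, start, end = match
    # Fall back to a placeholder rather than raising on unknown IDs.
    return RULE_NAMES.get(match_id, "<unknown rule {}>".format(match_id))


print(rule_name_for((1, 16, 17)))  # class-1_0
```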
k3z