0

How do you add custom punctuation (e.g. an asterisk) to the infix list in a Tokenizer and have it recognized by nlp.explain as punctuation? I would like to take characters from my custom infix list that are not currently recognized as punctuation and add them to the punctuation list, so that the Matcher can use them when matching {'IS_PUNCT': True}.

An answer to a similar issue was provided here: "How can I add custom signs to spaCy's punctuation functionality?"

The only problem is I am unable to package the newly recognized punctuation with the model. A side note: the tokenizer already recognizes infixes with the desired punctuation, so all that is left is propagating this to the Matcher.

Vy Do
aoa4eva

1 Answer

2

The lexeme attribute IS_PUNCT is completely separate from any of the tokenizer settings. In a packaged pipeline, you'd either create a custom language (https://spacy.io/usage/linguistic-features#language-subclass) or run the customization in an nlp.before_creation callback (https://spacy.io/usage/training#custom-code-nlp-callbacks).

Be aware that modifying EnglishDefaults affects all English pipelines loaded in the same script, so the custom language option is cleaner (in particular if you're distributing this model for general use), but also slightly more work to implement.
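A minimal sketch of the custom language option, assuming spaCy v3. The language code "custom_en", the class names, and the choice of "~" as the extra character are all placeholders for illustration. The subclass copies the lexical attribute getters rather than editing EnglishDefaults in place, so stock English pipelines keep their default behavior.

```python
import spacy
from spacy.attrs import IS_PUNCT
from spacy.lang.en import English
from spacy.lang.lex_attrs import is_punct  # spaCy's default IS_PUNCT getter
from spacy.matcher import Matcher

# Assumption: "~" is the extra character that should count as punctuation.
EXTRA_PUNCT = {"~"}


def is_punct_custom(text):
    """Return True for the extra characters, else defer to spaCy's default."""
    return text in EXTRA_PUNCT or is_punct(text)


class CustomEnglishDefaults(English.Defaults):
    # Build a new dict so the stock English getters are left untouched.
    lex_attr_getters = {**English.Defaults.lex_attr_getters, IS_PUNCT: is_punct_custom}


@spacy.registry.languages("custom_en")  # hypothetical language code
class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults


nlp = CustomEnglish()
matcher = Matcher(nlp.vocab)
matcher.add("PUNCT", [[{"IS_PUNCT": True}]])

doc = nlp("Main ~ Street")
# Collect the text of every token the Matcher flags as punctuation.
punct_tokens = [doc[start:end].text for _, start, end in matcher(doc)]
print(punct_tokens)
```

To ship this with a packaged pipeline, the module that defines and registers the language has to be importable wherever the pipeline is loaded, for example by including it as custom code when packaging.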

On the other hand, if you're just using the Matcher, it might be easier to use a REGEX pattern to match the tokens you want instead of customizing IS_PUNCT.
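A rough sketch of the REGEX route, assuming a blank English pipeline and an illustrative character set of "*", "/" and "~". The pattern name "SEPARATOR" and the sample text are made up for the example. It matches any token that consists only of those characters, without touching IS_PUNCT at all.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough here: only the tokenizer is needed.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Match any single token made up entirely of the chosen characters.
# Assumption: "*", "/" and "~" are the separators of interest.
pattern = [{"TEXT": {"REGEX": r"^[*/~]+$"}}]
matcher.add("SEPARATOR", [pattern])

doc = nlp("1923 ~ Main * Street / Nashville")
# Sort by start offset so the results follow document order.
matches = sorted(matcher(doc), key=lambda m: m[1])
separators = [doc[start:end].text for _, start, end in matches]
print(separators)
```

Because the regular expression is applied to each token's text, what it can match depends on how the tokenizer splits the input, so it is worth checking the tokens for text where the separators are attached to words.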

aab
  • Thanks! This definitely answers the question asked. It looks like a custom language option will be the best option, but before going down that road is it possible to use Regex to look for *** or / or * without knowing how many will be between desired entity types? – aoa4eva Nov 04 '21 at 17:57
  • You can use a star operator, see the Matcher docs. – polm23 Nov 07 '21 at 03:27
  • @polm23 thanks. Could you please add sample code to match a use case similar to "I live at 1923~~Main~Street~Nashville TN 37011*and work at *203 South Main Street Suite #5 Chicago 60007, which are 10256 miles apart." – aoa4eva Nov 08 '21 at 22:25