
How do you modify the default spaCy (v3.0.5) tokenizer to correctly split English contractions when Unicode apostrophes (not ') are used?

import spacy

nlp = spacy.load('en_core_web_sm')

# ASCII apostrophe plus common Unicode look-alikes
apostrophes = ["'", '\u02B9', '\u02BB', '\u02BC', '\u02BD', '\u02C8', '\u02CA', '\u02CB', '\u0060', '\u00B4']
for apo in apostrophes:
    text = f"don{apo}t"
    print([t for t in nlp(text)])
>>> 
 [do, n't]
 [donʹt]
 [donʻt]
 [donʼt]
 [donʽt]
 [donˈt]
 [donˊt]
 [donˋt]
 [don`t]
 [don´t]

The desired output for all examples is [do, n't].

My best guess was to extend the default tokenizer_exceptions with all possible apostrophe variations, but this does not work, because tokenizer special cases are not allowed to modify the text.

import spacy
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.load('en_core_web_sm')

apostrophes = ['\u02B9', '\u02BB', '\u02BC', '\u02BD', '\u02C8', '\u02CA', '\u02CB', '\u0060', '\u00B4']

# Copy the default exceptions and add a variant for every Unicode apostrophe
default_rules = nlp.Defaults.tokenizer_exceptions
extended_rules = default_rules.copy()
for key, val in default_rules.items():
    if "'" in key:
        for apo in apostrophes:
            extended_rules[key.replace("'", apo)] = val

# Rebuild the tokenizer with the default affix rules and the extended exceptions
infix_re = compile_infix_regex(nlp.Defaults.infixes)
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

nlp.tokenizer = spacy.tokenizer.Tokenizer(
    nlp.vocab,
    rules=extended_rules,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
)

apostrophes = ["'", '\u02B9', '\u02BB', '\u02BC', '\u02BD', '\u02C8', '\u02CA', '\u02CB', '\u0060', '\u00B4']
for apo in apostrophes:
    text = f"don{apo}t"
    print([t for t in nlp(text)])

>>> ValueError: [E997] Tokenizer special cases are not allowed to modify the text. This would map ':`(' to ':'(' given token attributes '[{65: ":'("}]'.
gustavz

1 Answer


You just need to add an exception without changing the text.

import spacy
from spacy.attrs import ORTH, NORM

nlp = spacy.load('en_core_web_sm')

# The surface form (ORTH) keeps the backtick, so the special case does
# not modify the text; NORM records the normalized reading.
case = [{ORTH: "do"}, {ORTH: "n`t", NORM: "not"}]
nlp.tokenizer.add_special_case("don`t", case)

doc = nlp("I don`t believe in bugs")

print(list(doc))
# => [I, do, n`t, believe, in, bugs]

If you want to change the text, you should do it as a preprocessing step.
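For illustration, a minimal sketch of such a preprocessing step (the APOSTROPHES pattern and the preprocess helper are illustrative names, not part of spaCy or the original answer):

import re
import spacy

nlp = spacy.load('en_core_web_sm')

# Illustrative: map all apostrophe look-alikes to the ASCII apostrophe
# before the tokenizer sees the text
APOSTROPHES = re.compile('[\u02B9\u02BB\u02BC\u02BD\u02C8\u02CA\u02CB\u0060\u00B4]')

def preprocess(text):
    return APOSTROPHES.sub("'", text)

doc = nlp(preprocess("I don`t believe in bugs"))
print(list(doc))
# => [I, do, n't, believe, in, bugs]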

polm23
  • Yes, this works. But this apostrophe does not only exist in the word `don't`; it may appear anywhere, so I would need to write down all possible occurrences of apostrophes everywhere. That seems like quite a lot of work with this method. – gustavz Apr 26 '21 at 06:26
  • If you want something simpler, you could use a regex to replace the weird punctuation as a preprocessing step, something like `n[]t`. That wouldn't preserve the input but it seems like that probably isn't important? – polm23 Apr 26 '21 at 06:43
  • 1
    yes, i implemented the functionality now as full-string preprocessing step replacing all apostrophes before using the tokenizer. I had the wish to include this functionality in the tokenizer and not have a dedicated extra pre-processing step, but apparently this is the best way to solve it. – gustavz Apr 26 '21 at 06:50
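For completeness, the in-tokenizer approach from the question can be made to work. E997 is raised because the copied exception values still spell their ORTH attributes with the ASCII apostrophe, so the concatenated token texts no longer match the rewritten key. Rewriting the apostrophe inside each ORTH as well keeps the text unchanged. A minimal sketch, assuming the exception values are lists of attribute dicts keyed by the integer ORTH/NORM IDs (as the E997 message suggests); note the resulting tokens keep the original apostrophe (n`t rather than n't), with NORM still holding the normalized form:

import spacy
from spacy.attrs import ORTH

nlp = spacy.load('en_core_web_sm')

apostrophes = ['\u02B9', '\u02BB', '\u02BC', '\u02BD', '\u02C8', '\u02CA', '\u02CB', '\u0060', '\u00B4']

for key, val in nlp.Defaults.tokenizer_exceptions.items():
    if "'" not in key:
        continue
    for apo in apostrophes:
        # Replace the apostrophe in the surface forms (ORTH) too, so the
        # concatenated ORTH values still equal the new key and the text
        # is not modified; other attributes (e.g. NORM) are kept as-is.
        new_val = [
            {attr: (v.replace("'", apo) if attr == ORTH else v)
             for attr, v in token_attrs.items()}
            for token_attrs in val
        ]
        nlp.tokenizer.add_special_case(key.replace("'", apo), new_val)

print([t for t in nlp("don`t")])
# => [do, n`t]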