
How do you modify the default spaCy (v3.0.5) tokenizer to correctly split English contractions when Unicode apostrophes (not ') are used?

import spacy

nlp = spacy.load('en_core_web_sm')

# ASCII apostrophe plus common Unicode look-alikes
apostrophes = ["'", '\u02B9', '\u02BB', '\u02BC', '\u02BD', '\u02C8', '\u02CA', '\u02CB', '\u0060', '\u00B4']
for apo in apostrophes:
    text = f"don{apo}t"
    print([t for t in nlp(text)])
>>> 
 [do, n't]
 [donʹt]
 [donʻt]
 [donʼt]
 [donʽt]
 [donˈt]
 [donˊt]
 [donˋt]
 [don`t]
 [don´t]

The desired output for all examples is [do, n't].

My best guess was to extend the default tokenizer_exceptions with all possible apostrophe variations, but this does not work, because tokenizer special cases are not allowed to modify the text.

import spacy
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.load('en_core_web_sm')

apostrophes = ['\u02B9', '\u02BB', '\u02BC', '\u02BD', '\u02C8', '\u02CA', '\u02CB', '\u0060', '\u00B4']

# Copy the default exceptions and add a variant for every Unicode apostrophe
default_rules = nlp.Defaults.tokenizer_exceptions
extended_rules = default_rules.copy()
for key, val in default_rules.items():
    if "'" in key:
        for apo in apostrophes:
            extended_rules[key.replace("'", apo)] = val

# Rebuild the tokenizer with the default affix rules and the extended exceptions
infix_re = compile_infix_regex(nlp.Defaults.infixes)
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

nlp.tokenizer = spacy.tokenizer.Tokenizer(
    nlp.vocab,
    rules=extended_rules,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
)

apostrophes = ["'", '\u02B9', '\u02BB', '\u02BC', '\u02BD', '\u02C8', '\u02CA', '\u02CB', '\u0060', '\u00B4']
for apo in apostrophes:
    text = f"don{apo}t"
    print([t for t in nlp(text)])

>>> ValueError: [E997] Tokenizer special cases are not allowed to modify the text. This would map ':`(' to ':'(' given token attributes '[{65: ":'("}]'.
gustavz

1 Answer


You just need to add an exception without changing the text.

import spacy
from spacy.attrs import ORTH, NORM

nlp = spacy.load('en_core_web_sm')

# The surface form (ORTH) keeps the backtick, so the special case does
# not modify the text; NORM records the normalized reading.
case = [{ORTH: "do"}, {ORTH: "n`t", NORM: "not"}]
nlp.tokenizer.add_special_case("don`t", case)

doc = nlp("I don`t believe in bugs")

print(list(doc))
# => [I, do, n`t, believe, in, bugs]

If you want to change the text, you should do it as a preprocessing step.
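For illustration, a minimal sketch of such a preprocessing step (the APOSTROPHES pattern and the preprocess helper are illustrative names, not part of spaCy or the original answer):

import re
import spacy

nlp = spacy.load('en_core_web_sm')

# Illustrative: map all apostrophe look-alikes to the ASCII apostrophe
# before the tokenizer sees the text
APOSTROPHES = re.compile('[\u02B9\u02BB\u02BC\u02BD\u02C8\u02CA\u02CB\u0060\u00B4]')

def preprocess(text):
    return APOSTROPHES.sub("'", text)

doc = nlp(preprocess("I don`t believe in bugs"))
print(list(doc))
# => [I, do, n't, believe, in, bugs]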

polm23
  • Yes, this works. But this apostrophe does not only exist in the word `don't`; it may appear anywhere, so I would need to write down all possible occurrences of apostrophes everywhere. That seems like quite a lot of work with this method. – gustavz Apr 26 '21 at 06:26
  • If you want something simpler, you could use a regex to replace the weird punctuation as a preprocessing step, something like `n[]t`. That wouldn't preserve the input but it seems like that probably isn't important? – polm23 Apr 26 '21 at 06:43
  • 1
    yes, i implemented the functionality now as full-string preprocessing step replacing all apostrophes before using the tokenizer. I had the wish to include this functionality in the tokenizer and not have a dedicated extra pre-processing step, but apparently this is the best way to solve it. – gustavz Apr 26 '21 at 06:50
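For completeness, the in-tokenizer approach from the question can be made to work. E997 is raised because the copied exception values still spell their ORTH attributes with the ASCII apostrophe, so the concatenated token texts no longer match the rewritten key. Rewriting the apostrophe inside each ORTH as well keeps the text unchanged. A minimal sketch, assuming the exception values are lists of attribute dicts keyed by the integer ORTH/NORM IDs (as the E997 message suggests); note the resulting tokens keep the original apostrophe (n`t rather than n't), with NORM still holding the normalized form:

import spacy
from spacy.attrs import ORTH

nlp = spacy.load('en_core_web_sm')

apostrophes = ['\u02B9', '\u02BB', '\u02BC', '\u02BD', '\u02C8', '\u02CA', '\u02CB', '\u0060', '\u00B4']

for key, val in nlp.Defaults.tokenizer_exceptions.items():
    if "'" not in key:
        continue
    for apo in apostrophes:
        # Replace the apostrophe in the surface forms (ORTH) too, so the
        # concatenated ORTH values still equal the new key and the text
        # is not modified; other attributes (e.g. NORM) are kept as-is.
        new_val = [
            {attr: (v.replace("'", apo) if attr == ORTH else v)
             for attr, v in token_attrs.items()}
            for token_attrs in val
        ]
        nlp.tokenizer.add_special_case(key.replace("'", apo), new_val)

print([t for t in nlp("don`t")])
# => [do, n`t]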