
I have the following sentence:

'25) Figure 9:“lines are results of two-step adsorption model” -> What method/software was used for the curve fitting?'

I would like to separate the colon from the rest of the words.

By default, here is what Spacy returns:

print([w.text for w in nlp('25) Figure 9:“lines are results of two-step adsorption model” -> What method/software was used for the curve fitting?')])

['25', ')', 'Figure', '9:“lines', 'are', 'results', 'of', 'two', '-', 'step', 'adsorption', 'model', '”', '-', '>', 'What', 'method', '/', 'software', 'was', 'used', 'for', 'the', 'curve', 'fitting', '?']

What I would like to get is:

['25', ')', 'Figure', '9', ':', '“', 'lines', 'are', 'results', 'of', 'two', '-', 'step', 'adsorption', 'model', '”', '-', '>', 'What', 'method', '/', 'software', 'was', 'used', 'for', 'the', 'curve', 'fitting', '?']

I was trying to do:

# Add special case rule
from spacy.symbols import ORTH
special_case = [{ORTH: ":"}]
nlp.tokenizer.add_special_case(":", special_case)

But this has no effect; the print shows the same output as before.

Milos Cuculovic

2 Answers


Try modifying nlp.tokenizer.infix_finditer with compile_infix_regex:

import spacy
from spacy.util import compile_infix_regex

text = "'25) Figure 9:“lines are results of two-step adsorption model” -> What method/software was used for the curve fitting?'"
 
nlp = spacy.load("en_core_web_md")
# Add ":" to the default infix patterns and rebuild the infix matcher
infixes = [":"] + list(nlp.Defaults.infixes)
infix_regex = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer

doc = nlp(text)

for tok in doc:
    print(tok, end=", ")

', 25, ), Figure, 9, :, “lines, are, results, of, two, -, step, adsorption, model, ”, -, >, What, method, /, software, was, used, for, the, curve, fitting, ?, ',
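Note that `“lines` is still a single token in the output above, because infix-split pieces are not re-checked against the prefix rules. The same trick extends to the curly quote: add `“` to the infix patterns as well. A minimal sketch of that extension (my addition, not part of the answer; it uses `spacy.blank("en")` so no model download is needed, since the English tokenizer rules are the same):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")  # same default tokenizer rules as en_core_web_md

# Prepend ":" and the opening curly quote to the default infix patterns
infixes = [":", "“"] + list(nlp.Defaults.infixes)
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp('25) Figure 9:“lines are results of two-step adsorption model” '
          '-> What method/software was used for the curve fitting?')
tokens = [t.text for t in doc]
print(tokens)
```

With both characters registered as infixes, `9:“lines` comes apart into `9`, `:`, `“`, `lines`, which matches the desired list from the question.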
Sergey Bushmanov

Simply use NLTK's word_tokenize:

from nltk.tokenize import word_tokenize

# may require a one-time: nltk.download('punkt')
string_my = '25) Figure 9:“lines are results of two-step adsorption model” -> What method/software was used for the curve fitting?'
word_tokenize(string_my)

['25', ')', 'Figure', '9', ':', '“', 'lines', 'are', 'results', 'of', 'two-step', 'adsorption', 'model', '”', '-', '>', 'What', 'method/software', 'was', 'used', 'for', 'the', 'curve', 'fitting', '?']
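As the output shows, word_tokenize still keeps `two-step` and `method/software` intact. If the goal is only to display a fully split list (rather than to change spaCy's tokenizer for training), a plain regular expression produces exactly the desired tokens; a small standard-library sketch, not tied to either library:

```python
import re

s = ('25) Figure 9:“lines are results of two-step adsorption model” '
     '-> What method/software was used for the curve fitting?')

# \w+ matches runs of word characters; [^\w\s] matches each
# punctuation mark (including “ and ”) as its own token
tokens = re.findall(r'\w+|[^\w\s]', s)
print(tokens)
```

This splits `:`, `“`, `-`, and `/` into separate tokens, but note it is a display workaround only and does not update spaCy's tokenization rules.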
Teo
  • Thanks @Teo, but this is not a solution. I need to train my Spacy model and for this, the tokenization rules have to be updated. What you provided here is a workaround on how to display what I want. The display in my case is used for the example only. – Milos Cuculovic Dec 14 '20 at 13:52
  • Ok, then what you want is word_tokenize done in spacy? – Teo Dec 14 '20 at 13:57
  • spaCy has its own tokenizer, see @Sergey Bushmanov's reply. – Milos Cuculovic Dec 14 '20 at 14:00