I have created my own simple sentence segmentation rule that starts a new sentence after every newline (while keeping the default segmentation rules as well):
import spacy
nlp = spacy.load('en_core_web_sm')
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text.startswith('\n') or token.text == '\n':
            doc[token.i+1].is_sent_start = True
    return doc
nlp.add_pipe(set_custom_boundaries, before='parser')
nlp.pipe_names
This works fine in most cases, but one input has been a constant pain:
doc = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker.\n\n Please M60 6ES!\n Mobile: +44 (0)793 990 2594\nReception: +44 (0)161 296 8956\n\n')
This produces the following output, which I cannot make any sense of:
"Management is doing things right; leadership is doing the right things."
-Peter Drucker.
Please M60 6ES!
Mobile: +44 (0)793
990 2594
Reception: +44 (0)161 296 8956
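To narrow down where the extra boundary comes from, it may help to list every token together with its is_sent_start flag. The snippet below is only a minimal sketch: boundary_report is a hypothetical helper name (not part of spaCy), and it relies only on the standard Token attributes i, text and is_sent_start.

```python
def boundary_report(doc):
    """Return one (index, repr(text), is_sent_start) tuple per token.

    boundary_report is a hypothetical helper, not a spaCy function.
    repr() is used so that whitespace tokens such as '\n' stay visible.
    """
    return [(token.i, repr(token.text), token.is_sent_start) for token in doc]


# Usage with the doc from the question (assumes the pipeline defined above):
#   for row in boundary_report(doc):
#       print(row)
```

If the token after "(0)793" is flagged as a sentence start even though the token before it is not a newline, the boundary was presumably not set by set_custom_boundaries but by a later pipeline component.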
I would expect the mobile number to be a single sentence (like the reception number), like this:
"Management is doing things right; leadership is doing the right things."
-Peter Drucker.
Please M60 6ES!
Mobile: +44 (0)793 990 2594
Reception: +44 (0)161 296 8956
But no matter what I try, "990 2594" won't join up with "+44 (0)793". Is this because of some default spaCy rule?
Can you please help?