Sentence segmentation rule not working as expected

Question

I wrote a simple custom sentence segmentation rule that starts a new sentence after each newline, while keeping spaCy's default segmentation rules as well:

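import spacy

nlp = spacy.load('en_core_web_sm')

def set_custom_boundaries(doc):
    # Mark the token that follows any newline token as a sentence start.
    for token in doc[:-1]:
        # startswith('\n') also covers the case token.text == '\n'
        if token.text.startswith('\n'):
            doc[token.i+1].is_sent_start = True
    return doc

# Run the rule before the parser so the parser sees the preset boundaries.
nlp.add_pipe(set_custom_boundaries, before='parser')
nlp.pipe_names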

This works fine in most cases, but one input keeps causing trouble:

doc = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker.\n\n Please M60 6ES!\n Mobile: +44 (0)793 990 2594\nReception: +44 (0)161 296 8956\n\n')

This produces the following output, which I cannot make sense of:

"Management is doing things right; leadership is doing the right things."
-Peter Drucker.


Please M60 6ES!

Mobile: +44 (0)793
990 2594

Reception: +44 (0)161 296 8956

I would expect the mobile number to be a single sentence, like the reception number:

"Management is doing things right; leadership is doing the right things."
-Peter Drucker.


Please M60 6ES!

Mobile: +44 (0)793 990 2594


Reception: +44 (0)161 296 8956

But no matter what I try, "990 2594" won't stay joined to "+44 (0)793". Is this because of some default spaCy rule?

Can you please help?
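One possible explanation, offered as an assumption rather than a confirmed diagnosis: the component above only ever sets is_sent_start to True and leaves every other token undecided (None), so the statistical parser that runs afterwards is still free to add boundaries of its own. The sketch below is a hypothetical variant, set_boundaries_keep_numbers, that also sets is_sent_start to False between two adjacent digit groups. It assumes that the parser respects a preset False, which is worth checking against the installed spaCy version. The stand-in tokens and their tokenization are made up so the rule can be tried without loading a model.

```python
from types import SimpleNamespace


def set_boundaries_keep_numbers(doc):
    """Hypothetical variant of set_custom_boundaries.

    Starts a sentence after each newline token and forbids a sentence
    start between two adjacent digit groups (e.g. '793' and '990').
    """
    for token in doc[:-1]:
        nxt = doc[token.i + 1]
        if token.text.startswith('\n'):
            # Same rule as the original component.
            nxt.is_sent_start = True
        elif token.text[-1:].isdigit() and nxt.text.isdigit():
            # Assumption: a preset False stops the parser from splitting here.
            nxt.is_sent_start = False
    return doc


# Stand-in tokens (assumed tokenization, not real spaCy output).
texts = ['Mobile', ':', '+44', '(', '0', ')', '793', '990', '2594',
         '\n', 'Reception', ':']
doc = [SimpleNamespace(i=i, text=t, is_sent_start=None)
       for i, t in enumerate(texts)]
set_boundaries_keep_numbers(doc)

for token in doc:
    print(repr(token.text), token.is_sent_start)
```

With spaCy 2.x, the function would be registered the same way as the original, with nlp.add_pipe(set_boundaries_keep_numbers, before='parser'). Tokens left as None remain under the parser's control, so the default segmentation elsewhere is kept.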

I can't reproduce this. When I execute exactly this code on spaCy 2.2.4, "Mobile: +44 (0)793 990 2594" is kept as one sentence. Do you perhaps have some hidden Unicode character in your string, from copy-pasting it from somewhere? - Sofie VL, Jun 01 '20 at 16:20
@SofieVL I guess so. The data was exported from a CSV and read into a dataframe. Earlier, the CSV was in some other encoding (Western European, I guess), but I saved it as UTF-8 before reading. (It works fine if I put \n twice at the end of `Mobile: +44 (0)793 990 2594\n`.) - Avneet Singh, Jun 01 '20 at 18:57
@SofieVL I managed to get the expected result by using 'en_core_web_lg'. The 'sm' model behaved strangely, though: if I changed 793 to 792, 794, 797 or 798, it worked just fine, but with 790 it moved everything from (0) to the end of the phone number into the next sentence. Weird. - Avneet Singh, Jun 01 '20 at 19:47
Hm, that makes no sense to me, but there could always be some very strange modeling behaviour that crept into the trained parser. Have you tried upgrading your models? I didn't have any issue with en_core_web_sm. - Sofie VL, Jun 01 '20 at 21:23
@SofieVL Well, now I think I need to upgrade them, as it is behaving pretty weirdly in some other parts of the code as well. Thank you for your help! - Avneet Singh, Jun 02 '20 at 10:17
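
The hidden-character hypothesis raised in the comments can be checked without spaCy. The following is a minimal sketch using only the standard library. find_hidden_chars is a hypothetical helper name, and the no-break space in the second string is an invented example, not something observed in the original data.

```python
import unicodedata


def find_hidden_chars(text):
    """Return (index, code point, name) for characters that are easy to miss.

    Flags non-ASCII characters and control/format characters,
    except for ordinary newlines and tabs.
    """
    hits = []
    for i, ch in enumerate(text):
        if ch in '\n\t':
            continue
        if ord(ch) > 127 or unicodedata.category(ch) in ('Cc', 'Cf'):
            name = unicodedata.name(ch, 'UNKNOWN')
            hits.append((i, 'U+%04X' % ord(ch), name))
    return hits


clean = 'Mobile: +44 (0)793 990 2594\n'
# Invented example: a no-break space instead of a normal space.
suspect = 'Mobile: +44 (0)793\u00a0990 2594\n'

print(find_hidden_chars(clean))    # []
print(find_hidden_chars(suspect))  # [(18, 'U+00A0', 'NO-BREAK SPACE')]
```

Running the helper on the exact string read from the dataframe, rather than on a copy typed into the script, would show whether the CSV export introduced anything unexpected.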
