Best way to split text into sentences avoiding acronym clashes

Question

Given the following phrase

Ms. Mary got to know her husband Mr. Dave in her trip to U.S.A. and it was cool. Did you know Dave worked for Microsoft? Well he did. He was even part of Internet Explorer devs.

What is the best "pseudo-code" way to split it into sentences? Python or any other similar language is also fine because of its pseudo-code resemblance.

My idea is to replace every occurrence of " a-zA-Z." (notice the leading space), ".a-zA-Z" and ".a-zA-Z." with its equivalent without the dot, so for example

" a."
" b."
" c."
" d."
" e."
" f."
...

and

".a."
".b."
".c."
".d."
".e."
".f."
...

and

" ab."
" ac."
" ad."
...
" ba."
" bc."
" bd."
...

The phrase should be nicely converted to the following

Ms Mary got to know her husband Mr Dave in her trip to USA and it was cool. Did you know Dave worked for Microsoft? Well he did. He was even part of Internet Explorer devs.

...or am I wrong somewhere, and is my logic flawed?

For any future "what's your question?" comments: I need to know the best way to split the example text into correct sentences while avoiding clashes with acronyms.

This can be explained in pseudo-code, Python, or any other language that resembles pseudo-code. I want it to be language agnostic so it can be implemented by anyone, regardless of the language they use.

What would you suggest for "I made a trip to the U.S.A. It was cool."? — Jongware, Sep 06 '14 at 19:01
Ultimately, regular language cannot be parsed this easily. Consider `To get his B.Sc. Ed had to study day and night.` versus `It was not easy to get his B.Sc. Ed had to study day and night.` — Jongware, Sep 08 '14 at 14:13
@AbuMusabBinZarqawi Exactly; Asking for pseudocode in the first place is a *strong* indicator it's way too broad - you don't even have a language in mind. You've also made no apparent attempt, and your question amounts to "I need to know what's the best way to split the example text into correct sentences avoiding clashes with acronyms." Not only too broad; but perhaps also primarily opinion based. But, *you know these things very well already*. — Andrew Barber, Sep 09 '14 at 15:04

Jongware · Accepted Answer · 2014-09-07T23:57:43.233

All abbreviations and acronyms in the example follow one of two patterns: an uppercase letter followed by a full stop, or an uppercase letter, one lowercase letter, and a full stop. None of the other, regular occurrences of the full stop match either pattern.

So a simple RegEx can be used to remove those full stops. What's left after that can be split on the regular punctuation marks ".", "!" and "?". In JavaScript:

str2 = str.replace(/([A-Z][a-z]?)\./g, '$1');

or, using a GREP flavor that understands the shorthand character classes \u (uppercase) and \l (lowercase); this notation is not valid in JavaScript itself:

str2 = str.replace(/(\u\l?)\./g, '$1');

This directly produces the output shown in the question.
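Since the question mentions Python, the same remove-then-split idea might look like the sketch below. The function names `strip_abbreviation_dots` and `split_sentences` are made up for illustration, and the split step assumes that sentences are separated by whitespace.

```python
import re

def strip_abbreviation_dots(text):
    # Drop a full stop that directly follows an uppercase letter,
    # optionally followed by one lowercase letter ("U.", "Ms.", "Mr.").
    return re.sub(r"([A-Z][a-z]?)\.", r"\1", text)

def split_sentences(text):
    # Split on whitespace that follows ., ! or ?; the lookbehind keeps
    # the punctuation mark attached to its sentence.
    return re.split(r"(?<=[.!?])\s+", strip_abbreviation_dots(text))

sample = ("Ms. Mary got to know her husband Mr. Dave in her trip to U.S.A. "
          "and it was cool. Did you know Dave worked for Microsoft? "
          "Well he did. He was even part of Internet Explorer devs.")

for sentence in split_sentences(sample):
    print(sentence)
```

On the sample text this yields four sentences, the first one reading "Ms Mary got to know her husband Mr Dave in her trip to USA and it was cool."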

Using a RegEx is straightforward (and easily expanded!), but the same pattern can be tested in other languages as well. In C, you can copy input to output and test only when seeing the . character:

#include <ctype.h>
#include <stdio.h>

int main (void)
{
    char input[] = "Ms. Mary got to know her husband Mr. Dave in her trip to "
       "U.S.A. and it was cool. Did you know Dave worked for Microsoft? Well "
       "he did. He was even part of Internet Explorer devs.";
    /* the output is never longer than the input */
    char output[sizeof input], *readptr, *writeptr;

    printf ("in: %s\n", input);

    readptr = input;
    writeptr = output;
    while (*readptr)
    {
        if (*readptr == '.')
        {
            /* skip a full stop after "Uppercase" or "Uppercase lowercase" */
            if ((readptr > input && isupper((unsigned char)readptr[-1])) ||
                (readptr > input+1 && isupper((unsigned char)readptr[-2]) &&
                 islower((unsigned char)readptr[-1])))
            {
                readptr++;
                continue;
            }
        }
        /* everything else is copied unchanged */
        *writeptr = *readptr;
        readptr++;
        writeptr++;
    }

    *writeptr = 0;
    printf ("out: %s\n", output);

    return 0;
}

These solutions remove full stops from the source text. If you want to keep them, you can replace them with a placeholder (for example, a character that does not normally occur in the source text), or do the reverse: when splitting on sentences, test to see whether or not a full stop is a valid breaking point.
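A minimal Python sketch of the placeholder variant: the matching full stops are swapped for a marker before splitting and restored afterwards. `PLACEHOLDER` and `split_keeping_dots` are hypothetical names, and the NUL character is only assumed never to occur in the source text.

```python
import re

# Assumption: this character never occurs in the source text.
PLACEHOLDER = "\x00"

def split_keeping_dots(text):
    # Hide full stops that follow "Uppercase" or "Uppercase lowercase".
    masked = re.sub(r"([A-Z][a-z]?)\.",
                    lambda m: m.group(1) + PLACEHOLDER,
                    text)
    # Split on whitespace after the remaining sentence punctuation.
    parts = re.split(r"(?<=[.!?])\s+", masked)
    # Put the hidden full stops back.
    return [part.replace(PLACEHOLDER, ".") for part in parts]

sample = ("Ms. Mary got to know her husband Mr. Dave in her trip to U.S.A. "
          "and it was cool. Did you know Dave worked for Microsoft? "
          "Well he did. He was even part of Internet Explorer devs.")

for sentence in split_keeping_dots(sample):
    print(sentence)
```

The sentences come out with "Ms.", "Mr." and "U.S.A." intact, so joining them with single spaces gives back the original text.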

Afterthought: it does work on the original sample sentence... but it does not on the one in the comments:

I made a trip to the U.S.A. It was cool.I liked it very much.

where you get the output

I made a trip to the USA It was cool.I liked it very much.

This requires checking for more possible scenarios:

common abbreviations, such as Ms. and Mr.: \u\l\., where the full stop needs removing;
in-sentence acronyms; "U.S.A." followed by a lowercase: (\u\.)+ (?=\l), where every full stop needs removing;
end-of-sentence acronyms; "U.S.A." followed by an uppercase: (\u\.)+ (?=\u), where the last full stop should remain.
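A rough Python sketch of these three checks follows. It rests on a few assumptions that are not part of the patterns above: a word boundary (`\b`) is added in front of each pattern, an acronym that is not followed by whitespace and a lowercase letter is treated as sentence-final, and the names `normalize` and `split_sentences` are invented for illustration. The split relies on empty matches, which `re.split` supports from Python 3.7 on.

```python
import re

ABBREVIATION = re.compile(r"\b([A-Z][a-z])\.")  # Ms., Mr.
ACRONYM = re.compile(r"\b(?:[A-Z]\.)+")         # U.S.A.

def _strip_acronym(match):
    letters = match.group(0).replace(".", "")
    rest = match.string[match.end():]
    # In-sentence acronym: whitespace and a lowercase letter follow.
    if re.match(r"\s+[a-z]", rest):
        return letters
    # Assumption: anything else ends the sentence, so keep one full stop.
    return letters + "."

def normalize(text):
    text = ABBREVIATION.sub(r"\1", text)
    return ACRONYM.sub(_strip_acronym, text)

def split_sentences(text):
    # Break after ., ! or ? when an uppercase letter follows,
    # with or without whitespace in between (Python 3.7+).
    return re.split(r"(?<=[.!?])\s*(?=[A-Z])", normalize(text))

print(split_sentences(
    "I made a trip to the U.S.A. It was cool.I liked it very much."))
```

For the sentence from the comments this gives three sentences, with "USA." keeping its final full stop. It still cannot tell an end-of-sentence acronym from one followed by a proper noun, so ambiguous input remains ambiguous.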
