I have a problem: I want to split a text into sentences using the full stop (.).

For instance:

Mr. Bean is a British comedy television series of 14 half-hour episodes starring Rowan Atkinson as the title character. Different episodes were written by Atkinson, Robin Driscoll, Richard Curtis and one by Ben Elton.

If I split the above text, I get 3 sentences:

1. Mr.

2. Bean is a British comedy television series of 14 half-hour episodes starring Rowan Atkinson as the title character.

3. Different episodes were written by Atkinson, Robin Driscoll, Richard Curtis and one by Ben Elton.


I want Mr. to stay attached to the sentence that follows it, so that the text splits into two sentences, not three:

1. Mr. Bean is a British comedy television series of 14 half-hour episodes starring Rowan Atkinson as the title character.

2. Different episodes were written by Atkinson, Robin Driscoll, Richard Curtis and one by Ben Elton.

Kindly help me. I appreciate the instant feedback from the community.

Thanks.

Rais Hussain

3 Answers

If you are looking for a way to avoid splitting sentences after an abbreviation (like a.m.), that's a difficult natural language problem.

If you just want to split sentences without worrying about Mr. or Mrs. (and have a character that won't likely show up in the text, like *), here's a simple way:

  1. replace all instances of Mr. and Mrs. with Mr* and Mrs*
  2. split text on .
  3. in the resulting array, replace all instances of Mr* and Mrs* with Mr. and Mrs.

Here's a version that uses NUL as a sentinel character, as it's pretty much impossible for it to show up in text unintentionally:

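using System;
using System.Collections.Generic;
using System.Linq;

static IEnumerable<string> Splitter(string sentences)
{
    // NUL stands in for the dot of each title so the split ignores it.
    char sentinel = '\0';
    return sentences.Replace("Mr.", "Mr" + sentinel)
        .Replace("Mrs.", "Mrs" + sentinel)
        // Split on a dot followed by a space; the separator itself is dropped.
        .Split(new[] { ". " }, StringSplitOptions.None)
        // Restore the original dots.
        .Select(s => s.Replace("Mr" + sentinel, "Mr.")
                        .Replace("Mrs" + sentinel, "Mrs."));
}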

If you're the paranoid sort of person who thinks any particular character is liable to show up in your text, feel free to use a GUID for the sentinel.

Gabe
  • The source text "Mr. Mr* Bean" would then be turned into "Mr. Mr. Bean" – Gareth Mar 16 '11 at 14:08
  • @Gareth: I told the OP to choose a character that's not likely to show up in his text, like `*`. If `Mr*` is likely to show up in his text, he *shouldn't* use `*` as the character! – Gabe Mar 16 '11 at 14:14
  • @Gabe - By choosing a less obvious character you're just making sure the problem doesn't show up until you're totally not expecting it – Gareth Mar 16 '11 at 14:28
  • @Gareth: Just choose NUL as the character if you're worried about it. Or are you worried that a NUL is going to appear amidst some random English sentences? – Gabe Mar 16 '11 at 14:56

The only way I can think of right now is to add intelligence to the split function, so that it knows when to treat the . as a delimiter and when not to.

You can do this like:

  1. Replace all occurrences of <dot> by <dot><dot>.
  2. Replace all Mr. (and other entries in the dictionary) by Mr<dot>.
  3. Split the text using the remaining dots.
  4. Replace all Mr<dot> (and other...) by Mr.
  5. Replace all occurrences of <dot><dot> by <dot>.

Of course you can use another escape character/string.

You can keep a dictionary of translations, preferably in a file, so you can use a different dictionary for different languages.
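
The dictionary idea above could be sketched roughly as follows. This is only an illustration under some assumptions: the class name `AbbreviationSplitter` and its `Split` method are made up for this example, the escape is a NUL character rather than the `<dot>` string, and steps 1 and 5 (escaping the escape itself) are skipped on the assumption that NUL never appears in the input. Matching is case-sensitive and does not check word boundaries.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class AbbreviationSplitter
{
    // Hypothetical escape character; assumed never to occur in the input.
    private const char Escape = '\0';

    // 'abbreviations' plays the role of the dictionary. It could be loaded
    // from a per-language file; entries are assumed to be non-empty.
    public static List<string> Split(string text, IEnumerable<string> abbreviations)
    {
        // Longest entries first, so a longer abbreviation is protected
        // before any shorter one that overlaps it.
        var entries = abbreviations.OrderByDescending(a => a.Length).ToList();

        // Hide the dots inside each known abbreviation.
        foreach (var entry in entries)
            text = text.Replace(entry, entry.Replace('.', Escape));

        // Split on the remaining dots, then put the hidden dots back.
        // Note: the sentence-ending dots are consumed by the split.
        return text.Split('.')
                   .Select(part => part.Replace(Escape, '.').Trim())
                   .Where(part => part.Length > 0)
                   .ToList();
    }
}
```

Calling `AbbreviationSplitter.Split(text, new[] { "Mr.", "Mrs." })` on the Mr. Bean example should give two pieces, the first still starting with "Mr. Bean". Adding a language then only means supplying a different list.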

Daniel Hilgarth
Toon Krijthe
  • This still has an edge case similar to that found in Gabe's answer. The edge case only applies when you have a delimiter directly to the right of a Mr./Mrs. For example, `MrText`, which will become `Mr[split]Text` instead of `Mr[split]Text`. So, the error window is narrower, but not gone. Of course, if your splitter function (with `` as a delimiter) always splits `` as `[split]` instead of `[split]`, this issue would be resolved. This could be done using `Regex.Split`...but with `Regex.Split` none of this fanciness is necessary :) – Brian Mar 16 '11 at 20:03
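using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static IEnumerable<string> Splitter(string sentences)
{
    // Split on every dot that is not immediately preceded by "mr" or "mrs".
    foreach (string s in 
        Regex.Split(sentences, "(?<!((mr)|(mrs)))\\.", RegexOptions.IgnoreCase))
    {
        // The split consumes the dot, so re-append it to each non-empty piece.
        if (!String.IsNullOrWhiteSpace(s)) yield return s.Trim() + ".";
    }
}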

A simple regex-based answer using negative look-behind.

Brian
  • But what happens if you split the following sentence? ["New Zealand’s second largest city has had to be closed immediately due to the 6.3 magnitude earthquake on February 22, 2011."] It would be broken into two sentences if I used your method to split the text. Your effort is appreciated. – Rais Hussain Mar 17 '11 at 04:52
  • @Rais: Every method posted suffers from that problem. As Gabe mentioned, that is a difficult natural language problem. It is a *very* hard problem. A simple and mostly reasonable approach would be to change the regex to split on `\\.\\s` instead of `\\.` , but it wouldn't be 100% accurate. – Brian Mar 17 '11 at 05:34
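
The `\.\s` variant mentioned in the last comment might look something like the sketch below. The class name `WhitespaceAwareSplitter` is invented for this example, and the pattern is a slightly tightened form of the look-behind (non-capturing, with a word boundary), not the exact regex from the answer. It leaves decimals such as 6.3 alone because the dot there is not followed by whitespace, but it is still not 100% accurate: any abbreviation other than Mr./Mrs. followed by a space will still cause a split.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class WhitespaceAwareSplitter
{
    // Split only on a dot that is followed by whitespace and is not
    // preceded by the word "mr" or "mrs" (case-insensitive).
    private static readonly Regex Boundary =
        new Regex(@"(?<!\bmrs?)\.\s+", RegexOptions.IgnoreCase);

    public static List<string> Split(string text)
    {
        var result = new List<string>();
        foreach (string piece in Boundary.Split(text))
        {
            string sentence = piece.Trim();
            if (sentence.Length == 0) continue;

            // The split consumes the dot, so add it back unless the piece
            // (typically the last one) still ends with its own dot.
            if (!sentence.EndsWith(".", StringComparison.Ordinal)) sentence += ".";
            result.Add(sentence);
        }
        return result;
    }
}
```

With this sketch the earthquake sentence should stay in one piece, and the Mr. Bean text should still come out as two sentences.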