0

I came across a problem.

I want to split a string after every ". "

For example, if I have the sentences:

"Danny went to school. it was wonderful. "

I want my output to be:

Danny went to school.

it was wonderful.

which I can easily do like this:

string[] list = currentResult.Split(new string[] { ". " }, StringSplitOptions.None);

BUT!

What if I have, for example:

  1. Danny went to School. and : 2. James went to school as well.

My output will be:

1.

Danny went to School. and :

2.

James went to school as well

.

I don't want it to split when there is a number before the dot, for example. Can I solve this somehow?

Thanks!

anouar.bagari
thormayer
  • 2
    what do you want the output to be? I can't quite follow what you want in the numeric case. – Woot4Moo May 01 '13 at 13:02
  • You need to get to grips with markdown formatting as the formatting is seriously impairing the quality of this question. http://stackoverflow.com/editing-help – spender May 01 '13 at 13:03
  • @Woot4Moo: He wants it like on one line: "1. Danny went to School. and :" and then on the next "2. James went to school as well." –  May 01 '13 at 13:03
  • @0A0D Ah ok, maybe it was the formatting that was killing me. – Woot4Moo May 01 '13 at 13:04
  • Do you want to split the string only when there is a letter before the period? It's a simple regex split either way. – Chris May 01 '13 at 13:05
  • 4
    Supposing the sentence "And the number was 2." crops up? – spender May 01 '13 at 13:05
  • 1
    In your second example, why is there no split on `School. and`? – alexn May 01 '13 at 13:08
  • What is the nature of the string list? Where is it coming from? Maybe there is a better solution available than splitting the string at all, due to the case @spender mentioned. – Chris May 01 '13 at 13:08
  • yes. basically every ending sentence.. – thormayer May 01 '13 at 13:08
  • Why not just find the substring of a number plus a dot and go from there? –  May 01 '13 at 13:09
  • 2
    I think you cannot solve this problem properly. Sometimes `.` **is** actual data and sometimes it's used to indicate **where to split** the data. I think the data has to be prepared in some way that you can differentiate between these two cases programmatically. – Michael Schnerring May 01 '13 at 13:10
  • I have articles that I want to break into separate lines, but some of the articles have things like numeric counts and such. – thormayer May 01 '13 at 13:10
  • Definitely a regex issue. Capture only when the period does not follow a number. Split on the indices of the periods that do not follow numbers. There are, of course, ways to code this but they will look bloated compared to just using a regular expression. – emd May 01 '13 at 13:13
  • What about "I watched Mr. Bean last night."? – mbeckish May 01 '13 at 13:13
  • possible duplicate of [How do you parse a paragraph of text into sentences? (perferrably in Ruby)](http://stackoverflow.com/questions/860809/how-do-you-parse-a-paragraph-of-text-into-sentences-perferrably-in-ruby) – mbeckish May 01 '13 at 13:16
  • @mbeckish you are correct. That's another problem to solve. Can I overload Regex Split? – thormayer May 01 '13 at 13:16
  • 2
    @thormayer - Read my link - you won't catch all the cases by doing this from scratch. Find a library built by people who did research on the subject. – mbeckish May 01 '13 at 13:18
  • The answer is not regular expressions. What about sentences that aren't terminated with a full stop, non sentences, all possible alternative uses of `"."` i.e. abbreviating? What about foreign langauges and cultures? Spilleng mistakes etc. There is no perfect solution. Some effort has already been made http://stackoverflow.com/questions/8468874/nlp-framework-for-net – Jodrell May 01 '13 at 14:15

3 Answers

1

The problem here is how to deal with oddly formatted data. If you have control over your data, you might consider using 1) and 2) instead of 1. and 2. If you don't, you will probably have to resort to a regex to discern whether a . is part of a line or the end of one, because that distinction is beyond the capabilities of String.Split.
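
As a rough illustration of the regex route, here is a minimal sketch that uses Regex.Split with a lookbehind so the split only happens after a period that does not follow a digit. The class and method names (SentenceSplitter, SplitSentences) are made up for this example, and the rule itself is only an assumption about what the asker wants. It will not split after a sentence such as "And the number was 2." and it knows nothing about abbreviations like "Mr."

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class SentenceSplitter
{
    // Splits on whitespace that follows a period, unless the character
    // before the period is a digit (so "1." and "2." stay attached).
    public static string[] SplitSentences(string text)
    {
        // (?<=\D\.) is a lookbehind: a non-digit followed by a literal '.'.
        // A lookbehind consumes nothing, so the period stays with its sentence.
        return Regex.Split(text.Trim(), @"(?<=\D\.)\s+");
    }
}
```

Calling SentenceSplitter.SplitSentences("1. Danny went to School. and : 2. James went to school as well.") returns two strings: "1. Danny went to School." and "and : 2. James went to school as well."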

Izzy
1

You could always go character by character, and do something like:

NOTE: Untested, but looks right :)

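// Requires: using System.Collections.Generic;
// str is the input text to split.
List<string> strings = new List<string>();
int curStart = 0;
for (int index = 1; index < str.Length; index++) {
    // Split on a '.' unless the character before it is a digit (e.g. "1.").
    if (str[index] == '.' && !char.IsDigit(str[index - 1])) {
        // Keep the period with its sentence and trim surrounding whitespace.
        strings.Add(str.Substring(curStart, index - curStart + 1).Trim());
        curStart = index + 1;
    }
}
// Keep any trailing text that does not end in a period.
string rest = curStart < str.Length ? str.Substring(curStart).Trim() : "";
if (rest.Length > 0) {
    strings.Add(rest);
}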
Jeremy Boyd
0

I thought I'd take a stab at producing an answer that matches what you asked, although the comments make a lot of sense in the larger scope of what you want.

Find out how to use regex in C# code from: http://www.dotnetperls.com/regex-matches

I used http://regexpal.com/ to confirm my regex. Play around with that or a similar page to get a handle on regex. It's worth knowing how to regex.

Look at http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet or someplace else for a list of the commands and definitions for regex.

The regex ".*?\D[.:]\s" will turn the string:

1. Danny went to School. and : 2. James went to school as well. Danny went to school. it was wonderful. 

into the following matches (separated here by new lines):

1. Danny went to School. 
and : 
2. James went to school as well. 
Danny went to school. 
it was wonderful. 

Note that I took the liberty to separate matches based on ':' as well since your example does so.
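
In C#, collecting those matches could look roughly like the sketch below. It uses Regex.Matches with the pattern from this answer, written here with the character class as [.:]. The class and method names (SentenceMatcher, MatchSentences) are invented for the example. Note that the pattern requires whitespace after the final period or colon, so input without a trailing space will lose its last sentence.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class SentenceMatcher
{
    // Returns every lazy match that ends in a non-digit, then '.' or ':',
    // then one whitespace character. Each match is trimmed.
    public static List<string> MatchSentences(string text)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(text, @".*?\D[.:]\s"))
        {
            results.Add(m.Value.Trim());
        }
        return results;
    }
}
```

Passing the example string above (including its trailing space) to SentenceMatcher.MatchSentences yields the five pieces listed in this answer.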

amalgamate