
I have a string that is a fragment of a book (about one chapter), and it is all on one line. I would like to insert a newline at the end of each sentence.

I solved it with this not-so-sophisticated code:

text = text.replaceAll("\\.","\\.\n"); //same for ? same for !

Of course, this does not yield very nice results. I don't need this to be perfect, but the nicer I can get it the better.

At a minimum, I would like to check the following before inserting a newline character:

the word before the . is longer than 2 characters
there are no other dots before the . in the same "word"
the character before the . is not a digit
the character after the . (skipping any whitespace that follows the dot) is not a (
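
A minimal sketch of how these four checks could be applied in Java, assuming a candidate boundary is a `.`, `?` or `!` that is followed by whitespace. The names `SentenceBreaker` and `breakSentences` are made up for illustration. The checks are only the heuristics listed above, so a sentence that ends in an abbreviation or a short word is still not split.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceBreaker {
    // Group 1: the "word" (run of non-whitespace) before the punctuation.
    // Group 2: the punctuation mark itself.
    // The lookahead guarantees that some non-whitespace character follows.
    private static final Pattern CANDIDATE =
            Pattern.compile("(\\S+)([.?!])\\s+(?=\\S)");

    static String breakSentences(String text) {
        Matcher m = CANDIDATE.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0; // start of the text not yet copied to out
        while (m.find()) {
            String word = m.group(1);
            char next = text.charAt(m.end()); // exists because of the lookahead
            boolean isBoundary =
                    word.length() > 2                                      // rule 1
                    && word.indexOf('.') < 0                               // rule 2
                    && !Character.isDigit(word.charAt(word.length() - 1)) // rule 3
                    && next != '(';                                        // rule 4
            if (isBoundary) {
                // Copy up to and including the punctuation, then replace
                // the whitespace that followed it with a single newline.
                out.append(text, last, m.end(2)).append('\n');
                last = m.end();
            }
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(breakSentences(
                "Dr. Smith went home. He slept well! Did he dream? Nobody knows."));
    }
}
```

Doing the checks in plain Java after a simple match keeps each rule readable and easy to change, at the cost of not being a single `replaceAll` call.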

Any other suggestions would be really appreciated, along with actual code which will make it happen.

Similar question: Here

Update:

This is not high on my list of priorities, because my book doesn't contain many direct quotations or direct speeches, but a rule that handles sentences inside those would also be welcome, so that sentences from the same quote don't end up on separate lines.

Xitcod13
  • Do none of your sentences start with short words then? I would expect both of the sentences in this comment to count as sentences, but neither of them start with a word longer than two characters. – Jon Skeet May 17 '12 at 15:53
  • (Additionally, consider questions ending in question marks, and also speeches where the period may be followed by a double quote.) – Jon Skeet May 17 '12 at 15:54
  • How are you going to handle all the abbreviations, direct speeches or ellipses? For example, the sentence: 'Dr. Smith asked: "How are you?", but I didn't answer... for now.' – Jakub Zaverka May 17 '12 at 15:55
  • Thank you, that was a typo. I meant before the dot, not after. (I know this makes some sentences not work, but most of them do not end in a word of 2 characters or fewer.) – Xitcod13 May 17 '12 at 15:55
  • This thread looks promising: http://stackoverflow.com/questions/4373612/how-to-parse-text-into-sentences-in-java – SirPentor May 17 '12 at 15:57
  • @Xitcode: It's unclear where that restriction has come from though... if it's to avoid abbreviations, then it helps with some but not all... it would really help if you'd give the *reasons* for the suggested rules. – Jon Skeet May 17 '12 at 15:59
  • I just did a search on my string and there are no ellipses (although thank you for mentioning them). I try to handle abbreviations by making sure that the word is longer than 2 characters and that there are no additional dots in the "word"; for example, N.A.S.A. has 3 additional dots. (Sadly, that means I won't make a new line if a sentence ends in an abbreviation.) And there aren't many direct speeches in my book (thank goodness). But these are all very awesome suggestions; I will add them to my question just to see what people answer ^^ – Xitcod13 May 17 '12 at 16:00

3 Answers

3

Stanford's CoreNLP toolkit has a class that does sentence segmentation. See more here.

If you say new DocumentPreprocessor(new StringReader(s)).iterator() where s is a string containing the text, it will give you back an iterator of sentences.

Note that this will tokenize the sentence as well. If you want the sentence to look the way it started, you can either just use this output as a guide for splitting, or run the PTBTokenizer -untok command (see same link as above) to make each tokenized sentence look normal again.

This will almost certainly work better than your list of rules since your rules don't account for many of the important cases.

dhg
  • Thanks. If I download Stanford CoreNLP version 1.3.1, it will contain the Stanford English Tokenizer, right? I'm downloading it right now, and I don't want to download the wrong file; it's 250 MB. – Xitcod13 May 18 '12 at 03:31
  • Alright, as soon as I get it working I'll accept your answer. I just want to see how good it is :) – Xitcod13 May 18 '12 at 03:37
1

If I understood your requirements correctly, try something like this:

text = text.replaceAll("[^\\.]{1,}\\D\\.\\s?[^\\(]","\\.\n");
elias
  • Could you explain your code? Does it actually check for what I specified? – Xitcod13 May 17 '12 at 16:20
  • `[^\\.]{1,}\\D` matches one or more characters other than a dot, followed by any one character that is not a digit. `\\s?[^\\(]` matches an optional whitespace character, followed by any character except a `(` – elias May 17 '12 at 16:27
  • This turns my whole string into just periods. I had this problem before I escaped the period character with \\. but this seems to have that already... I don't know what the problem is. Any suggestions? – Xitcod13 May 17 '12 at 17:23
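
A likely explanation for the "just periods" result: `replaceAll` replaces the entire match, and the pattern above matches the sentence text before the dot as well as the first character after it, so all of that is overwritten by `.` and a newline. The sketch below is one possible way around this, not the answerer's code: it captures the text to keep in a group and writes it back with `$1`. It also makes `\\s?` possessive and uses a lookahead, because with a plain `\\s?` the space itself can stand in for the "not a `(`" character. The names `KeepMatchedText` and `addLineBreaks` are made up for illustration, and only the digit and `(` checks from the question are covered.

```java
class KeepMatchedText {
    static String addLineBreaks(String text) {
        // Group 1: sentence text up to and including the dot, where the
        //          character just before the dot is not a digit.
        // \s?+   : optional whitespace, possessive so it is never given back.
        // (?=[^(]): the next character must exist and must not be '('.
        return text.replaceAll("([^.]+\\D\\.)\\s?+(?=[^(])", "$1\n");
    }

    public static void main(String[] args) {
        System.out.println(addLineBreaks("First one. Second one. Third"));
    }
}
```

Because the lookahead does not consume the next character, nothing after the dot is lost, and only the optional whitespace is replaced by the newline.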
0
String newline = System.getProperty("line.separator"); // platform-specific line separator
yourLine = yourLine + newline; // yourLine is an existing String; String has no append(), so concatenate
William Kinaan