Split text by sentence but not by special patterns

Question

This is my sample text:

text = "First sentence. This is a second sentence. I like pets e.g. cats or birds."

I have a function which splits texts by sentence

library(stringi)
split_by_sentence <- function (text) {

  # split based on periods, exclams or question marks
  result <- unlist(strsplit(text, "\\.\\s|\\?|!") )

  result <- stri_trim_both(result)
  result <- result [nchar (result) > 0]

  if (length (result) == 0)
    result <- ""

  return (result)
}

which actually splits by punctuation characters. This is the output:

> split_by_sentence(text)
[1] "First sentence"            "This is a second sentence" "I like pets e.g"           "cats or birds."

Is there a possibility to exclude special patterns like "e.g."?

Thank you, but your solution deletes "e.g.". I would like to keep this. — WinterMensch, Dec 15 '17 at 08:57

Cath · Accepted Answer · 2017-12-15T09:32:27.760

In your pattern, you can specify that you want to split at a punctuation mark followed by a space, but only if at least 3 alphanumeric characters come directly before it (using a look-behind). This results in:

unlist(strsplit(text, "(?<=[[:alnum:]]{3})[?!.]\\s", perl=TRUE))
#[1] "First sentence"                  "This is a second sentence"       "I like pets e.g. cats or birds."

If you want to keep the punctuation marks, then you can add the pattern inside the look-behind and only split on space:

unlist(strsplit(text, "(?<=[[:alnum:]]{3}[?!.])\\s", perl=TRUE))
# [1] "First sentence."                 "This is a second sentence."      "I like pets e.g. cats or birds."

text2 <- "I like pets (cats and birds) and horses. I have 1.8 bn. horses."

unlist(strsplit(text2, "(?<=[[:alnum:]]{3}[?!.])\\s", perl=TRUE))
#[1] "I like pets (cats and birds) and horses." "I have 1.8 bn. horses."

N.B.: If there may be more than one space after the punctuation mark, use \\s+ instead of \\s in the pattern.
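
To tie this back to the question, here is a minimal sketch of the asker's split_by_sentence() rebuilt around the look-behind pattern, with \\s+ so that runs of spaces are handled. The sample string text3 is made up for illustration, and base R's trimws() and nzchar() are swapped in for the stringi call; treat it as one possible arrangement rather than the definitive version.

```r
# Sketch: split_by_sentence() using the look-behind pattern from this answer.
split_by_sentence <- function(text) {
  # Split on runs of whitespace that follow 3 alphanumerics plus one of . ! ?
  pattern <- "(?<=[[:alnum:]]{3}[?!.])\\s+"
  result <- trimws(unlist(strsplit(text, pattern, perl = TRUE)))
  # Drop empty pieces; return "" if nothing is left (as in the original function)
  result <- result[nzchar(result)]
  if (length(result) == 0) "" else result
}

# Hypothetical sample with irregular spacing between sentences
text3 <- "First sentence.  This is a second sentence!   I like pets e.g. cats or birds."
sentences3 <- split_by_sentence(text3)
sentences3
# [1] "First sentence."                 "This is a second sentence!"      "I like pets e.g. cats or birds."
```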

@WinterMensch see the edit, it should work with your data now (except if you have shortcuts with 3 or more letters but then it could be word so...). Let me know if it's ok. (I also changed the marks so it's only dot, exclamation and question marks) — Cath, Dec 15 '17 at 09:38
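
On the caveat about abbreviations of three or more letters (for example "etc."): one possible workaround, which is not part of the answer above, is to list the known abbreviations explicitly as negative look-behinds instead of relying on the three-character rule. A minimal sketch, assuming a made-up sample text4 and only the two abbreviations shown; any other abbreviation (such as "bn." in text2) would need its own look-behind.

```r
# Sketch: exclude an explicit list of abbreviations instead of counting characters.
# text4 and the abbreviation list are assumptions for illustration only.
text4 <- "First sentence. I like pets e.g. cats, dogs etc. and birds. Done."

# Split on whitespace that follows . ! or ?, unless it directly follows "e.g." or "etc."
pattern4 <- "(?<=[?!.])(?<!e\\.g\\.)(?<!etc\\.)\\s+"

sentences4 <- unlist(strsplit(text4, pattern4, perl = TRUE))
sentences4
# [1] "First sentence."                             "I like pets e.g. cats, dogs etc. and birds."
# [3] "Done."
```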

score 3 · Answer 2 · edited Nov 19 '21 at 01:48

library(tokenizers)

text = "First sentence. This is a second sentence. I like pets e.g. cats or birds."
tokenize_sentences(text)

Output is:

[[1]]
[1] "First sentence."                 "This is a second sentence."      "I like pets e.g. cats or birds."

Prem · answered Dec 15 '17 at 09:00 · edited by Nimantha

Works nicely. But actually I have German texts, and this function splits, for example, at "z.B." (the German equivalent of "e.g."). If this could be fixed, I'd be very happy. – WinterMensch Dec 15 '17 at 09:33
In that case the regex solution suggested by @Cath is a better option, as built-in functions like `Maxent_Sent_Token_Annotator` in `openNLP` also fail for this example. – Prem Dec 15 '17 at 11:54
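
As a quick check of that suggestion, the look-behind pattern from the accepted answer can be run against a German sentence containing "z.B.". The sample text_de below is made up for illustration; the pattern is the one from @Cath's answer.

```r
# Sketch: the accepted answer's pattern applied to a hypothetical German sample.
text_de <- "Ich mag Tiere z.B. Katzen oder Hunde. Das ist ein zweiter Satz."

# "z.B." is not split because fewer than 3 alphanumerics precede its final dot
sentences_de <- unlist(strsplit(text_de, "(?<=[[:alnum:]]{3}[?!.])\\s", perl = TRUE))
sentences_de
# [1] "Ich mag Tiere z.B. Katzen oder Hunde." "Das ist ein zweiter Satz."
```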
