This is my regex: `[^\.!\?]+[!\?\.]`

I want to split a post into sentences reliably using a JavaScript regex. The problem is that a dot (.) between characters with no surrounding spaces, as in the number 90.000, is treated as a sentence end, so the text is split where it should stay together.

For example: "Apa yang terjadi? Aku terkena musibah! Uang saya 90.000 dicuri maling."

Uang saya 90.

and

000 dicuri maling.

should merge into

Uang saya 90.000 dicuri maling.
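A minimal sketch that reproduces the behaviour described above. It assumes the pattern is used with the global flag and `String.prototype.match`; the variable names are only for illustration.

```javascript
// The pattern from the question, with the global flag so match() returns every hit.
const originalPattern = /[^\.!\?]+[!\?\.]/g;
const post = "Apa yang terjadi? Aku terkena musibah! Uang saya 90.000 dicuri maling.";

// trim() removes the leading space that each match after the first carries.
const pieces = post.match(originalPattern).map((s) => s.trim());

console.log(pieces);
// [ 'Apa yang terjadi?', 'Aku terkena musibah!', 'Uang saya 90.', '000 dicuri maling.' ]
```

The character class `[^\.!\?]` stops at every dot, so the dot inside `90.000` ends a match even though no space follows it.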


Hedi Herdiana

3 Answers

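Try splitting on `([.!?])\s`, which only treats punctuation followed by whitespace as a sentence end, and then rejoin each sentence with its captured punctuation:

```javascript
const text = "Apa yang terjadi? Test test test. Aku terkena musibah! Uang saya 90.000 dicuri maling.";
// Split on ., ! or ? followed by whitespace; the capture group keeps the punctuation in the result.
const parts = text.split(/([.!?])\s/);
const res = [];
// parts alternates between sentence text (even index) and its punctuation (odd index).
for (let i = 0; i < parts.length; i += 2) {
  const punctuation = i + 1 < parts.length ? parts[i + 1] : "";
  res.push(parts[i] + punctuation);
}
console.log(res);
```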
Mamun

This should work in most cases.

(?=[^ ]|^).+?[?!.](?= |$|\n)

Checked here: https://regexr.com/
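As a quick check in JavaScript rather than in the online tester, here is a small sketch that applies the pattern above with the global flag. The variable names are hypothetical and the input is the example string from the question.

```javascript
// Pattern from this answer; the lookahead only accepts punctuation
// that is followed by a space, a newline, or the end of the input.
const sentencePattern = /(?=[^ ]|^).+?[?!.](?= |$|\n)/g;
const input = "Apa yang terjadi? Aku terkena musibah! Uang saya 90.000 dicuri maling.";

// match() returns null when nothing matches, so fall back to an empty array.
const found = input.match(sentencePattern) || [];

console.log(found);
// [ 'Apa yang terjadi?', 'Aku terkena musibah!', 'Uang saya 90.000 dicuri maling.' ]
```

The dot in `90.000` is followed by a digit rather than a space, so the lookahead rejects it and the lazy `.+?` keeps extending to the real sentence end.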

CSharpFiasco

Even better, you can use the following pattern. It accepts several spaces or other whitespace characters after the sentence-ending character, and the leading whitespace is not part of the extracted string:

[^\s].+?[?!.](?=\s+|$)

Limitations:

  • Abbreviations such as 10 B.C. will be detected as the end of a sentence.
  • Strings like terkena musibah!Uang saya 90.000 dicuri maling. will be detected as one sentence, because no whitespace follows the exclamation mark.
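The behaviour and both limitations can be checked with a short sketch. The helper name `splitSentences` is made up for this example, and the second input is an illustrative string built from the words used in this thread.

```javascript
// First pattern from this answer, applied globally.
const simplePattern = /[^\s].+?[?!.](?=\s+|$)/g;

// Hypothetical helper: returns all matches, or an empty array if there are none.
const splitSentences = (text) => text.match(simplePattern) || [];

// Several spaces between sentences are skipped and not included in the matches.
const spaced = splitSentences("Apa yang terjadi?   Aku terkena musibah! Uang saya 90.000 dicuri maling.");
console.log(spaced);
// [ 'Apa yang terjadi?', 'Aku terkena musibah!', 'Uang saya 90.000 dicuri maling.' ]

// Limitation 1: the dot after "B.C" is followed by a space, so it ends a match.
const abbreviation = splitSentences("Aku 10 B.C. terkena musibah.");
console.log(abbreviation);
// [ 'Aku 10 B.C.', 'terkena musibah.' ]

// Limitation 2: no whitespace after "!", so the whole string is one match.
const glued = splitSentences("terkena musibah!Uang saya 90.000 dicuri maling.");
console.log(glued);
// [ 'terkena musibah!Uang saya 90.000 dicuri maling.' ]
```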

New version:

I have adapted the regex as follows to address the limitations listed above:

[^\s.!?][a-zA-Z@#$%^&,;"':*()-_+=/\\|{}><()[\]\s\d]*?([?!]|((?<=[^A-Z])\.(?=[^0-9])))

and I have tested it on the following text:

Apa ya{ng terjadi? Ak[u +10 B.C. ter,ke]na 10.3 mus}ibah.Uang say\a 90!000 dic&uri ma|ling.
Apa yang te*r(j)adi? Aku terkena mus%ibah! Uang sa^ya 90.000 dicuri maling.
ter;ke|na mus-ibah?uang saya 90..000 dicuri m"aling.
ter@kena mus+ibah!ua=ng say$a 90?000 dicuri ma'ling.
terk\ena mus#ibah.uang saya 90.000 dicuri maling.
Apa yang terjadi? Aku 10 B. C. terke\na mu/sibah.Uang saya 90!000 dicuri maling.
Apa yang terjadi? Aku -10 B. C. terke\na mu/sibah. Uang saya 90!000 dicuri maling.

Advantages:

Abbreviations are preserved: Ak[u +10 B.C. ter,ke]na 10.3 mus}ibah. is seen as one sentence, preserving the B.C.

terkena musibah!Uang saya 90.000 dicuri maling. is separated into two sentences: terkena musibah! and Uang saya 90.000 dicuri maling.
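A rough sketch for trying the new version in JavaScript. It assumes an engine that supports lookbehind assertions (ES2018, for example a recent Node.js), and it builds the pattern with `String.raw` only to avoid escaping it by hand. The input strings are illustrative.

```javascript
// New-version pattern from this answer, copied verbatim into a raw template string.
const newPattern = new RegExp(
  String.raw`[^\s.!?][a-zA-Z@#$%^&,;"':*()-_+=/\\|{}><()[\]\s\d]*?([?!]|((?<=[^A-Z])\.(?=[^0-9])))`,
  "g"
);

// With a trailing newline, as in the multi-line test text above.
const withNewline = "Aku 10 B.C. terkena musibah!Uang saya 90.000 dicuri maling.\n";
const resultWithNewline = withNewline.match(newPattern) || [];
console.log(resultWithNewline);
// [ 'Aku 10 B.C. terkena musibah!', 'Uang saya 90.000 dicuri maling.' ]

// Same text without any character after the final dot.
const withoutNewline = "Aku 10 B.C. terkena musibah!Uang saya 90.000 dicuri maling.";
const resultWithoutNewline = withoutNewline.match(newPattern) || [];
console.log(resultWithoutNewline);
// [ 'Aku 10 B.C. terkena musibah!' ]
```

One thing to be aware of: the lookahead `(?=[^0-9])` needs a character after the dot, so a sentence that ends exactly at the end of the input is not matched. Replacing it with `(?![0-9])` might be one way around that, but that variant is an untested assumption and not part of the answer.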

Good luck!

Allan
  • [new version] not working in tester https://www.regextester.com/?fam=99639 – Hedi Herdiana Nov 29 '17 at 07:31
  • I am very interested in regex `[^\s].+?[?!.](?=\s+|$)` but it would be better if 3 digits before dot(.) merged with the sentence afterwards. Example: `10. Ten.` and `100. One hundred.` Look https://www.regextester.com/?fam=99651 – Hedi Herdiana Nov 30 '17 at 01:08
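One possible way to get the behaviour asked for in the last comment is to add a negative lookbehind so that a dot directly preceded by a digit never ends a sentence. This is only a sketch, not something proposed in the thread: the `(?<!\d\.)` part is an assumption, and it requires an engine with lookbehind support.

```javascript
// Variant of [^\s].+?[?!.](?=\s+|$) where a dot right after a digit
// is not accepted as a sentence end (hypothetical adaptation).
const numberedPattern = /\S.+?[?!.](?<!\d\.)(?=\s+|$)/g;

const numbered = "10. Ten. 100. One hundred.".match(numberedPattern) || [];
console.log(numbered);
// [ '10. Ten.', '100. One hundred.' ]

// The example from the question still yields three sentences.
const fromQuestion =
  "Apa yang terjadi? Aku terkena musibah! Uang saya 90.000 dicuri maling.".match(numberedPattern) || [];
console.log(fromQuestion);
// [ 'Apa yang terjadi?', 'Aku terkena musibah!', 'Uang saya 90.000 dicuri maling.' ]
```

The trade-off is that a sentence which really ends in a number, such as one finishing with `90.`, is merged with the sentence that follows it.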