Improve my Regex to include numbers that contain decimals and percentage signs

Question

I have the following regex, which captures the first N words and then continues to the next period, exclamation point or question mark. I need chunks of text that vary in word count, but each chunk must consist of complete sentences.

regex = (?:\w+[.?!]?\s+){10}(?:\w+,?\s+)*?\w+[.?!]

It works with the following text:

Therapy extract straw and chitosan from shrimp shells alone accounted for 2, 4, 6, 8 and 10% found that the extract straw 8% is highly effective in inhibiting the growth of algae Microcystis spp. The number of cells and the amount of chlorophyll a was reduced during treatment. Both value decreased continuous until the end of the trial.

https://regex101.com/r/ardIQ7/5

However, it doesn't work with the following text:

Therapy extract straw and chitosan from shrimp shells alone accounted for 2, 4, 6, 8 and 10% found that the extract straw 8.2% is highly effective in inhibiting the growth of algae Microcystis spp. The number of cells and the amount of chlorophyll a was reduced during treatment. Both value decreased continuous until the end of the trial.

That is because of the number containing a decimal point and a percent sign (8.2%).

I have been trying to figure out how to capture these items as well, but I need some help pointing me in the right direction. I don't want to capture just the first sentence. I want to capture N words, which may span several sentences, and return only complete sentences.
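To illustrate the failure described above: `\w` matches only letters, digits and underscores, so a `\w+` token cannot span the `.` or the `%` in `8.2%`. The sketch below compares `\w+` with a number-aware alternative. The helper names and the number pattern are assumptions made for illustration, not part of the original regex.

```ruby
# Hypothetical helpers (not from the question): each anchors a token
# pattern so that the whole token has to match.

# \w+ alone: letters, digits and underscore only.
def word_only?(token)
  token.match?(/\A\w+\z/)
end

# \w+, or a number with optional sign, decimal part and percent sign.
def word_or_number?(token)
  token.match?(/\A(?:\w+|-?\d+(?:\.\d+)?%?)\z/)
end

%w[straw 8 8% 8.2%].each do |token|
  puts "#{token}: word_only=#{word_only?(token)} word_or_number=#{word_or_number?(token)}"
end
```

Only the tokens containing `%` or a decimal point are rejected by the `\w+` version, which is why any fix has to widen what counts as a "word".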

I would change @ran_0315's slightly, `[\s\S]*?(?=\.\s|!|\?)` — gmaliar, Sep 20 '18 at 09:27
You need to write, `regex = /(?:\w+[.?!]?\s+){10}(?:\w+,?\s+)*?\w+[.?!]/`. I pointed out in your [earlier question](https://stackoverflow.com/questions/52397635/how-to-select-first-280-words-of-text-up-to-the-closest-period) that this regex does not work when the string contains 10 words with the last word followed by a punctuation mark: `"a1 a2 a3 a4 a5 a6 a7 a8 a9 a10."[regex] #=> nil`. — Cary Swoveland, Sep 20 '18 at 19:19

Nambi_0915 · Answer 1 · 2018-09-20T10:22:03.317

Try this, (?:\S+[,.?!]?\s+){1,200}[\s\S]*?(\. |!|\?)

This will match the N number of characters.

If the Nth character doesn't end a sentence, the match extends lazily to the next sentence-ending punctuation. N should be given in the quantifier as {1,N}.
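A minimal Ruby sketch of how this regex behaves on text containing a decimal percentage. The helper name `leading_sentences` and the sample string are made up for illustration; only the pattern comes from the answer.

```ruby
# Hypothetical helper wrapping the regex from this answer.
# max_words becomes the upper bound of the {1,N} quantifier.
def leading_sentences(text, max_words)
  pattern = /(?:\S+[,.?!]?\s+){1,#{max_words}}[\s\S]*?(\. |!|\?)/
  match = text[pattern]
  # The `\. ` alternative consumes the space after the period, so trim it.
  match && match.strip
end

sample = "Straw extract at 8.2% works well. Cell counts fell. More text follows here."

puts leading_sentences(sample, 3)
puts leading_sentences(sample, 7)
```

Because `\S+` accepts any run of non-whitespace characters, `8.2%` is consumed as a single token, and the `\. ` alternative does not treat the decimal point as a sentence end since no space follows it.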

edited Sep 20 '18 at 10:22

answered Sep 20 '18 at 09:31

Works for the first sentence very well. What I'm trying to do is capture N number of words in a paragraph without getting incomplete sentences. – chell Sep 20 '18 at 09:40
What will be the value of N? – Nambi_0915 Sep 20 '18 at 09:44
Could be 280 words or less – chell Sep 20 '18 at 09:47
@ran_0315 I have adapted my regex using your expression as follows: (?:\S+[,.?!]?\s+){200}[\s\S]*?(?=\. |!|\?). It works, see here: https://regex101.com/r/ardIQ7/8. The only problem is that it does not capture the sentence-ending punctuation. How could I get it to do that? – chell Sep 20 '18 at 10:05
Yes it works very well. I want to accept your answer but it may need a bit of modifying so that it answers my specific question. I'm not capturing N characters but N words. I have used part of your answer to complete my regex as follows: (?:\S+[,.?!]?\s+){N}[\s\S]*?(\. |!|\?) – chell Sep 20 '18 at 10:18
I'd change the regex to: `(?:\S+\s+){1,200}(?:[^.!?]|\.(?!\s))*.` You can remove the `[,.?!]?` part because they are included in the `\S` char. You can replace `[\s\S]*?` with `.*?`, but since I'm no fan of lazy repetition I replaced it with `(?:[^.!?]|\.(?!\s))*` followed by a `.` to capture the last character of sentence. – 3limin4t0r Sep 20 '18 at 12:36
@Johan, if your regex with `{1,200}` replaced by `{1,10}` is `r`, `"a1 a2 a3 a4 a5. a6 a7 a8 a9 a10. a11."[r] #=> "a1 a2 a3 a4 a5. a6 a7 a8 a9 a10. a11."`. I believe `"a1 a2 a3 a4 a5 a6 a7 a8 a9 a10."` should be returned, as it contains the fewest number of sentences whose combined word count is at least `10`. – Cary Swoveland Sep 20 '18 at 21:17
@CarySwoveland You're correct, however so does the regex in the answer (add a space to the end of your string for this to happen). I merely pointed out how it could be optimized. – 3limin4t0r Sep 20 '18 at 21:54
In short you should subtract 1 from N, since the last part matches all characters until the sentence ends. If you use `{1,10}` the end of the sentence is included in the first part of the match, so it adds another sentence. If you want to match 10 words use `{1,9}` and the problem is solved. – 3limin4t0r Sep 20 '18 at 22:11
@Johan, keep in mind we are looking for a substring that starts at the beginning of the string, is comprised of full sentences and contains at least `10` words. – Cary Swoveland Sep 21 '18 at 04:28
@CarySwoveland That is what `(?:\S+\s+){1,9}(?:[^.!?]|\.(?!\s))*.` does. – 3limin4t0r Sep 21 '18 at 08:44
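The off-by-one point discussed in these comments can be checked directly in Ruby. This is a small sketch using the string from the comments; the method name `leading_sentences_optimized` is hypothetical.

```ruby
# Hypothetical wrapper around the regex suggested in the comments.
# `repetitions` is the upper bound of the first quantifier.
def leading_sentences_optimized(text, repetitions)
  text[/(?:\S+\s+){1,#{repetitions}}(?:[^.!?]|\.(?!\s))*./]
end

str = "a1 a2 a3 a4 a5. a6 a7 a8 a9 a10. a11."

# With {1,10} the tenth repetition swallows "a10. ", so the tail
# of the pattern has to run on to the end of the next sentence.
puts leading_sentences_optimized(str, 10)
#=> a1 a2 a3 a4 a5. a6 a7 a8 a9 a10. a11.

# With {1,9} the tail of the pattern finishes the tenth word instead.
puts leading_sentences_optimized(str, 9)
#=> a1 a2 a3 a4 a5. a6 a7 a8 a9 a10.
```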

Cary Swoveland · Accepted Answer · 2018-09-20T21:14:24.903

r = /
    (?:           # begin a non-capture group
      (?:           # begin a non-capture group
        \p{Alpha}+  # match one or more letters
      |           # or
        \-?       # optionally match a minus sign
        (?:       # begin non-capture group
          \d+     # match one or more digits
        |         # or
          \d+     # match one or more digits
          \.      # match a decimal point
          \d+     # match one or more digits
        )         # end non-capture group
        %?        # optionally match a percentage character
      )           # end non-capture group
      [,;:.!?]?   # optionally ('?' following ']') match a punctuation char
      [ ]+        # match one or more spaces      
    )             # end non-capture group
    {9,}?         # execute the preceding non-capture group at least 9 times, lazily ('?')
    (?:           # begin a non-capture group
      \p{Alpha}+  # match one or more letters
      |           # or
      \-?         # optionally match a minus sign
        (?:       # begin non-capture group
          \d+     # match one or more digits
        |         # or
          \d+     # match one or more digits
          \.      # match a decimal point
          \d+     # match one or more digits
        )         # end non-capture group
      %?          # optionally match a percentage character
    )             # end non-capture group  
    [.!?]         # match one of the three punctuation characters
    (?!\S)        # negative look-ahead: do not match a non-whitespace char
    /x            # free-spacing regex definition mode

Let text equal the paragraph you wish to examine ("Therapy extract straw...end of the trial.")

Then

text[r]
  #=> "Therapy extract straw and chitosan from...the growth of algae Microcystis spp."

We can simplify the construction of the regex (and avoid duplicate bits) as follows.

def construct_regex(min_nbr_words)
  common_bits = /(?:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)/
  /(?:#{common_bits}[,;:.!?]? +){#{min_nbr_words},}?#{common_bits}[.!?](?!\S)/
end

r = construct_regex(10)
  #=> /(?:(?-mix:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)[,;:.!?]? +){10,}?(?-mix:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)[.!?](?!\S)/

This regex could be simplified if it were permitted to match nonsense words such as "ab2.3e%" or "2.3.2%". As presently defined, the regex will not match such words.
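As a quick check that decimals and percent signs are handled, here is a sketch that runs `construct_regex` on a short made-up sample. The method is repeated so the snippet runs on its own, and the sample text is an assumption for illustration.

```ruby
def construct_regex(min_nbr_words)
  common_bits = /(?:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)/
  /(?:#{common_bits}[,;:.!?]? +){#{min_nbr_words},}?#{common_bits}[.!?](?!\S)/
end

# Made-up sample containing both "8.2%" and "10%".
sample = "Dose was 8.2% today. Cells fell by 10% overall. Done."

puts sample[construct_regex(3)]
#=> Dose was 8.2% today.

puts sample[construct_regex(4)]
#=> Dose was 8.2% today. Cells fell by 10% overall.
```

Note that the quantified group counts only the words before the final one, so `construct_regex(n)` as written matches at least n + 1 words. This is consistent with the free-spacing version above using `{9,}?` for a ten-word minimum.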

Thanks Cary incredible explanation. This works well as well. The only issue it has is with periods in acronyms such as Y.M.A. or Mr. Jones. It sees those as being the end of a sentence. — chell, Sep 21 '18 at 03:28
Regarding `"Y.M.A."`, suppose it were `"Y.M.A. Limited is..."`. We know the third period is not the end of a sentence because of the presence of the previous two. At the expense of considerable complexity, that might be accommodated in the regular expression, but I see no way to deal with `"Mr. Jones"`. It's not really a limitation of regular expressions, but a logical problem. We, as humans, know that `"Mr."` does not end a sentence, but would be hard-pressed to explain why in words that could be translated to a regular expression. — Cary Swoveland, Sep 21 '18 at 04:22
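The `"Mr."` limitation mentioned in these comments is easy to reproduce. This sketch repeats `construct_regex` from the answer so that it runs on its own; the sample sentence is made up for illustration.

```ruby
def construct_regex(min_nbr_words)
  common_bits = /(?:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)/
  /(?:#{common_bits}[,;:.!?]? +){#{min_nbr_words},}?#{common_bits}[.!?](?!\S)/
end

# Made-up sample: the period after "Mr" is followed by whitespace,
# so the regex cannot tell it apart from a sentence end.
text = "We met Mr. Jones today. He left."

puts text[construct_regex(2)]
#=> We met Mr.
```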
