
I have a text from which I want to extract the first two paragraphs. The text consists of several paragraphs separated by empty lines. The paragraphs themselves can contain line breaks. What I want to extract is everything from the beginning of the text up to the second empty line. This is the original text:

Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.

Buy me a Beer: https://www.buymeacoffee.com/johnnyfd

Support the GoFundMe: http://gofundme.com/f/send-money-dire...

Follow Me: 

The text I want to have is:

Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.

Buy me a Beer: https://www.buymeacoffee.com/johnnyfd

I tried to write a regular expression to do the job, and I thought the following might be a possible solution:

(.*|\n)*(?:[[:blank:]]*\n){2,}(.*|\n)*(?:[[:blank:]]*\n){2,}

When I use it in R in stri_extract_all_regex, I receive the following error:

Error in stri_extract_all_regex(video_desc_orig, "(.*|\n)*?(?:[[:blank:]]*\n){2,}(.*?|\n)*(?:[[:blank:]]*\n){2,}") : 
  Regular expression backtrack stack overflow. (U_REGEX_STACK_OVERFLOW)

This is my first time using regex and I really don't know how to interpret this error. Any help appreciated ;)

talocodat
  • Overflows often happen because the regex engine finds more than one way to match a string and backtracks excessively. Indeed, `(.*|\n)*` overlaps with what `(?:[[:blank:]]*\n){2,}` matches; when the engine reaches a point where it can't find a match, it goes back and checks whether there's a different way to assemble the match at the beginning that lets it proceed. You'll want to refactor the subexpressions so that they can never match the same string. – tripleee Dec 13 '22 at 18:27
  • I'm thinking the first subexpression should be forced to match at least one non-blank, something like `(.*[^[:blank:]].*\n)*` perhaps. – tripleee Dec 13 '22 at 18:30

2 Answers

You have nested quantifiers like (.*|\n)*, which create a lot of paths to explore. This pattern, for example, first matches all of the text and then starts to backtrack to fit in the next parts of the pattern.
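The overlap behind that backtracking can be checked directly. The following is a minimal sketch, assuming stringi is installed; the variable names are illustrative only. It anchors each of the two sub-patterns from the question to the whole input and shows that both accept the same single newline, so the engine has more than one way to consume every empty line:

```r
library(stringi)

# The two sub-patterns from the question, each anchored to the whole input
alternation <- "\\A(?:.*|\\n)\\z"
blank_line  <- "\\A(?:[[:blank:]]*\\n)\\z"

newline <- "\n"

# Both sub-patterns can match a lone newline
matches_alternation <- stri_detect_regex(newline, alternation)
matches_blank_line  <- stri_detect_regex(newline, blank_line)

print(c(matches_alternation, matches_blank_line))
# [1] TRUE TRUE
```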

To match the first two paragraphs together with the newlines that separate them, making sure that every matched line contains at least one non-whitespace character:

\A[^\S\n]*\S.*(?:\n[^\S\n]*\S.*)*\n{2,}[^\S\n]*\S.*(?:\n[^\S\n]*\S.*)*

Explanation

  • \A Start of string
  • [^\S\n]*\S.* Match a whole line with at least a single non whitespace char
  • (?:\n[^\S\n]*\S.*)* Optionally repeat all following lines that contain at least a single non whitespace char
  • \n{2,} Match 2 or more newlines
  • [^\S\n]*\S.*(?:\n[^\S\n]*\S.*)* Same as the previous pattern to match the lines for the second paragraph
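The pieces explained above can also be assembled programmatically, which may make the pattern easier to read and adjust. This is only a sketch, assuming stringi is installed; the names `line`, `paragraph`, `pattern` and the sample `text` are illustrative and not part of the pattern itself:

```r
library(stringi)

# One line that contains at least one non-whitespace character
line <- "[^\\S\\n]*\\S.*"

# A paragraph: one such line, followed by any number of further such lines
paragraph <- paste0(line, "(?:\\n", line, ")*")

# Two paragraphs separated by two or more newlines, anchored at the start
pattern <- paste0("\\A", paragraph, "\\n{2,}", paragraph)

text <- "first line\nsecond line\n\nsecond paragraph\n\nthird paragraph"
result <- stri_extract_first_regex(text, pattern)

print(result)
# [1] "first line\nsecond line\n\nsecond paragraph"
```

Because a paragraph line must contain a non-whitespace character and the separator consists only of newlines, the two parts cannot match the same text, which keeps backtracking small.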

See a regex demo and an R demo.

Example

library(stringi)

string <- 'Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.

Buy me a Beer: https://www.buymeacoffee.com/johnnyfd

Support the GoFundMe: http://gofundme.com/f/send-money-dire...

Follow Me: '


stri_extract_all_regex(
  string,
  '\\A[^\\S\\n]*\\S.*(?:\\n[^\\S\\n]*\\S.*)*\\n{2,}[^\\S\\n]*\\S.*(?:\\n[^\\S\\n]*\\S.*)*'
)

Output

[[1]]
[1] "Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.\nThen I went to a nice restaurant with them.\n\nBuy me a Beer: https://www.buymeacoffee.com/johnnyfd"
The fourth bird
  • Sorry, but this regex isn't doing the job with every text. More than two paragraphs are selected, as well as paragraphs appearing at the bottom of the text. Here is an example: https://regex101.com/r/Zebn3n/1. I really want just the first two paragraphs (containing line breaks) and nothing more. – talocodat Dec 14 '22 at 10:27
  • Where are there more than 2 paragraphs selected? If you want the last lines as well, you can use the pattern at the end of the answer with the lookahead. See https://regex101.com/r/auefcC/1 – The fourth bird Dec 14 '22 at 10:29
  • @talocodat This matches the first 2 paragraphs https://regex101.com/r/GQrPp6/1 – The fourth bird Dec 14 '22 at 10:30
  • Is it possible to make this regex work too if the paragraphs are separated by 1 OR MORE empty lines? – talocodat Dec 14 '22 at 11:20
  • @talocodat But of course :-) I have updated the answer with an explanation https://regex101.com/r/Hyq0tL/1 – The fourth bird Dec 14 '22 at 11:36
  • But why isn't this one working? https://regex101.com/r/s9MicJ/1 – talocodat Dec 14 '22 at 12:38
  • @talocodat Because there is a space on line 3 and the pattern matches 2 newlines. Then you could write it like this https://regex101.com/r/n9Vc60/1 – The fourth bird Dec 14 '22 at 12:47
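The case discussed in the last comments (paragraphs separated by one or more empty lines, where an "empty" line may still contain spaces or tabs) could be handled along these lines. This is a hedged sketch and not necessarily the pattern behind the linked demos; it assumes stringi is installed, and the names `line`, `paragraph`, `separator` and the sample `text` are made up for illustration:

```r
library(stringi)

# One line that contains at least one non-whitespace character
line <- "[^\\S\\n]*\\S.*"

# A paragraph: one such line, followed by any number of further such lines
paragraph <- paste0(line, "(?:\\n", line, ")*")

# The newline ending the paragraph, then one or more lines that are
# empty or contain only spaces/tabs
separator <- "\\n(?:[^\\S\\n]*\\n)+"

pattern <- paste0("\\A", paragraph, separator, paragraph)

# The separator after "b" contains a line with a single space;
# the separator after "d" contains a line with a single tab
text <- "a\nb\n \n\nc\nd\n\t\ne"
result <- stri_extract_first_regex(text, pattern)

print(result)
# [1] "a\nb\n \n\nc\nd"
```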

In R string literals you need to double every backslash (\\).

string <- 'Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.

Buy me a Beer: https://www.buymeacoffee.com/johnnyfd

Support the GoFundMe: http://gofundme.com/f/send-money-dire...

Follow Me: '

library(stringr)

string |>
  str_extract('(.*|\\n)*(?:[[:blank:]]*\\n){2,}(.*|\\n)*(?:[[:blank:]]*\\n){2,}') |>
  cat()

# Output
Today I meet my friends in Kyiv to celebrate my new permanent residency status in Ukraine.
Then I went to a nice restaurant with them.

Buy me a Beer: https://www.buymeacoffee.com/johnnyfd

Bensstats