I would like to know why gsub and stringi give me two different output strings. Does the metacharacter "." not match newlines in stringi? Does stringi read the input "line by line"?

By the way, I did not find any way to perform the "correct" substitution with stringi, so I had to use gsub here.

library(stringi)
string <- "is it normal?\n\nhttp://www.20minutes.fr"

> gsub(" .*?http"," http", string)
[1] "is http://www.20minutes.fr"

> stri_replace_all_regex(string, " .*?http"," http")
[1] "is it normal?\n\nhttp://www.20minutes.fr"
Dario Lacan
  • Try `stri_replace_all_regex(string, " .*?http"," http", opts_regex = stri_opts_regex(dotall = TRUE))`. – lukeA Apr 15 '15 at 09:47
  • @lukeA I think you could post the comment as an answer – akrun Apr 15 '15 at 10:07
  • Yep. This also works: `stri_replace_all_regex(string, "(?s) .*?http"," http")`. Still, I find this behaviour weird! – Dario Lacan Apr 15 '15 at 10:28

2 Answers

One way is to make `.` also match line terminators, instead of stopping at the end of a line:

stri_replace_all_regex(string, " .*?http"," http", 
                       opts_regex = stri_opts_regex(dotall = TRUE))
lukeA
  • Do you know why they changed the standard R (POSIX) regex behaviour? Is it in Perl that the dot does not match newlines? – Dario Lacan Apr 15 '15 at 18:08

By default (for historical reasons, see this tutorial), a dot does not match a newline character in most regex engines. As @lukeA suggested, to match newlines you can set the `dotall` option to TRUE in stringi's regex-based functions.

By the way, gsub(..., perl=TRUE) gives results consistent with stringi.
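To make the comparison concrete, here is a minimal sketch in base R only, reusing the string from the question. The variable names are just for illustration, and the inline `(?s)` flag is the one mentioned in the comments, applied here to the PCRE engine rather than to stringi:

```r
string <- "is it normal?\n\nhttp://www.20minutes.fr"

# Default engine: "." also matches "\n", so the match runs across the blank line.
default_result <- gsub(" .*?http", " http", string)

# perl = TRUE (PCRE): "." stops at "\n", so nothing matches and the
# string comes back unchanged, as it does with stringi.
perl_result <- gsub(" .*?http", " http", string, perl = TRUE)

# PCRE with the inline (?s) flag: "." matches "\n" again.
perl_dotall_result <- gsub("(?s) .*?http", " http", string, perl = TRUE)

print(default_result)        # "is http://www.20minutes.fr"
print(perl_result == string) # TRUE
print(perl_dotall_result)    # "is http://www.20minutes.fr"
```

So the difference comes from the regex engine in use, not from stringi reading the input line by line.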

gagolews