I have the following character vector that I need to modify with gsub.

strings <- c("x", "pm2.5.median", "rmin.10000m", "rmin.2500m", "rmax.5000m")

Desired output of filtered strings:

"x", "pm2.5.median", "rmin", "rmin", "rmax"

My current attempt works for everything except the pm2.5.median string which has dots that need to be preserved. I'm really just trying to remove the buffer size that is appended to the end of each variable, e.g. 1000m, 2500m, 5000m, 7500m, and 10000m.

gsub("\\..*m$", "", strings)
"x", "pm2", "rmin", "rmin", "rmax"
philiporlando

2 Answers

Match a dot, one or more digits, an m, and the end of the string, and replace that with the empty string. Note that we prefer sub to gsub here because we are only interested in one replacement per string.

sub("\\.\\d+m$", "", strings)
## [1] "x"            "pm2.5.median" "rmin"         "rmin"         "rmax"   
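As a quick check of the sub versus gsub remark, the sketch below runs both on the question's vector. The variable names `pattern`, `once`, and `every` are made up for illustration. Because the pattern is anchored with `$`, it can match at most once per string, so the two functions should agree.

```r
strings <- c("x", "pm2.5.median", "rmin.10000m", "rmin.2500m", "rmax.5000m")
pattern <- "\\.\\d+m$"  # dot, one or more digits, m, end of string

once  <- sub(pattern, "", strings)   # replaces the first match only
every <- gsub(pattern, "", strings)  # replaces every match

# The `$` anchor allows at most one match per string,
# so both calls return the same vector.
identical(once, every)
## [1] TRUE
```

Using sub simply states the intent (one replacement per string) more directly.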
G. Grothendieck

The .* pattern matches any zero or more chars, as many as possible. The \..*m$ pattern therefore matches the first (leftmost) . in the string and then grabs all the text after it, provided the string ends with m.

You need

> sub("\\.[^.]*m$", "", strings)
[1] "x"            "pm2.5.median" "rmin"         "rmin"         "rmax" 

Here, \.[^.]*m$ matches ., then 0 or more chars other than a dot and then m at the end of the string.

Details

  • \. - a dot (must be escaped since it is a special regex char otherwise)
  • [^.]* - a negated character class matching any char but . 0 or more times
  • m - an m char
  • $ - end of string.
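To make the difference between the greedy .* and the negated class concrete, here is a minimal sketch. The input "pm2.5.median.1000m" is a hypothetical string, not one from the question, chosen because it contains several dots and ends with a buffer suffix.

```r
# Hypothetical input: dotted name with a buffer suffix appended
s <- "pm2.5.median.1000m"

# Greedy: starts at the first dot and consumes everything up to the final m
greedy <- sub("\\..*m$", "", s)
greedy
## [1] "pm2"

# Negated class: [^.]* cannot cross a dot, so only the last segment is removed
last_segment <- sub("\\.[^.]*m$", "", s)
last_segment
## [1] "pm2.5.median"
```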
Wiktor Stribiżew
  • Wiktor can you tell why you used this pattern `[^.]*` instead of `\\d+` as in G.Grothendieck's answer? I.e. what is the advantage here? (I am a regex beginner still) – markus Feb 15 '19 at 21:30
    @markus Judging by the current sample input, the strings are period-separated. The `.` that should be matched is the last dot in the string. So, the point is to match the dot that has no dots up to the end of the string, and `[^.]*` fits this need. It is preferable if there may be other chars than just digits between the last `.` and final `m`. – Wiktor Stribiżew Feb 15 '19 at 21:33
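
The trade-off discussed in these comments can be sketched with two hypothetical inputs that do not appear in the question: "rmin.10km", where non-digit chars sit between the last dot and the final m, and "pm2.5.sum", where the last segment merely happens to end in m.

```r
# Hypothetical inputs, for illustration only
extra <- c("rmin.10km", "pm2.5.sum")

# Digits-only pattern: matches neither string, so both are left unchanged
digits_only <- sub("\\.\\d+m$", "", extra)
digits_only
## [1] "rmin.10km" "pm2.5.sum"

# Negated class: strips any final dot-free segment that ends in m
any_segment <- sub("\\.[^.]*m$", "", extra)
any_segment
## [1] "rmin"  "pm2.5"
```

So `[^.]*` is the more permissive choice and `\\d+` the stricter one. Which is preferable depends on whether the suffixes are always digits followed by m.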