String Splitting and Truncating using Regular Expressions in R

Question

I am looking for help implementing a function in R to truncate the ls (level stream) string column of my data frame, and I haven't had much luck yet. When pre_quiz_score is not NA for a row, I want to remove the beginning of the string up to and including the first | character. When post_quiz_score is not NA for that row, I want to remove the last | character and everything after it.

df <- data.frame(ls = c('123 L0=38/42|425 L0=40/42', NA, '482 L7=7/12|789 L8=5/6|523 L9=2/6'), 
                 pre_quiz_score = c(88, NA, 12), 
                 post_quiz_score = c(NA, NA, 90))

I want to implement this in a "tidyverse" way and vectorized to get something like

| ls           | pre_quiz_score | post_quiz_score |
|--------------|----------------|-----------------|
| 425 L0=40/42 | 88             | NA              |
| NA           | NA             | NA              |
| 789 L8=5/6   | 12             | 90              |

So far, I haven't gotten stringr::str_split, gsub, or sub to work correctly, mostly because I end up removing either just the | characters or everything except the last | and what follows it.

I hope that makes sense, thanks!

score 4 · Answer 1 · answered Dec 20 '16 at 04:47

We can use sub from base R

df$ls <- sub("^[^|]+\\|([^|]+).*", "\\1", df$ls)
df
#            ls pre_quiz_score post_quiz_score
#1 425 L0=40/42             88              NA
#2         <NA>             NA              NA
#3   789 L8=5/6             12              90

Explanation

We match one or more characters that are not a | ([^|]+) from the start (^) of the string, followed by a | (escaped as \\| because it is a metacharacter). We then capture one or more characters that are not a | as a group (the parenthesised ([^|]+)), followed by any remaining characters up to the end of the string (.*). The whole match is replaced with the backreference to the captured group (\\1; there is only one capture group, so it is group 1).
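To see what each part of the pattern contributes, here is a small sketch that runs it on the sample strings from the question. The names x, second and no_pipe are illustrative and not part of the original answer.

```r
# Sample strings from the question (x is an illustrative name)
x <- c("123 L0=38/42|425 L0=40/42", NA, "482 L7=7/12|789 L8=5/6|523 L9=2/6")

# ^[^|]+    first segment, anchored at the start of the string
# \\|       the literal first "|"
# ([^|]+)   second segment, captured as group 1
# .*        whatever follows, including any later segments
second <- sub("^[^|]+\\|([^|]+).*", "\\1", x)
print(second)

# A string with no "|" does not match the pattern, so sub() returns it unchanged
no_pipe <- sub("^[^|]+\\|([^|]+).*", "\\1", "123 L0=38/42")
print(no_pipe)
```

Note that this always keeps the second segment and does not look at pre_quiz_score or post_quiz_score. It reproduces the expected output for the sample data, but it may not match the stated rule on rows where only one of the two scores is present and the string has more than two segments.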

or `gsub("\\|([^|]+)|.", "\\1", df$ls)` but it might require more explanation :} — rawr, Dec 20 '16 at 04:54

score 3 · Accepted Answer · answered Dec 20 '16 at 03:49

Just implement the logic as you stated it:

library(stringi)
library(dplyr)

df <- data.frame(ls = c('123 L0=38/42|425 L0=40/42', NA, '482 L7=7/12|789 L8=5/6|523 L9=2/6'),
                 pre_quiz_score = c(88, NA, 12),
                 post_quiz_score = c(NA, NA, 90),
                 stringsAsFactors=FALSE)


df %>%
  mutate(ls=ifelse(!is.na(pre_quiz_score),
                   stri_replace_first_regex(ls, "^[[:alnum:][:blank:]=/]+\\|", ""), ls),
         ls=ifelse(!is.na(post_quiz_score),
                   stri_replace_last_regex(ls, "\\|[[:alnum:][:blank:]=/]+$", ""), ls))
##             ls pre_quiz_score post_quiz_score
## 1 425 L0=40/42             88              NA
## 2         <NA>             NA              NA
## 3   789 L8=5/6             12              90
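For readers without stringi, the same two-step logic can be sketched in base R. This is only an illustration under a few assumptions: trim_ls is a hypothetical helper name, and the simpler class [^|]* is swapped in for the character class used in the answer above.

```r
# Hypothetical helper (not from the original answer): applies both
# truncation rules to a character vector, row by row.
trim_ls <- function(ls, pre, post) {
  ls <- as.character(ls)  # guard against factor columns
  # Rule 1: where pre is present, drop everything up to and including the first "|"
  ls <- ifelse(!is.na(pre), sub("^[^|]*\\|", "", ls), ls)
  # Rule 2: where post is present, drop the last "|" and everything after it
  ls <- ifelse(!is.na(post), sub("\\|[^|]*$", "", ls), ls)
  ls
}

ls_in <- c("123 L0=38/42|425 L0=40/42", NA, "482 L7=7/12|789 L8=5/6|523 L9=2/6")
pre   <- c(88, NA, 12)
post  <- c(NA, NA, 90)

trimmed <- trim_ls(ls_in, pre, post)
print(trimmed)
```

Inside a pipeline this could be called as mutate(ls = trim_ls(ls, pre_quiz_score, post_quiz_score)), assuming dplyr is loaded.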

This is exactly what I was trying to implement! Thanks for introducing me to these useful stringi functions. I'm not an expert on regular expressions, so I'll read about yours, but this worked. =) — Brian Becker, Dec 20 '16 at 04:43
Cool. Ticking the answer button wld help others know this was accurate. — hrbrmstr, Dec 20 '16 at 13:23

score 2 · Answer 3 · joel.wilson · 2016-12-20T06:40:28.273

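library(dplyr)
# Split each string on "|" and keep the second piece; refer to the column as
# `ls` inside mutate() and coerce to character in case it is a factor.
df %>% mutate(ls = sapply(strsplit(as.character(ls), "\\|"), function(x) x[2]))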

#            ls pre_quiz_score post_quiz_score
#1 425 L0=40/42             88              NA
#2         <NA>             NA              NA
#3   789 L8=5/6             12              90
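A short sketch of what strsplit() returns may make the choice of x[2] clearer. The names ls_in, pieces and second are illustrative, and vapply() is used here in place of sapply() only so the result type is fixed.

```r
ls_in <- c("123 L0=38/42|425 L0=40/42", NA, "482 L7=7/12|789 L8=5/6|523 L9=2/6")

# "|" is a regex metacharacter, so it has to be escaped in the split pattern
pieces <- strsplit(ls_in, "\\|")

# One list element per input string; an NA input yields a single NA piece
n_pieces <- lengths(pieces)
print(n_pieces)

# Take the second piece of each element; indexing past the end gives NA
second <- vapply(pieces, function(x) x[2], character(1))
print(second)
```

Like the sub() approach, this picks the second piece unconditionally and does not consult the score columns.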

edited Dec 20 '16 at 06:40

answered Dec 20 '16 at 04:13


This doesn't give the right answer - the first `ls` should start `425` – thelatemail Dec 20 '16 at 06:12
@thelatemail i had to extract the `x[2]` instead of `x[1]` – joel.wilson Dec 20 '16 at 06:41

score 0 · Answer 4 · Jonathan Carroll · 2016-12-20T22:27:53.323

tidyr::separate() allows you to split up a column into sub-columns. With the extra = "drop" argument it will keep only up to length(into) columns.

library(tidyr)
separate(df, ls, c("remove", "keep"), sep="\\|", extra = "drop")

#>         remove         keep pre_quiz_score post_quiz_score
#> 1 123 L0=38/42 425 L0=40/42             88              NA
#> 2         <NA>         <NA>             NA              NA
#> 3  482 L7=7/12   789 L8=5/6             12              90

I've kept the part before the first | (the remove column), but you can drop that too if you don't need it.

edited Dec 20 '16 at 22:27

answered Dec 20 '16 at 04:29


I've been using `tidyr::separate` to do some future parsing on this "level stream" data, but this isn't quite the output I needed, and some of these `ls` strings are 2k+ character strings. Once I trim up the first and last 'parts' of `ls` then I use `tidyr::separate` with its default arguments to start converting data types and parse the timestamps appropriately. – Brian Becker Dec 20 '16 at 04:48
