set.seed(1)

example <- paste0(
  c("A","B")[sample(1:2,size = 100,replace = TRUE)],
  sample(1:9999,100,replace=TRUE),
  c("A","B","C")[sample(1:3,size = 100,replace = TRUE)],
  sample(1:12,100,replace=TRUE)
)

strsplit(
  sub(pattern = "^(A|B)([0-9]{1,4})(A|B|C)([0-9]{1,2})$",
    replacement = "\\1 \\2 \\3 \\4",
    x = example),
  split = " ",
  fixed = TRUE)

I want to do the same thing as in the code above, i.e. choose some rigid regex groups and split between those groups.

But I want a one-liner in base R: can the same thing be done using only strsplit and a regexp? That is, without first inserting delimiters and then splitting on those delimiters.

Arnaud Feldmann

1 Answer

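We can use regex lookarounds in base R's strsplit to split at every boundary between a non-digit and a digit, in either direction. The lookarounds match a position without consuming any characters, so nothing is lost in the split: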

out2 <- strsplit(example, "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", perl = TRUE)

If it should be specific to 'A', 'B', 'C':

out2 <- strsplit(example, "(?<=[A-C])(?=\\d)|(?<=\\d)(?=[A-C])", perl = TRUE)

Checking against the OP's output (assuming out1 holds the result of the strsplit(sub(...)) call from the question):

identical(out1, out2)
#[1] TRUE
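
For a self-contained version of that check, here is a minimal sketch on a few hand-written strings. The names small, by_sub and by_lookaround are illustrative and not from the original post; the patterns are the ones shown above.

```r
# Hypothetical sample strings in the same letter-digits-letter-digits format
small <- c("A123B4", "B9999C12", "A1A1")

# Reference: the question's approach (insert spaces, then split on them)
by_sub <- strsplit(
  sub(pattern = "^(A|B)([0-9]{1,4})(A|B|C)([0-9]{1,2})$",
      replacement = "\\1 \\2 \\3 \\4",
      x = small),
  split = " ",
  fixed = TRUE)

# Lookaround approach: split at each letter/digit boundary,
# matching positions only, so no characters are removed
by_lookaround <- strsplit(small,
                          "(?<=[A-C])(?=\\d)|(?<=\\d)(?=[A-C])",
                          perl = TRUE)

identical(by_sub, by_lookaround)
by_lookaround[[1]]
```

Note that the lookaround version only looks at letter/digit boundaries, so unlike the anchored pattern it does not verify that the whole string has exactly four groups.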

Or maybe, to stay closer to the groups in the original pattern:

strsplit(example, '(?<=^[A-B])(?=[0-9]{1,4})|(?<=[A-C])(?=[0-9]{1,2})|(?<=\\d)(?=[A-C])', perl = TRUE)

Or with extract from tidyr

library(tidyr)
extract(tibble(example), example, 
  into = paste0('v', 1:4), "^(A|B)([0-9]{1,4})(A|B|C)([0-9]{1,2})$", 
     convert = TRUE) 

Or with read.table from base R

read.table(text = sub(pattern = "^(A|B)([0-9]{1,4})(A|B|C)([0-9]{1,2})$", 
      replacement = "\\1 \\2 \\3 \\4", x = example), header = FALSE)

Or with strcapture from base R

strcapture("^(A|B)([0-9]{1,4})(A|B|C)([0-9]{1,2})$", 
   example, proto = list(v1 = character(), v2 = numeric(), 
     v3 = character(), v4 = numeric()))
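
As a small illustration of what strcapture returns, here is a sketch on hand-written inputs. The vector small and the result name parts are hypothetical, and the prototype is written as a data.frame (with stringsAsFactors = FALSE so the character columns stay character on older R versions) rather than the list used above.

```r
# Hypothetical inputs; the last one does not fit the pattern
small <- c("A123B4", "B9999C12", "Z1A1")

parts <- strcapture(
  "^(A|B)([0-9]{1,4})(A|B|C)([0-9]{1,2})$",
  small,
  proto = data.frame(v1 = character(), v2 = integer(),
                     v3 = character(), v4 = integer(),
                     stringsAsFactors = FALSE))

parts       # one row per input string, one column per capture group
str(parts)  # v2 and v4 are converted to integer, as requested in proto
```

Because the pattern is anchored, a string that does not match the whole pattern gives a row of NA values instead of a partial split.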
akrun
  • Thanks. It is a beautiful approach, though it isn't quite as specific as a split between predefined groups – Arnaud Feldmann Apr 22 '21 at 20:58
  • @ArnaudFeldmann when you say predefined groups, do you mean the capture groups in your code? – akrun Apr 22 '21 at 21:00
  • @ArnaudFeldmann or do you want `extract(tibble(example), example, into = paste0('v', 1:4), "^(A|B)([0-9]{1,4})(A|B|C)([0-9]{1,2})$", convert = TRUE)`? – akrun Apr 22 '21 at 21:03
  • @ArnaudFeldmann defining those groups itself takes some space, so I'm not sure how it could be a single line: `read.table(text = sub(pattern = "^(A|B)([0-9]{1,4})(A|B|C)([0-9]{1,2})$", replacement = "\\1 \\2 \\3 \\4", x = example), header = FALSE)` – akrun Apr 22 '21 at 21:04
  • The extract function from tidyr is exactly what I want as a function (I didn't know it, so thanks), though I wanted a base way to do it. Thanks anyway, that's a great addition – Arnaud Feldmann Apr 22 '21 at 21:13
  • @ArnaudFeldmann you can use `strcapture` in base R – akrun Apr 22 '21 at 21:17
  • Thanks! Problem solved then! I just did it – Arnaud Feldmann Apr 22 '21 at 21:21