1

I have the following dataframe, and I need to manipulate column a to get to column a_clean:

df <- data.frame(
  a = c("1234-12;23456-123", "12345-1234", NA, "1234-013;1234-014"),
  a_clean = c("01234-0012;23456-0123", "12345-1234", NA, "1234-0013;1234-0014")
)

I need to zero-pad the numbers before each hyphen to five digits and the numbers after each hyphen to four digits.

I don't want to split a into separate rows and then concatenate them back together. My dataframe is very big, and I want the string manipulation to be as fast as possible.

Wiktor Stribiżew
Ashti

2 Answers

1

gsubfn is like gsub, except that the replacement argument can be a function that receives the capture groups (the matches to the parenthesized portions of the regular expression) as separate arguments. Each entire match is then replaced with the function's output. Here the regular expression matches each pair of digit strings around a hyphen and passes them as x and y to the function, expressed in formula notation, which converts them to numeric so that sprintf can pad them with leading zeros.

If you are using dplyr, replace transform with mutate.

library(gsubfn)

transform(df, clean = 
  gsubfn("(\\d+)-(\\d+)", ~ sprintf("%05d-%04d", as.numeric(x), as.numeric(y)), a))

giving

                  a               a_clean                 clean
1 1234-12;23456-123 01234-0012;23456-0123 01234-0012;23456-0123
2        12345-1234            12345-1234            12345-1234
3              <NA>                  <NA>                    NA
4 1234-013;1234-014   1234-0013;1234-0014 01234-0013;01234-0014
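
To see the padding step on its own, here is a minimal base R sketch that applies the same sprintf format to one pair of numbers. The values come from the first entry of row 1 in df, and the variable names left, right and padded are made up for this illustration only.

```r
# Illustrative values: the first "digits-digits" pair in row 1 of df
left  <- as.numeric("1234")
right <- as.numeric("12")

# %05d zero-pads to five digits, %04d zero-pads to four digits
padded <- sprintf("%05d-%04d", left, right)
print(padded)
```

In the answer above, gsubfn applies this same format to every match of the pattern within each string.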
G. Grothendieck
0

A base R solution: use strsplit to split on ;, then gsub to extract the strings on either side of the -, replace the NAs, and finally use paste with Map to construct the result.

data.frame(df, a_clean_new = unlist(Map(paste, collapse=";", 
  lapply(strsplit(df$a, ";"), function(x){
    res <- paste0(sprintf("%05d", as.numeric(gsub("-.*", "", x))), "-", 
             sprintf("%04d", as.numeric(gsub(".*-", "", x))))
    replace(res, grep("NA", res), NA)}))))

giving

                  a               a_clean           a_clean_new
1 1234-12;23456-123 01234-0012;23456-0123 01234-0012;23456-0123
2        12345-1234            12345-1234            12345-1234
3              <NA>                  <NA>                    NA
4 1234-013;1234-014   1234-0013;1234-0014 01234-0013;01234-0014
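
As a small sketch of what the two gsub calls do to the pieces produced by strsplit, the snippet below runs them on the split form of the first row. The names x, prefix and suffix are chosen here for illustration and do not appear in the answer's code.

```r
# Pieces of row 1 after splitting on ";"
x <- strsplit("1234-12;23456-123", ";")[[1]]

# Drop everything from the hyphen onward, keeping the leading digits
prefix <- gsub("-.*", "", x)

# Drop everything up to and including the hyphen, keeping the trailing digits
suffix <- gsub(".*-", "", x)

print(prefix)
print(suffix)
```

Each part is then converted with as.numeric and padded with sprintf before the pieces are pasted back together.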
Andre Wildberg