Example:

 df <- data.frame(Name = c("J*120_234_458_28", "Z*23_205_a834_306", "H*_39_004_204_99_04902"))

I would like to be able to select everything before the third underscore for each row in the dataframe. I understand how to split the string apart:

df$New <- sapply(strsplit((df$Name),"_"), `[`)

But this places a list in each row. I've thus far been unable to figure out how to use sapply to unlist() each row of df$New, select the first N elements of the list, and paste/collapse them back together. Because the length of each subelement can differ, and the number of subelements can also differ, I haven't been able to figure out an alternative way of getting this info.

thelatemail
Jay

1 Answer

Set 'n', split the character column on '_', and extract the first n - 1 components of each element:

 n <- 4
 lapply(strsplit(as.character(df$Name), "_"), `[`, seq_len(n - 1))
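
One caveat with the `[` approach, shown as a small sketch: indexing past the end of a vector pads the result with NA, whereas head() returns only what is there. The input "a_b" below is a made-up example, not one of the strings from the question.

```r
# Hypothetical input with only two components (not from the question)
x <- strsplit("a_b", "_")[[1]]
n <- 4

x[seq_len(n - 1)]   # "a" "b" NA  (padded to length n - 1)
head(x, n - 1)      # "a" "b"     (no padding)
```

This only matters if some rows have fewer than n - 1 components; for the data in the question both forms give the same pieces.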

If we need to paste the pieces back together, loop over the list with lapply/sapply using an anonymous function (function(x)), take the first n - 1 elements with head, and paste them together with collapse:

sapply(strsplit(as.character(df$Name), "_"), function(x) 
          paste(head(x, n - 1), collapse="_"))
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"   
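
The same steps can be wrapped in a small helper and assigned straight back to the data frame. This is only a sketch: the name `first_k_fields` and its arguments are hypothetical, and it assumes the separator should be matched literally rather than as a regular expression.

```r
# Hypothetical helper: keep the first k fields, i.e. everything before
# the k-th occurrence of `sep`. `sep` is matched literally (fixed = TRUE).
first_k_fields <- function(x, k, sep = "_") {
  parts <- strsplit(as.character(x), sep, fixed = TRUE)
  # vapply guarantees one character string per input element
  vapply(parts, function(p) paste(head(p, k), collapse = sep), character(1))
}

df <- data.frame(Name = c("J*120_234_458_28", "Z*23_205_a834_306",
                          "H*_39_004_204_99_04902"))
df$New <- first_k_fields(df$Name, 3)
df$New
# "J*120_234_458" "Z*23_205_a834" "H*_39_004"
```

Using vapply instead of sapply is a design choice here: it fails loudly if the function ever returns something other than a single string, so df$New cannot silently become a list column.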

Or use a regex method:

sub("^([^_]+_[^_]+_[^_]+)_.*", "\\1", df$Name)
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004" 

Or, if 'n' is large, build the pattern with sprintf instead of writing out every group by hand:

pat <- sprintf("^(([^_]+_){%d}[^_]+)_.*", n - 2)
sub(pat, "\\1", df$Name)

Or

sub("^(([^_]+_){2}[^_]+)_.*", "\\1", df$Name)
#[1] "J*120_234_458" "Z*23_205_a834" "H*_39_004"    
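
As a quick check on the regex route, the sketch below builds the repeat count from the number of fields to keep and compares the result with the strsplit approach. The names `k`, `via_regex` and `via_split` are assumptions introduced for illustration; `k` is 3 here.

```r
x <- c("J*120_234_458_28", "Z*23_205_a834_306", "H*_39_004_204_99_04902")
k <- 3  # number of fields to keep (assumed name)

# k - 1 repeats of "field_" followed by one final field; the rest is dropped
pat <- sprintf("^(([^_]+_){%d}[^_]+)_.*", k - 1)
pat
# "^(([^_]+_){2}[^_]+)_.*"

via_regex <- sub(pat, "\\1", x)
via_split <- sapply(strsplit(x, "_"),
                    function(p) paste(head(p, k), collapse = "_"))
identical(via_regex, via_split)
# TRUE
```

Note that a string with fewer than k underscores does not match the pattern, so sub() returns it unchanged.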
akrun
  • Thank you, this is what I needed (still have a timer preventing me from accepting the answer). For some reason I have such an issue wrapping my head around complex uses of sapply and lapply. – Jay Mar 02 '20 at 23:05