Extract specific portion of a string and paste to a new column in R

Question

I have the following data frame with a string column, and I want to extract the T, N, M, G, L status (and so on) for each observation into separate new columns, including their respective prefix and suffix. I have tried the grep() and strsplit() functions, but the resulting columns have differing numbers of rows due to NA values, so it doesn't work. I'm not an expert in coding and I'd really appreciate your support for a working script. Thanks in advance.

df <- data.frame(input=c("cT1b;cN1a;cM0;G3",
        "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0",
        "cT3;cN0;M0"))

The expected output should look like

df <- data.frame(input=c("cT1b;cN1a;cM0;G3",
             "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0",
             "cT3;cN0;M0" ),
             T_output=c("cT1b","pT1a","cT3"),
             G_output=c("G3","G1",NA),
             L_output=c(NA,"L0",NA))

What is your expected output? It sounds as if this is actually a column of a larger object (perhaps a `data.frame`), it would help to know the exact structure you expect out of this, can you provide a literal `data.frame(input=c("cT1b;cN1a;cM0;G3","pT1a;pN0;cM0;G1;L0;V0;Pn0;R0"), newcolumn=c(.......))` (replacing `newcolumn` or perhaps multiple columns with what you expect from those two input values). — r2evans, Aug 02 '22 at 16:02
Dear @r2evans, I expect to have an output that should look like this including the NA's for respective rows where the respective variable is not present. Thanks a lot! data.frame(input=c("cT1b;cN1a;cM0;G3","pT1a;pN0;cM0;G1;L0;V0;Pn0;R0", "cT3;cN0;M0" ), T_status=c("cT1b", "pT1a","cT3"), G_status=c("G3", "G1", NA), L_status=c(NA, "L0", NA)) — Nikhil Kalra, Aug 02 '22 at 16:24
BTW, sorry about being late on this ... Welcome to SO, Nikhil Kalra! It's generally best to put things like that in the question itself, since comments can be skipped by readers and/or hidden by the Stack interface. Please [edit] your question and add that as a code block (see https://stackoverflow.com/editing-help and https://meta.stackexchange.com/a/22189 for formatting). Thanks! — r2evans, Aug 02 '22 at 16:25

r2evans · Answer 1 · 2022-08-02T17:28:36.480

grep is typically for finding which strings match a pattern (grepl gives TRUE/FALSE), or occasionally for returning the whole strings that contain a substring (value=TRUE), but not for extracting substrings from a whole string. For that, one might look into sub/gsub, gregexpr (together with regmatches), or stringr::str_extract/str_extract_all. However, I think that's not the best (well, certainly not the only) approach.
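For reference, the extraction route mentioned above could be sketched in base R roughly as follows. This is a minimal sketch, not the approach recommended in this answer: the helper name extract_first is hypothetical, and the pattern assumes each status is an optional lowercase prefix, the uppercase letter, then everything up to the next ";". Note that regmatches() silently drops the elements that do not match, which is the likely cause of the "differing number of rows" problem, so the result is written into a pre-filled NA vector.

```r
x <- c("cT1b;cN1a;cM0;G3",
       "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0",
       "cT3;cN0;M0")

# Hypothetical helper: return the first ";"-delimited piece containing
# `letter` (with any lowercase prefix), or NA when there is none.
extract_first <- function(x, letter) {
  pattern <- paste0("[[:lower:]]*", letter, "[^;]*")
  m <- regexpr(pattern, x)                 # -1 where there is no match
  out <- rep(NA_character_, length(x))     # same length as the input
  out[m != -1] <- regmatches(x, m)         # regmatches() returns matches only
  out
}

extract_first(x, "T")  # "cT1b" "pT1a" "cT3"
extract_first(x, "G")  # "G3"   "G1"   NA
extract_first(x, "L")  # NA     "L0"   NA
```

Because the output always has one element per input string, it can be assigned directly as a new column.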

Instead, try splitting each string on ";" and using grep within the pieces:

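library(dplyr)
dat %>%
  select(input) %>%
  mutate(
    # split each string on ";" and build one row of T/G/L matches per input
    bind_rows(lapply(
      strsplit(input, ";"),
      function(S) as.data.frame(lapply(setNames(nm = c("T", "G", "L")), 
                                # keep the pieces containing the letter; "" when none match
                                function(z) paste0(grep(pattern = z, x = S, value = TRUE), collapse = ";"))))),
    # replace empty strings (no match) with NA of the same type
    across(one_of(c("T","G","L")), ~ ifelse(nzchar(.), ., .[NA]))
  )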
#                          input    T    G    L
# 1             cT1b;cN1a;cM0;G3 cT1b   G3 <NA>
# 2 pT1a;pN0;cM0;G1;L0;V0;Pn0;R0 pT1a   G1   L0
# 3                   cT3;cN0;M0  cT3 <NA> <NA>

Note: this does nothing with the M or N substrings, which may be intentional. If you want them too, use setNames(nm=c("T","G","L","N")) (and the same vector again within one_of) to get another column for each additional uppercase letter.

Data

dat <- structure(list(input = c("cT1b;cN1a;cM0;G3", "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0", "cT3;cN0;M0")), class = "data.frame", row.names = c(NA, -3L))

Thanks a ton!! @r2evans. Works like a charm. No doubt, my problem is solved, but is there a less complicated way to achieve the same result? The code seems a little complicated for a novice like me. — Nikhil Kalra, Aug 03 '22 at 08:14
You could always put a portion of that in a user-defined function, though that doesn't reduce the complexity, it just moves it. I don't think it's that complex, though it is caught in a parenthesis-storm of sorts; some of this is to fit within class-expectations of dplyr, but most of it is because of your expected output, and that the matches can be "0 or more", so corner-cases must be addressed. You can remove the `across(..)` if you don't mind having empty strings `""` instead of `NA`, reducing the code a little — r2evans, Aug 03 '22 at 11:31
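The user-defined-function idea from the comment above might look like the following base-R sketch. It is only an illustration under stated assumptions: the name split_status and its status_letters argument are hypothetical, matching is a plain case-sensitive substring test (so "N" does not match "Pn0"), and multiple hits for one letter are pasted together with ";" as in the answer.

```r
dat <- data.frame(input = c("cT1b;cN1a;cM0;G3",
                            "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0",
                            "cT3;cN0;M0"))

# Hypothetical helper: one column per status letter, NA when absent.
split_status <- function(input, status_letters = c("T", "G", "L")) {
  pieces <- strsplit(input, ";", fixed = TRUE)
  cols <- lapply(setNames(nm = status_letters), function(letter) {
    vapply(pieces, function(p) {
      hit <- grep(letter, p, value = TRUE, fixed = TRUE)
      # no piece contains the letter: return NA instead of ""
      if (length(hit) == 0) NA_character_ else paste(hit, collapse = ";")
    }, character(1))
  })
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Add the new columns next to the original input column.
res <- cbind(dat, split_status(dat$input, c("T", "N", "M", "G", "L")))
res
```

As the comment says, this moves the complexity rather than removing it, but the call site becomes a single line, and adding another status only means adding a letter to the vector.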
