Populating variable values in R dataframe based on string matching

Question

I have the following dataframe in R:

df <- data.frame(Sample_name = c("01_00H_NA_DNA",   "01_00H_NA_RNA",    "01_00H_NA_S",  "01_00H_NW_DNA",    "01_00H_NW_RNA",    "01_00H_NW_S",  "01_00H_OM_DNA",    "01_00H_OM_RNA",    "01_00H_OM_S",  "01_00H_RL_DNA",    "01_00H_RL_RNA",    "01_00H_RL_S"),
             Pair = c("","", "S1","","","S2","","","S3","", "","S5"))

I would like to generate a new variable Label such that similar strings in Sample_name until the last _ before DNA/RNA or S get matched to give a similar label Id number. While each row may not start with 01_00H, there will always be similar strings until the last underscore to group for the label variable.

Furthermore, I would like to also fill the pair variable with similar values, S1 for all identical labels and so on. The existing Pair values are not continuous i.e S3 is followed by S5 and so on.

Resulting dataframe will look something like this:

This has been incredibly hard to do, I followed How to create new column in dataframe based on partial string matching other column in R but it helped me only partially for direct 1:1 renaming.

Any solutions from useRs will be much appreciated, Thanks!

I'm not entirely sure if I understand what you wish to be matched - does `factor(tmp <- sub("(^.+)_(DNA|RNA|S)$", "\\1", df$Sample_name), labels=seq_along(unique(tmp)))` for instance work for your real data? — thelatemail, Mar 21 '17 at 01:32
Hi! Thanks, indeed the solution you provided works to create a new variable `Label` in my df How do I complete the second part of the question, where all identical labels get a corresponding `Pair` value (i.e the missing rows in `Pair` get the same `Pair` ID S1 or S2 or S3 as the one available value for that group in the original df? — Manasi Shah, Mar 21 '17 at 01:37

r2evans · Accepted Answer · 2017-03-21T13:56:49.987

1

Try this:

df$x <- gsub("_[^_]+$", "", df$Sample_name)
df$Label <- match(df$x, unique(df$x))
df$Pair <- ave(as.character(df$Pair), df$Label, FUN=max)
df$x <- NULL
df
#      Sample_name Pair Label
# 1  01_00H_NA_DNA   S1     1
# 2  01_00H_NA_RNA   S1     1
# 3    01_00H_NA_S   S1     1
# 4  01_00H_NW_DNA   S2     2
# 5  01_00H_NW_RNA   S2     2
# 6    01_00H_NW_S   S2     2
# 7  01_00H_OM_DNA   S3     3
# 8  01_00H_OM_RNA   S3     3
# 9    01_00H_OM_S   S3     3
# 10 01_00H_RL_DNA   S5     4
# 11 01_00H_RL_RNA   S5     4
# 12   01_00H_RL_S   S5     4

Or using dplyr:

library(dplyr)
df %>%
  mutate(
    x = gsub("_[^_]+$", "", Sample_name),
    Label = match(x, unique(x))
  ) %>%
  select(-x) %>%
  group_by(Label) %>%
  mutate(Pair = paste0(Pair, collapse = "")) %>%
  ungroup()
# # A tibble: 12 × 3
#      Sample_name  Pair Label
#           <fctr> <chr> <int>
# 1  01_00H_NA_DNA    S1     1
# 2  01_00H_NA_RNA    S1     1
# 3    01_00H_NA_S    S1     1
# 4  01_00H_NW_DNA    S2     2
# 5  01_00H_NW_RNA    S2     2
# 6    01_00H_NW_S    S2     2
# 7  01_00H_OM_DNA    S3     3
# 8  01_00H_OM_RNA    S3     3
# 9    01_00H_OM_S    S3     3
# 10 01_00H_RL_DNA    S5     4
# 11 01_00H_RL_RNA    S5     4
# 12   01_00H_RL_S    S5     4

Edit: added @thelatemail's use of ave, better by codegolf and stability.

edited Mar 21 '17 at 13:56

answered Mar 21 '17 at 01:39

r2evans

141,215
6
77
149

1

`ave` will be safer than `unlist(by(...`, which I think relies on the dataset being ordered. – thelatemail Mar 21 '17 at 01:47
Personally, I prefer the `dplyr` solution. I'm not as fluent with `ave` ... – r2evans Mar 21 '17 at 02:07
1

Just `ave(as.character(df$Pair), df$Label, FUN=max)` will do it. – thelatemail Mar 21 '17 at 02:20
Hi, thanks, I upvoted and accepted your answer but have a follow up question. What if I were to generate another variable where from the original variable value `01_00H_NA_DNA` I now want to make a new variable based on combining similar 1st and 3rd parts of the above string i.e all the `01_NA` values be called fix1 in a new variable called `new`. How would the `gsub` expression change for that? Thanks! – Manasi Shah Mar 23 '17 at 18:24
If you are certain that the strings will always have the same structure of 3+ groups separated by underscore, then you can use `mutate(df, name2 = gsub("^([^_]+)_[^_]+_([^_]+)_?.*$", "\\1_\\2", name))`. (The last portion, `_?.*$`, handles a third or more underscore, leading to a fourth or more groups.) An alternative is to use `strsplit`, but that takes some more work since it returns lists, and is much less direct than this and is therefore a bit slower. – r2evans Mar 24 '17 at 16:45

Populating variable values in R dataframe based on string matching

1 Answers1