R - Identifying only strings ending with A and B in a column

Question

I have a column in a data frame in R that contains sample names. Some names are identical except that they end in A or B, and some samples are repeated, like this:

df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B", "S_038A", "S_040_B", "S_026B", "S_38A"))

What I am trying to do is isolate all sample names that appear with both an A and a B at the end, and exclude the sample names that appear with only an A or only a B.

The result I'm looking for would be "S_026" and "S_028", as these are the only names that appear with both A and B at the end.

All I seem to find is how to remove duplicates, and removing duplicates would only give me "S_026B" and "S_38A" in this case.

Alternatively, I have tried stripping the A's and B's from the end and then counting how many times each of the resulting names occurs (keeping those with a count > 2), but again, this does not give me the desired result.

Any suggestions?

score 1 · Accepted Answer · answered Aug 07 '21 at 23:28

We could group by the sample name without its last character (using substr), get the last character of each name with substring, and keep only the groups where both 'A' and 'B' occur among those last characters:

library(dplyr)
df %>% 
   group_by(grp = substr(Samples, 1, nchar(Samples)-1)) %>% 
   filter(all(c("A", "B") %in% substring(Samples, nchar(Samples)))) %>% 
   ungroup %>% 
   select(-grp)

Output:

# A tibble: 5 x 1
  Samples
  <chr>  
1 S_026A 
2 S_026B 
3 S_028A 
4 S_028B 
5 S_026B

akrun

Thank you very much - you've just saved me many hours of work. However, I should add that S_026B appears twice in the output, and the same happens with my data. To remedy this, I am adding `sort(unique(Samples))` at the end of the code to get just the names that appear with both A and B at the end. – Purrsia Aug 07 '21 at 23:51
Is there a way, however, to only get the answers of S_026 and S_028 and not this list of just A's and B's? – Purrsia Aug 07 '21 at 23:57
@Purrsia Just remove the `select(-grp)` and use `%>% distinct(grp)` – akrun Aug 08 '21 at 18:58
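
Putting that comment together with the pipeline from the answer, a minimal sketch might look like the following. The variable name `paired` and the `stringsAsFactors = FALSE` argument are additions for illustration (the latter only matters on R versions before 4.0, where strings in a data frame default to factors); the rest follows the answer as posted.

```r
library(dplyr)

# Example data from the question; stringsAsFactors = FALSE is an added
# assumption so that nchar() sees character values on older R versions.
df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B",
                             "S_038A", "S_040_B", "S_026B", "S_38A"),
                 stringsAsFactors = FALSE)

paired <- df %>%
  # group by the sample name without its last character
  group_by(grp = substr(Samples, 1, nchar(Samples) - 1)) %>%
  # keep groups whose last characters include both "A" and "B"
  filter(all(c("A", "B") %in% substring(Samples, nchar(Samples)))) %>%
  ungroup() %>%
  # one row per group name instead of one row per sample
  distinct(grp)

paired$grp
```

Because `distinct(grp)` collapses each group to a single row, the repeated S_026B no longer shows up twice.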

score 1 · Answer 2 · answered Aug 08 '21 at 05:09

You can use extract from tidyr to split Samples into two columns, one holding the name without its last character and one holding the last character, then keep only the groups that have both 'A' and 'B' and keep only the unique values.

library(dplyr)
library(tidyr)

df %>%
  extract(Samples, c('value', 'last'), '(.*)(.)') %>%
  group_by(value) %>%
  filter(all(c('A', 'B') %in% last)) %>%
  ungroup %>%
  distinct(value)

#  value
#  <chr>
#1 S_026
#2 S_028
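
For comparison, the same grouping idea can be sketched in base R without dplyr or tidyr. This is only an illustrative alternative, not something proposed in the answers above; the names `stem`, `suffix`, `has_both` and `paired` are made up for the example.

```r
# Example data from the question; stringsAsFactors = FALSE is an added
# assumption so the code also works on R versions before 4.0.
df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B",
                             "S_038A", "S_040_B", "S_026B", "S_38A"),
                 stringsAsFactors = FALSE)

n      <- nchar(df$Samples)
stem   <- substr(df$Samples, 1, n - 1)  # name without its last character
suffix <- substr(df$Samples, n, n)      # last character only

# For each stem, check whether its suffixes include both "A" and "B"
has_both <- vapply(split(suffix, stem),
                   function(s) all(c("A", "B") %in% s),
                   logical(1))

# Names of the stems that passed the check
paired <- names(has_both)[has_both]
paired
```

Since `split()` yields one list element per distinct stem, repeated samples such as S_026B do not produce repeated results here.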
