match all occurrences in data frame

Question

I'm trying to do something similar to this post: "Extract rows for the first occurrence of a variable in a data frame", but I want to extract all occurrences, not just the first.

Here is a simplified example. I have this data frame called toDrop:

Gene   Taxa
123    A
327    B
445    D
557    A
789    E
123    B
557    C

Here's my code, which uses match and thus returns only the first match. I'm running this inside a loop, so I've simplified things here.

Gene <- c("123", "327", "445", "557", "789", "123", "557")
Taxa <- c("A", "B", "D", "A", "E", "B", "C")
toDrop <- data.frame(Gene, Taxa)
Temp <- list()
geneNameTemp <- "123"
toDrop[match(geneNameTemp, toDrop$Gene), 2] -> Temp

In this example, Temp should be a list of "A" and "B". I think I need to use lapply as in that post, but I can't figure it out from that example. Thanks for the help.

DavidLukeThiessen · Accepted Answer · 2020-07-28T18:54:08.817

There are several ways to do this. One way in base R that is close to what you've already got is which() combined with %in%:

Gene <- c("123", "327", "445", "557", "789", "123", "557")
Taxa <- c("A", "B", "D", "A", "E", "B", "C")
toDrop <- data.frame(Gene, Taxa)
Temp <- list()
geneNameTemp <- "123"
Temp <- as.list(toDrop[which(toDrop$Gene %in% geneNameTemp),2])
Temp
# [[1]]
# [1] A
# Levels: A B C D E
# 
# [[2]]
# [1] B
# Levels: A B C D E

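This returns a list with the two factors. (In R 4.0.0 and later, data.frame() defaults to stringsAsFactors = FALSE, so the elements are character strings instead.) The method also works when geneNameTemp is a vector of several gene names, but the result will contain duplicates if more than one matched row has the same Taxa value: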

Gene <- c("123", "327", "445", "557", "789", "123", "557")
Taxa <- c("A", "B", "D", "A", "E", "B", "C")
toDrop <- data.frame(Gene, Taxa)
Temp <- list()
geneNameTemp <- c("123", "327")
Temp <- as.list(toDrop[which(toDrop$Gene %in% geneNameTemp),2])
Temp
# [[1]]
# [1] A
# Levels: A B C D E
# 
# [[2]]
# [1] B
# Levels: A B C D E
# 
# [[3]]
# [1] B
# Levels: A B C D E

If you only need a vector with the factors you can remove as.list(). If you want to remove the duplicates you can use unique(toDrop[which(toDrop$Gene %in% geneNameTemp),2]).
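As a minimal sketch of those two variations, the snippet below indexes with the logical vector from %in% directly (which() is optional here) and then removes duplicates. It assumes stringsAsFactors = FALSE so the result is a plain character vector, and the name hits is only illustrative.

```r
# Same example data as above, kept as character columns
toDrop <- data.frame(
  Gene = c("123", "327", "445", "557", "789", "123", "557"),
  Taxa = c("A", "B", "D", "A", "E", "B", "C"),
  stringsAsFactors = FALSE
)
geneNameTemp <- c("123", "327")

# %in% returns one TRUE/FALSE per row, so every matching row is kept,
# unlike match(), which returns only the position of the first match
hits <- toDrop$Taxa[toDrop$Gene %in% geneNameTemp]
hits          # "A" "B" "B"

# Drop repeated Taxa values
unique(hits)  # "A" "B"
```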

Excellent! This is what I was looking for. One edit though: it should read ```toDrop[which(toDrop$Gene %in% geneNameTemp),2]```. You need the data frame before Gene. Thanks so much. – KNN Apr 17 '20 at 17:35

linog · Answer 2 · 2020-04-16T21:59:07.290


Gene <- c("123", "327", "445", "557", "789", "123", "557")
Taxa <- c("A", "B", "D", "A", "E", "B", "C")
toDrop <- data.frame(Gene, Taxa, stringsAsFactors = FALSE)

There are many ways to do that. With data.table it is easy to split a data frame by the values of a column and return a list. Since you are only interested in the Taxa column, you can do the following:

library(data.table)
lapply(
  split(setDT(toDrop), by = "Gene"), function(d) d[['Taxa']]
)

$`123`
[1] "A" "B"

$`327`
[1] "B"

$`445`
[1] "D"

$`557`
[1] "A" "C"

$`789`
[1] "E"
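A list like the one above can be indexed by gene name, which may be convenient inside a loop. The sketch below uses base R split() rather than data.table, and the names taxaByGene and genesToCheck are hypothetical, chosen only for illustration.

```r
toDrop <- data.frame(
  Gene = c("123", "327", "445", "557", "789", "123", "557"),
  Taxa = c("A", "B", "D", "A", "E", "B", "C"),
  stringsAsFactors = FALSE
)

# Group Taxa by Gene once; the result is a named list keyed by gene
taxaByGene <- split(toDrop$Taxa, toDrop$Gene)

# Look up all Taxa for a single gene
geneNameTemp <- "123"
Temp <- taxaByGene[[geneNameTemp]]
Temp  # "A" "B"

# The same lookup inside a loop over several genes
genesToCheck <- c("123", "557")
for (g in genesToCheck) {
  cat(g, ":", taxaByGene[[g]], "\n")
}
```

Building the list once and indexing it avoids rescanning the Gene column on every iteration.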

edited Apr 16 '20 at 21:59

answered Apr 16 '20 at 18:24


Okay, but I'm still confused about how to use your answer in a loop using geneNameTemp as in my example. – KNN Apr 16 '20 at 18:33
I don't know, since you did not give a reproducible example of what you want to do afterwards. "Making a subset afterwards" is not precise enough for me to help you. What is the structure of the other data frame? – linog Apr 16 '20 at 18:40
Downstream steps should not matter for the question as written. Given the data frame provided (toDrop), how can I use geneNameTemp to match against toDrop$Gene and get the list of "A" and "B", not just the first occurrence as with match? – KNN Apr 16 '20 at 18:44
