I have two lists coming from columns in a dataframe (>=300 000 entries), with the first list giving a unique identifier from a data set, and the second list giving linked identifiers taken from the first list, e.g.
ids <- list("a1", "a2", "a3", "a4", "a5", "a6")
linked.ids <- list("a1", "a1", "a3", "a4", "a4", "a4")
I need to create a third list (linked.flag) to be appended to the dataframe. An entry should be blank if the corresponding entry from linked.ids has only one match in the list linked.ids (i.e. it is only linked with itself), and "Linked" if there is more than one match in linked.ids. In the above example, the desired outcome would be
linked.flag <- list("Linked", "Linked", "", "Linked", "Linked", "Linked")
I am searching for an efficient way to perform this operation. Here is my current solution:
library("stringr")
ids <- c("a1", "a2", "a3", "a4", "a5", "a6")
linked.ids <- c("a1", "a1", "a3", "a4", "a4", "a4")
indices <- seq_along(ids)
# Number of times linked.ids[i2] occurs (as a regex pattern) in linked.ids[i1]
count.matches <- function(i1, i2) sum(str_count(linked.ids[i1], linked.ids[i2]))
# Pairwise comparison of every entry against every entry (n x n matrix)
counts <- sapply(indices, FUN = function(x2) sapply(indices, function(x1) count.matches(x1, x2)))
# Total number of matches for each entry
counts <- rowSums(counts)
assign.flag <- function(x) if (counts[x] > 1) "Linked" else ""
linked.flag <- sapply(indices, FUN = assign.flag)
df <- data.frame(IDs = ids, Links = linked.ids, LinkFlag = linked.flag)
which gives as output
  IDs Links LinkFlag
1  a1    a1   Linked
2  a2    a1   Linked
3  a3    a3
4  a4    a4   Linked
5  a5    a4   Linked
6  a6    a4   Linked
My current solution is an adaptation of the accepted answer to the question "R count times word appears in element of list".
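One vectorised direction that might avoid the pairwise comparison is sketched below. It assumes that a "match" means exact string equality, which is not quite what str_count does (it treats the second argument as a regular expression, so "a1" would also match inside "a10"). The is.linked name is only an illustrative helper, and this has not been benchmarked on the full 300 000 rows.

```r
ids <- c("a1", "a2", "a3", "a4", "a5", "a6")
linked.ids <- c("a1", "a1", "a3", "a4", "a4", "a4")

# A value occurs more than once exactly when it duplicates an earlier element
# or a later one; fromLast = TRUE is what catches the first occurrence.
is.linked <- duplicated(linked.ids) | duplicated(linked.ids, fromLast = TRUE)

# Map TRUE/FALSE onto the flag text in one vectorised step.
linked.flag <- ifelse(is.linked, "Linked", "")

df <- data.frame(IDs = ids, Links = linked.ids, LinkFlag = linked.flag)
```

Both duplicated() and ifelse() are base R functions that work on the whole vector at once, so this avoids building the n x n matrix of comparisons.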
I am relatively new to R, and would be grateful for a more efficient solution (coding style suggestions also appreciated).
Thank you!