I have two lists taken from columns of a data frame (>= 300 000 entries). The first list gives the unique identifier of each record in a data set, and the second gives the identifier each record is linked to, taken from the first list, e.g.

ids <- list("a1", "a2", "a3", "a4", "a5", "a6")
linked.ids <- list("a1", "a1", "a3", "a4", "a4", "a4")

I need to create a third list (linked.flag), to be appended to the data frame, that is blank if the corresponding entry of linked.ids occurs only once in linked.ids (i.e. it is linked only with itself) and contains "Linked" if it occurs more than once. In the above example, the desired outcome would be

linked.flag <- list("Linked", "Linked", "", "Linked", "Linked", "Linked")

I am searching for an efficient way to perform this operation. Here is my current solution:

library("stringr")

ids <- c("a1", "a2", "a3", "a4", "a5", "a6")
linked.ids <- c("a1", "a1", "a3", "a4", "a4", "a4")

indices <- 1:length(ids)

count.matches <- function(i1, i2) sum(str_count(linked.ids[i1], linked.ids[i2]))
counts <- sapply(indices, FUN = function(x2) sapply(indices, function(x1) count.matches(x1, x2)))
counts <- rowSums(counts)

assign.flag <- function(x) if(counts[x] > 1){"Linked"}else{""}
linked.flag <- sapply(indices, FUN=assign.flag)

df <- data.frame(IDs = ids, Links = linked.ids, LinkFlag = linked.flag)

which gives as output

  IDs Links LinkFlag
1  a1    a1   Linked
2  a2    a1   Linked
3  a3    a3         
4  a4    a4   Linked
5  a5    a4   Linked
6  a6    a4   Linked

My current solution is adapted from the accepted answer to the question "R count times word appears in element of list".

I am relatively new to R, and would be grateful for a more efficient solution (coding style suggestions also appreciated).

Thank you!

Pavel L
  • A `data.table` solution: `library(data.table); dt <- data.table(IDs=unlist(ids), Links=unlist(linked.ids), key="Links"); dt[dt[, .N, by=Links][, LinkFlag:=ifelse(N>1, "linked", "")][, list(Links, LinkFlag)]]`. However, I dunno if `ifelse` kills the performance here. – lukeA Feb 10 '14 at 11:53
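
The one-liner in the comment above can be unpacked into a more readable form. The following is a minimal sketch, assuming the data.table package is installed and that the identifiers are plain character vectors (as in the question's own solution) rather than lists. It replaces the join and `ifelse` from the comment with a grouped assignment, which is a different technique, and it uses "Linked" as in the question.

```r
library(data.table)

# Example data from the question, as character vectors (assumption:
# the real columns are character, not list, columns).
ids        <- c("a1", "a2", "a3", "a4", "a5", "a6")
linked.ids <- c("a1", "a1", "a3", "a4", "a4", "a4")

dt <- data.table(IDs = ids, Links = linked.ids)

# .N is the number of rows in the current group. The scalar result is
# recycled to every row of the group and assigned by reference.
dt[, LinkFlag := if (.N > 1L) "Linked" else "", by = Links]

print(dt)
```

Because no key is set, the rows stay in their original order, and the `if`/`else` is evaluated once per group rather than once per row.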

2 Answers

1

Here's a way to create the data frame:

within(data.frame(IDs = unlist(ids),
                  Links = unlist(ids[match(linked.ids, ids)])),
       LinkFlag <- ave(seq_along(Links), Links, FUN = function(x)
         if(length(x) > 1) "Linked" else ""))


  IDs Links LinkFlag
1  a1    a1   Linked
2  a2    a1   Linked
3  a3    a3         
4  a4    a4   Linked
5  a5    a4   Linked
6  a6    a4   Linked
Sven Hohenstein
1
ids <- c("a1", "a2", "a3", "a4", "a5", "a6")
linked.ids <- c("a1", "a1", "a3", "a4", "a4", "a4")

count = table(linked.ids) > 1
linked.flag = rep("", length(ids))
linked.flag[linked.ids %in% names(count[count])] = "Linked"
df <- data.frame(IDs = ids, Links = linked.ids, LinkFlag = linked.flag)
Baumann
  • Both solutions mentioned thus far are much better than my implementation, but in my actual data set, there were 9 entries (out of 300 000) for which Baumann's solution correctly identified a linked event, while Sven Hohenstein's did not. – Pavel L Feb 10 '14 at 14:42
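
A discrepancy like the one described in this comment can be narrowed down by computing the flag in two independent ways and listing the rows where the results differ. The sketch below is only an illustration on the example data from the question. The names `flag.table`, `flag.dup` and `disagree` are made up for this example, and the thread does not say what caused the nine mismatches.

```r
ids        <- c("a1", "a2", "a3", "a4", "a5", "a6")
linked.ids <- c("a1", "a1", "a3", "a4", "a4", "a4")

# Method 1: count occurrences with table(), as in Baumann's answer.
counts     <- table(linked.ids)
flag.table <- ifelse(linked.ids %in% names(counts)[counts > 1], "Linked", "")

# Method 2: a value is linked if it is duplicated when scanning from
# either end of the vector.
is.dup   <- duplicated(linked.ids) | duplicated(linked.ids, fromLast = TRUE)
flag.dup <- ifelse(is.dup, "Linked", "")

# Rows on which the two methods disagree (none for this example).
disagree <- which(flag.table != flag.dup)
print(data.frame(IDs = ids, Links = linked.ids)[disagree, ])
```

On real data, the rows listed in `disagree` are the ones worth inspecting by hand. Note that the two methods treat missing values differently: `table()` drops `NA` by default, while `duplicated()` treats repeated `NA`s as duplicates of each other.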