3

EDIT

Thank you for your replies. However, I still haven't managed to work out my problem, as my dataset contains 700,000 obeservations, and all the approaches below results in error, or simply continues to run for hours without finishing (I can tell that Rstudio R Session is running and consuming a lot of my RAM, but it simply isn't going anywhere).

As you can imagine, splitting up the dataset in smaller pieces is not an option, since it would defeat the purpose of the exercise: I need to look through every single previous observation to get out the desired result.

Any ideas? I will leave the question unanswered for now, but if you guys think I should post a new question, I will (I honestly don't know the etiquette regarding these things, so be free to leave a suggestion).


Original post

As the title suggest, I am looking for a dummy variable, which is conditioned on repetitions in grouped observations.

Consider the following dataframe:

   id name year
1   c   af 2000
2   c   el 2000
3   c   in 2000
4   c   ud 2000
5   d   ot 2000
6   d   an 2000
7   d   el 2000
8   d   un 2000
9   f   yt 2002
10  f   ip 2002
11  f   ot 2002
12  f   el 2002
13  g   yt 2003
14  g   af 2003
15  g   ol 2003
16  g   in 2003
17  h   in 2003
18  h   eg 2003
19  h   yt 2003
20  h   af 2003
21  j   ot 2004
22  j   el 2004
23  j   ip 2004
24  j   yt 2004

I am looking for a function, that will allow me to group the data by id, and return the value "1" if an id contains at least three names have occurred in a previous id. By previous id, I mean that the year of the previous id has to be less than for the current id.

The desired output should look like this:

   id name year dummy
1   c   af 2000     0
2   c   el 2000     0
3   c   in 2000     0
4   c   ud 2000     0
5   d   ot 2000     0
6   d   an 2000     0
7   d   el 2000     0
8   d   un 2000     0
9   f   yt 2002     0
10  f   ip 2002     0
11  f   ot 2002     0
12  f   el 2002     0
13  g   yt 2003     0
14  g   af 2003     0
15  g   ol 2003     0
16  g   in 2003     0
17  h   in 2003     0
18  h   eg 2003     0
19  h   yt 2003     0
20  h   af 2003     0
21  j   ot 2004     1
22  j   el 2004     1
23  j   ip 2004     1
24  j   yt 2004     1

id = "j" takes on the value dummy = "1", as at least three names, "yt", "ip" and "ot", occurs in id = "f". In this case, there was also a fourth name that also occurred, "el", but this does not affect the result.

Notice that id = "h" takes on the value dummy = "0", even though three names also occurred in id = "g". This is because both occurrences happened in 2003, and so it does not fulfill the condition of separate years.

Data:

DF = structure(list(id = c("c", "c", "c", "c", "d", "d", "d", "d", 
"f", "f", "f", "f", "g", "g", "g", "g", "h", "h", "h", "h", "j", 
"j", "j", "j"), name = c("af", "el", "in", "ud", "ot", "an", 
"el", "un", "yt", "ip", "ot", "el", "yt", "af", "ol", "in", "in", 
"eg", "yt", "af", "ot", "el", "ip", "yt"), year = c(2000L, 2000L, 
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2002L, 2002L, 2002L, 
2002L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 
2004L, 2004L, 2004L, 2004L), dummy = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 
1L, 1L)), .Names = c("id", "name", "year", "dummy"), row.names = c(NA, 
-24L), class = "data.frame")
Lucas E
  • 105
  • 2
  • 9
  • for id="h" and year=2003, names "in" and "af" occurred in id="c" and year=2000 and name "yt" occurred in id="f" and year=2002, shouldn't dummy be 1 then? – chinsoon12 Oct 04 '18 at 01:42
  • 1
    No, the names have to occur in the same id, otherwise the condition of an id having *at least* three identical names has not been fulfilled. – Lucas E Oct 06 '18 at 19:55

6 Answers6

3

An approach in base R:

n <- split(DF$name, DF$id)
m1 <- sapply(n, function(s1) sapply(n, function(s2) sum(s1 %in% s2) ))
diag(m1) <- 0
m1[upper.tri(m1)] <- 0
r1 <- rownames(m1)[!!rowSums(m1 > 2)]

y <- sapply(split(DF$year, DF$id), unique)
m2 <- sapply(y, function(s1) sapply(y, function(s2) +(s1 == s2) ))
diag(m2) <- 0
m2[upper.tri(m2)] <- 0
r2 <- rownames(m2)[!rowSums(m2)]

DF$dummy2 <- as.integer(DF$id %in% intersect(r1,r2))

which gives:

> DF
   id name year dummy dummy2
1   c   af 2000     0      0
2   c   el 2000     0      0
3   c   in 2000     0      0
4   c   ud 2000     0      0
5   d   ot 2000     0      0
6   d   an 2000     0      0
7   d   el 2000     0      0
8   d   un 2000     0      0
9   f   yt 2002     0      0
10  f   ip 2002     0      0
11  f   ot 2002     0      0
12  f   el 2002     0      0
13  g   yt 2003     0      0
14  g   af 2003     0      0
15  g   ol 2003     0      0
16  g   in 2003     0      0
17  h   in 2003     0      0
18  h   eg 2003     0      0
19  h   yt 2003     0      0
20  h   af 2003     0      0
21  j   ot 2004     1      1
22  j   el 2004     1      1
23  j   ip 2004     1      1
24  j   yt 2004     1      1
Jaap
  • 81,064
  • 34
  • 182
  • 193
2

Similar to Jaap's and see24's, but using length(intersect(x,y)) instead of ==/%in% with rowSums/sum:

library(data.table)
setDT(DF)
idDT = unique(DF[, .(id, year)])
setkey(idDT, id)

s = split(DF$name, DF$id)

# identify pairs of ids, where id1 appears before id2 in the table
pairsDT = idDT[, CJ(id1 = id, id2 = id)[id1 < id2]]

# record whether it's strictly before 
pairsDT[, earlier := idDT[id1, x.year] < idDT[id2, x.year]]

# if it's strictly before, compare number of matching elements
pairsDT[earlier == TRUE, matched := 
  mapply(function(x, y) length(intersect(x, y)), s[id1], s[id2]) >= 3
]

dum_ids = pairsDT[matched == TRUE, unique(id2)]

Then you can record the criterion in idDT (where it would make more sense) or DF:

idDT[, dum := id %in% dum_ids]
DF[, dum := id %in% dum_ids]

In base R, something similar could be done using combn. I guess this is still pretty inefficient compared to just storing the data in a graph (eg, with the igraph package) and working from there.

Frank
  • 66,179
  • 8
  • 96
  • 180
1

Here is my solution using dplyr and tidyr and a function to identify the ids with 3 or more matching names:

library(dplyr)
library(tidyr)

test <- function(x){
  out2 <- sapply(1:length(x), function(j){
    out <- sapply(1:j, function(i){
      sum(x[[j]] %in% x[[i]])
    })
    out[j]<-NA
    which(out >= 3) %>% min() %>% {ifelse(is.infinite(.),NA,.)}

  })
  out2
}

DF2 <-  DF %>% group_by(id, year) %>% 
  summarise(names = list(name)) %>% ungroup() %>% 
  mutate(dummy2 = test(names)) %>% 
  mutate(year_mch = year[dummy2], 
         dummy = year_mch < year) %>% 
  unnest() 
DF2

It gives a bunch of warnings due to infinite values but that doesn't affect the outcome.

see24
  • 1,097
  • 10
  • 21
1

So this solution is pure base R. I once read an article that claimed that using . <- is a valid replacement for %>%. This is the first time I ever tried it. I think I like it

. <- DF[c('id', 'name', 'year')]
. <- merge(., ., by = 'name')
. <- .[.["id.x"] != .["id.y"] & .["year.x"] < .["year.y"],]
. <- .[c('id.x', 'id.y', 'year.x', 'year.y', "name")] 
.$n <- 1
. <- aggregate(n ~ id.x + id.y, data = ., sum) 
. <- .[.['n'] >= 3, 'id.y']

DF$dummy2 <- . == DF$id
Beemyfriend
  • 261
  • 1
  • 6
1

After OP's edit about speed and memory issues, how about a Rcpp approach:

#create a integer column out of id for non-equi join later
setDT(DF)[, nid := rleid(id)]

#convert name into an integer code
DF[DF[,.(name=unique(name))][, IntCode := .I], iname := IntCode, on=.(name)]

library(inline)
library(Rcpp)
cppFunction('
NumericVector hasOccur(NumericVector nid, NumericVector year, List iname) {
    List namelist(iname);
    int sz = namelist.size(), i, j, m, n, nPrev, nCurr, count;
    NumericVector res(sz);

    for(i=0; i<sz; i++) {
        for(j=0; j<i; j++) {
            if (nid[j] < nid[i] && year[j] < year[i]) {
                SEXP prevList = namelist[j];
                SEXP currList = namelist[i];

                NumericVector cl(currList);
                NumericVector pl(prevList);
                nPrev = pl.size();
                nCurr = cl.size();

                res[i] = 0;
                count = 0;
                for(m=0; m<nCurr; m++) {
                    for (n=0; n<nPrev; n++) {
                        if (cl[m] == pl[n]) {
                            count++;
                            break;
                        }
                    }
                }

                if (count >= 3) {
                    res[i] = 1;
                    break;
                }
            }
        }
    }

    return(res);
}')

d <- DF[, .(.(nm=iname)), by=.(nid, year)]
DF[d[, dummy := hasOccur(d$nid, d$year, d$V1)], dummy := dummy, on=.(nid, year)]

HTH.


Another possible data.table approach:

#create a integer column out of id for non-equi join later
setDT(DF)[, nid := rleid(id)]

          #self non-equi join
check3 <- DF[DF, .(x.id, x.name, x.year, x.nid, i.id, i.name, i.year, i.nid), on=.(nid<nid, year<year, name=name)][,
    #count the number of occurrence in previous id and year
    uniqueN(x.name, na.rm=TRUE), by=.(i.id, i.year, x.id, x.year)][,
        #check if more than 3
        any(V1 >= 3L), by=.(i.id, i.year)]

#update join to add result to original DF
DF[check3, dummy := as.integer(V1), on=c("id"="i.id", "year"="i.year")]
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • Thank you for your solution. It works for the example, but not when applied to my ~700,000 observation data set (see my edit in the question). – Lucas E Oct 09 '18 at 17:47
  • @LucasE I added a `Rcpp` approach – chinsoon12 Oct 10 '18 at 00:55
  • I installed the necessary packages, and tried to run the code. After one hour, Rstudio was still running the code, and I had to terminate it. At this point, Rstudios stopped responding, and I had to restart Rstudios. But thanks anyways! (I want to point out that I am working from my work laptop, which might not be tough enough to handle such big operations.) – Lucas E Oct 10 '18 at 10:46
  • @LucasE May I ask if the solution works now? Or you just trying to close the question? – chinsoon12 Oct 23 '18 at 22:22
0

I will take any excuse to convert a data problem into a graph problem, so cheers to frank for bringing this up. Here is an igraph solution. Essentially it converts the data into a directed tree. All nodes are only compared to nodes higher up in the hierarchy. So C is the top of the tree and isn't compared to anything else while J is the terminal and is compared to all nodes above it in the chain. To pull all the nodes that is higher in the hierarchy, all you need to do is use the (depth first search) dfs function

library(tidyverse)
library(igraph)

#node list containing data specific to the group
nodelist <- DF %>%
  group_by(id, year) %>%
  nest()

#edge list containing connections. A group directly before a node points toward a future group
edgelist <- data.frame(
  from = nodelist$id %>% .[1:(length(.)-1)],
  to = nodelist$id %>% .[2:length(.)]
)

#create the data frame
g <- graph_from_data_frame(edgelist, T, nodelist)

#let's iterate through the nodes
dummy <- map_lgl(V(g)$name, function(vertex){

  #depth first search to pull out all nodes higher up on the tree 
  full_path <- dfs(g, vertex, 'in', unreachable = F) %>%
    .$order %>% 
    .[!is.na(.)] 

  #if there is no node higher up, then we're done
  if(length(full_path) <= 1) return(F)

  #The first node returned is the node we're iterating over
  this_vertex <- full_path[1]
  other_vertices <- full_path[full_path != this_vertex]

  #this is the logic for the dummy variable
  similar_groups <- map_lgl(other_vertices, function(other_vertex){
    (sum(this_vertex$data[[1]]$name %in% other_vertex$data$name) >= 3) & 
      (this_vertex$year[[1]] != other_vertex$year)
  })

  return(T %in% similar_groups)
})

V(g)$dummy2 <- dummy

as_data_frame(g, 'vertices') %>%
  unnest()

enter image description here

   name year dummy2 name1 dummy
1     c 2000  FALSE    af     0
2     c 2000  FALSE    el     0
3     c 2000  FALSE    in     0
4     c 2000  FALSE    ud     0
5     d 2000  FALSE    ot     0
6     d 2000  FALSE    an     0
7     d 2000  FALSE    el     0
8     d 2000  FALSE    un     0
9     f 2002  FALSE    yt     0
10    f 2002  FALSE    ip     0
11    f 2002  FALSE    ot     0
12    f 2002  FALSE    el     0
13    g 2003  FALSE    yt     0
14    g 2003  FALSE    af     0
15    g 2003  FALSE    ol     0
16    g 2003  FALSE    in     0
17    h 2003  FALSE    in     0
18    h 2003  FALSE    eg     0
19    h 2003  FALSE    yt     0
20    h 2003  FALSE    af     0
21    j 2004   TRUE    ot     1
22    j 2004   TRUE    el     1
23    j 2004   TRUE    ip     1
24    j 2004   TRUE    yt     1
struggles
  • 825
  • 5
  • 10
  • Interesting approach! However, it returns the error message: `> g <- graph_from_data_frame(edgelist, T, nodelist) Error in graph_from_data_frame(edgelist, T, nodelist) : Duplicate vertex names`. y And consequently results in the following errors: `+ return(T %in% similar_groups) + }) Error in "igraph" %in% class(graph) : object 'g' not found > > V(g)$dummy2 <- dummy Error in V(g)$dummy2 <- dummy : object 'g' not found > > as_data_frame(g, 'vertices') %>% + unnest() Error in "igraph" %in% class(graph) : object 'g' not found`. Any thoughts? – Lucas E Oct 09 '18 at 15:34
  • your node list has to have unique values in the first column. – struggles Oct 09 '18 at 15:38
  • Right! After filtering out the duplicates, I managed to get the code running, so thanks a lot! I also added a progress bar to monitor the process. It seems it takes about 12 seconds to loop though the 'dummy' function, and it seems to be slowing down. If it does that for all 71,883 observations, it will take Rstudio at the very least 10 days to complete the function. I absolutely think this approach is beautiful and intuitive, but do you think this approach is the fastest? – Lucas E Oct 09 '18 at 17:39