
I'm trying to solve a data management problem in R.

Suppose my data looks as follows:

id <- c("123", "414", "606")
next.up <- c("414", "606", "119")
is.cond.met <- as.factor(c("FALSE", "FALSE", "TRUE"))
df <- data.frame(id, next.up, is.cond.met)

> df
   id next.up is.cond.met
1 123     414       FALSE
2 414     606       FALSE
3 606     119        TRUE


What I'd like to obtain is the following:

id <- c("123", "414", "606")
next.up <- c("414", "606", "119")
is.cond.met <- as.factor(c("FALSE", "FALSE", "TRUE"))
origin <- c("606", "606", "119")
df.result <- data.frame(id, next.up, is.cond.met, origin)

> df.result
   id next.up is.cond.met origin
1 123     414       FALSE    606
2 414     606       FALSE    606
3 606     119        TRUE    119


In other words: I want to match each ID to its "origin" when a given condition (is.cond.met) is true. The difficulty I'm having is that this is iterative and hierarchical: to find the origin I may have to go through multiple degrees of separation. The logical steps are illustrated below. I'm really not sure how to tackle this in R.

(Image: logical steps)


UPDATE
One of the comments proposes a data.frame solution that works for sorted data, as in the minimal example above. In truth, my data is not sorted in such a manner. A better example is as follows:

id <- c("961980", "14788", "902460", "900748", "728912", "141726", "1041190", "692268")
next.up <- c("20090", "655036", "40375164", "40031850", "40368996", "961980", "141726", "760112")
is.cond.met <- c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
df <- data.frame(id, next.up, is.cond.met, stringsAsFactors = FALSE)

glimpse(df)

Observations: 8
Variables: 3
$ id          <chr> "961980", "14788", "902460", "900748", "728912", "141726", "1041190", "692268"
$ next.up     <chr> "20090", "655036", "40375164", "40031850", "40368996", "961980", "141726", "760112"
$ is.cond.met <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE
> df
       id  next.up is.cond.met
1  961980    20090        TRUE
2   14788   655036       FALSE
3  902460 40375164       FALSE
4  900748 40031850       FALSE
5  728912 40368996       FALSE
6  141726   961980       FALSE
7 1041190   141726       FALSE
8  692268   760112       FALSE


UPDATE 2: the end result should look like this:

> df.end.result
       id  next.up is.cond.met origin
1  961980    20090        TRUE   <NA>
2   14788   655036       FALSE   <NA>
3  902460 40375164       FALSE   <NA>
4  900748 40031850       FALSE   <NA>
5  728912 40368996       FALSE   <NA>
6  141726   961980       FALSE 961980
7 1041190   141726       FALSE 961980
8  692268   760112       FALSE   <NA>
Jaap
Thomas Speidel
  • Can you add a few more rows to the example? I'm not sure I'm following the logic – David Arenburg Jul 13 '16 at 17:44
  • Are you trying to match entries from 2 different data sets? That's the only sense I can try to make out of this. If that is the case, it might be helpful to make that explicit. – Bryan Goggin Jul 13 '16 at 17:50
  • @BryanGoggin No. the data is as in df. df.result only shows what I would like the end result to look like. – Thomas Speidel Jul 13 '16 at 17:59
  • If there is a gap in the data as you mentioned in the updated example, what should the result be? Should the algorithm search for the next.up only in the next row or in all the following rows? And if the next.up cannot be found, what should the algorithm do? –  Jul 19 '16 at 08:33
  • @George91 It should search in all the following (and/or previous) rows. If no match is found, origin==NA – Thomas Speidel Jul 19 '16 at 13:45
  • If you want to do this efficiently, you should probably store your data in a tree structure (next.up being the "parent" to a given node) and use an algo to traverse. I'm no computer scientist and so don't know precisely the best tool, but this looks promising: https://cran.r-project.org/web/packages/data.tree/vignettes/data.tree.html Of course, this assumes that your data has a modicum of sanity (no next.up loops, no repeating id rows). – Frank Jul 20 '16 at 16:31

3 Answers


I have extended your example data a bit to show what happens with more TRUE values in is.cond.met. Using the data.table package, you could do:

library(data.table)
setDT(df)[, grp := shift(cumsum(is.cond.met), fill=0)
          ][, origin := ifelse(is.cond.met, next.up, id[.N]), by = grp][]

which gives:

> df
    id next.up is.cond.met grp origin
1: 123     414       FALSE   0    606
2: 414     606       FALSE   0    606
3: 606     119        TRUE   0    119
4: 119     321       FALSE   1    321
5: 321     507        TRUE   1    507
6: 507     185        TRUE   2    185

Explanation:

  1. Create a grouping variable first with shift(cumsum(is.cond.met), fill=0).
  2. With ifelse(is.cond.met, next.up, id[.N]) you assign the correct values to origin.
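
The two steps can also be traced in base R, which makes the intermediate grouping visible. This is only an illustrative sketch on the extended example data, not the data.table code above: the helper vector last.id is introduced here for illustration, and it shares the same assumption that the rows are already sorted along the chain.

```r
# Extended example data from this answer
id          <- c("123", "414", "606", "119", "321", "507")
next.up     <- c("414", "606", "119", "321", "507", "185")
is.cond.met <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

# Step 1: lagged cumulative sum, so that a TRUE row closes its own group
grp <- c(0, head(cumsum(is.cond.met), -1))

# Last id within each group, repeated for every row of that group
# (last.id is a helper name used only in this sketch)
last.id <- ave(id, grp, FUN = function(x) rep(x[length(x)], length(x)))

# Step 2: TRUE rows take next.up, all other rows take the last id of the group
origin <- ifelse(is.cond.met, next.up, last.id)

data.frame(id, next.up, is.cond.met, grp, origin)
```

Because the grouping depends purely on row order, any unsorted input breaks it, which is what happens with the updated example data.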

Note: The id and next.up columns should be of class character for the above to work (for that reason I used stringsAsFactors = FALSE in constructing the extended example data). If they are factors, convert them first with as.character. If is.cond.met isn't already a logical, convert it with as.logical.
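
A minimal sketch of that conversion, using the first example from the question, where is.cond.met was built with as.factor. Passing stringsAsFactors = TRUE is an assumption made here to reproduce factor columns regardless of R version:

```r
# First example from the question, with every column stored as a factor
df <- data.frame(id = c("123", "414", "606"),
                 next.up = c("414", "606", "119"),
                 is.cond.met = as.factor(c("FALSE", "FALSE", "TRUE")),
                 stringsAsFactors = TRUE)

# Convert the key columns to character and the condition to logical.
# Going through as.character() first keeps the labels, not the factor codes.
df$id          <- as.character(df$id)
df$next.up     <- as.character(df$next.up)
df$is.cond.met <- as.logical(as.character(df$is.cond.met))

sapply(df, class)
```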


On the updated example data, the above code gives:

        id  next.up is.cond.met grp origin
1:  961980    20090        TRUE   0  20090
2:   14788   655036       FALSE   1 692268
3:  902460 40375164       FALSE   1 692268
4:  900748 40031850       FALSE   1 692268
5:  728912 40368996       FALSE   1 692268
6:  141726   961980       FALSE   1 692268
7: 1041190   141726       FALSE   1 692268
8:  692268   760112       FALSE   1 692268

Used data:

id <- c("123", "414", "606", "119", "321", "507")
next.up <- c("414", "606", "119", "321", "507", "185")
is.cond.met <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

df <- data.frame(id, next.up, is.cond.met, stringsAsFactors = FALSE)
Jaap
  • Thanks. While adapting the code to the real data, I get the following error: `Type of RHS ('character') must match LHS ('integer'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)` – Thomas Speidel Jul 13 '16 at 18:32
  • @ThomasSpeidel Did you convert the columns as I said in the ***note***? – Jaap Jul 13 '16 at 18:34
  • @ThomasSpeidel What is the result of `sapply(df, class)`? – Jaap Jul 13 '16 at 18:43
  • I did. `> sapply(df.test, class) id next.up is.cond.met grp origin "character" "character" "logical" "integer" "integer" ` – Thomas Speidel Jul 13 '16 at 18:54
  • @ThomasSpeidel Could you include a sample of your real data in the question? (preferably the output of for example `dput(head(df.test, 10))`) – Jaap Jul 13 '16 at 20:35
  • for some reasons the code works if I filter the original data. But produces the wrong "origin" when I do not filter it. Does sort order matter in the data.table approach provided? – Thomas Speidel Jul 13 '16 at 22:03
  • @ThomasSpeidel As a result of the way you presented the example data, the order does matter in the approach I've given above. If you can't share (a part of) your real data, could you construct an example dataset that better mimics your real data and problem? Without an example that reproduces the problem, it is quite a lot harder to say how to solve this. – Jaap Jul 14 '16 at 06:43
  • thanks. It's hard to capture the data idiosyncrasies while preserving confidentiality, maximizing brevity and clarity ;-). My data is not sorted. I have edited the question to include a better example. – Thomas Speidel Jul 14 '16 at 14:21
  • @ThomasSpeidel I've included the output of my code on the updated example. Could you indicate whether this is the output you expected? If that is not the case, could you indicate in which rows it should give a different result (and in what way)? – Jaap Jul 19 '16 at 19:33
  • I've included another update with what it should look like – Thomas Speidel Jul 20 '16 at 16:18

So, imho, I think you cannot solve it without an iterative update.

Similar to @procrastinatus-maximus here is an iterative solution with dplyr

library(dplyr)
# take the columns from df itself rather than from same-named global vectors
dfIterated <- data.frame(df, cond.origin.node = df$id, 
                         cond.update = df$is.cond.met, stringsAsFactors = FALSE)
initial.cond <- dfIterated$is.cond.met
while(!all(dfIterated$is.cond.met %in% c(TRUE, NA))) {
  dfIterated <- dfIterated %>% 
    mutate(cond.origin.node = if_else(is.cond.met,
                                      cond.origin.node, 
                                      next.up),
           parent.match = match(next.up, id),
           cond.update = (cond.update[parent.match] | cond.update),
           cond.origin.node = if_else(!is.cond.met & cond.update,
                                      next.up[parent.match],
                                      next.up),
           is.cond.met = cond.update)
}
# here we use ifelse instead of if_else since it is less type strict
dfIterated <- dfIterated %>%
  mutate(cond.origin.node = ifelse(initial.cond,  
                                   yes = NA, 
                                   no  = cond.origin.node))

edit: added starting condition; replaced ifelse by dplyr::if_else


Explanation: We iteratively update the dfIterated to include all next.up nodes as already suggested. Here we do it for every id in parallel.

  1. We mutate cond.origin.node, replacing it with the id if is.cond.met == TRUE and with next.up otherwise. NA values in is.cond.met return NA values themselves, which is very practical in our case.
    • We then compute the matching parent index
  2. We update cond.update where we match the parent in the id column. (Values with no match in id return NA.) We use the | (or) operator, which fortunately returns TRUE for TRUE | NA, so a previous TRUE entry in cond.update is kept.
  3. We then need to compute the originating node for the TRUE condition.
  4. And then update the condition in is.cond.met
  5. Repeat everything until is.cond.met consists solely of TRUEs or NAs. The origin will contain nodes for which is.cond.met == TRUE
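
The idea of updating every id in parallel can also be sketched in base R. The following is not the dplyr code above but an illustrative variant that keeps one pointer per row and moves it up the next.up chain until it reaches a row with is.cond.met == TRUE. The helper name find_origin and its max_iter argument are hypothetical. The sketch assumes ids are unique and is.cond.met contains no NA, and it follows the question's second update in leaving origin as NA for rows that meet the condition themselves.

```r
# Hypothetical helper: for each row, walk up the next.up chain until a row
# with is.cond.met == TRUE is found; return that row's id, otherwise NA.
# Assumes unique ids; max_iter guards against next.up loops.
find_origin <- function(df, max_iter = nrow(df)) {
  parent  <- match(df$next.up, df$id)   # row index of each parent, NA if absent
  current <- parent                     # ancestor currently inspected, per row
  origin  <- rep(NA_character_, nrow(df))
  for (k in seq_len(max_iter)) {
    # rows still searching: condition not met on the row itself,
    # no origin found yet, and an ancestor left to inspect
    open <- !df$is.cond.met & is.na(origin) & !is.na(current)
    if (!any(open)) break
    hit <- open & df$is.cond.met[current]   # inspected ancestor meets the condition
    origin[hit] <- df$id[current[hit]]
    move <- open & !hit                     # otherwise step one level up
    current[move] <- parent[current[move]]
  }
  origin
}

# Updated example data from the question
df <- data.frame(
  id = c("961980", "14788", "902460", "900748", "728912", "141726", "1041190", "692268"),
  next.up = c("20090", "655036", "40375164", "40031850", "40368996", "961980", "141726", "760112"),
  is.cond.met = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE)

df$origin <- find_origin(df)
df
```

On this data, rows 6 and 7 get origin 961980 and every other row stays NA. The loop runs once per level of the deepest chain, so the number of passes depends on depth rather than on the number of rows.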

The output of the above example looks like this:

> dfIterated
       id  next.up is.cond.met cond.origin.node cond.update
1  961980    20090        TRUE             <NA>        TRUE
2   14788   655036          NA             <NA>          NA
3  902460 40375164          NA             <NA>          NA
4  900748 40031850          NA             <NA>          NA
5  728912 40368996          NA             <NA>          NA
6  141726   961980        TRUE           961980        TRUE
7 1041190   141726        TRUE           961980        TRUE
8  692268   760112          NA             <NA>          NA

Hope this helps! A forward lookup would work in similar fashion. Further improvements depend on what kind of results you want to keep (e.g. do you really want to overwrite is.cond.met ?)

Drey
  • this is almost what I want, except row 1 should evaluate to missing because as soon as the condition is met, **origin** should not evaluate to **next.up**. See my update. That should be trivial to fix. My data is huge, so I hope this approach is fast! I will test on the real data and will report back. – Thomas Speidel Jul 20 '16 at 16:23
  • Interestingly, it is not that *trivial* to fix :-) since it would break the assumptions of the algorithm, but luckily we can store it at the beginning and replace the origin at the end accordingly. (Edits made in the code.) The additional storage complexity is in O(2n) since we just require 3+3 additional columns; that means that if you load a < 10GB file, expect roughly a 20GB result. Single operations should be fast, as they are already vectorized (consider using `dplyr::if_else` instead of `ifelse` for faster processing). The algorithm should stop after the deepest node has been reached. – Drey Jul 20 '16 at 18:23
  • btw if your data is more than 10 GB you can switch to data.table. It will definitely be faster - the principle will remain the same but the syntax will be different. – Drey Jul 20 '16 at 18:36
  • Update: I'm still having problems with the code proposed. It works fine in the example. However, it does not work as expected on the real data. I'm struggling to understand why. – Thomas Speidel Jul 25 '16 at 14:44
  • Since the bounty is about to expire, I will award it to @Drey, since his solution gets me the closest to what I'm trying to achieve. – Thomas Speidel Jul 25 '16 at 14:46
  • Thank you! Can you describe in more detail what is troubling you with the real data ? – Drey Jul 26 '16 at 07:41

I hope I've understood your problem correctly; here follows my point of view. It seems like you are trying to solve a network problem in terms of tables of data. I suggest the following formulation.

We have a network, defined as a set of edges (columns id and next.up correspond to vertex_from and vertex_to). The network is a set of trees. Column is.cond.met maps vertices which are endpoints or the roots of the trees. Trees with unmapped root aren't taken into account.
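
Whether the data really forms a set of trees can be checked before building the graph. The base R sketch below is an illustration under stated assumptions, not part of the approach itself: the helper name is_forest is hypothetical, and "set of trees" is taken to mean that no id is repeated and that following next.up never loops back.

```r
# Hypothetical helper: TRUE if ids are unique and the next.up chains
# contain no loop, i.e. the edge list describes a set of trees.
is_forest <- function(id, next.up) {
  if (anyDuplicated(id) > 0) return(FALSE)   # repeated id rows
  parent  <- match(next.up, id)              # row index of each parent, NA if absent
  current <- parent
  # Without loops, every chain runs out of parents within length(id) steps
  for (k in seq_along(id)) {
    if (all(is.na(current))) return(TRUE)
    current <- parent[current]               # step every row one level up
  }
  all(is.na(current))                        # anything still alive sits on a loop
}

is_forest(c("a", "b", "c"), c("b", "c", "z"))   # chain a -> b -> c -> z
is_forest(c("a", "b"), c("b", "a"))             # loop a -> b -> a
```

The first call returns TRUE and the second FALSE. On data that fails the check, the traversal below may still run, but the notion of a single origin per id is no longer well defined.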

I've slightly modified your MRE to make it more demonstrative.

id <- c("961980", "14788", "902460", "900748", "728912", "141726", "1041190", "692268", "40368996", "555555", "777777")
next.up <- c("20090", "655036", "40375164", "40031850", "40368996", "961980", "141726", "760112", "692268", "760112", "555555")
is.cond.met <- c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
library(data.table)  # data.table() is used here, so the package must be loaded first
dt <- data.table(id, next.up, is.cond.met, stringsAsFactors = FALSE)

Now let's translate everything to the language of graphs.

library(data.table)
library(magrittr)
library(igraph)

graph_from_edgelist(as.matrix(dt[, 1:2, with = FALSE])) -> dt_graph
V(dt_graph)$color <- ifelse(V(dt_graph)$name %in% dt[is.cond.met == TRUE]$next.up, "green", "yellow")
E(dt_graph)$arrow.size <- .7
E(dt_graph)$width <- 2
plot(dt_graph, edge.color = "grey50")

We have the following graph. (Image: plot of the graph)

Green vertices are mapped roots - let's name them treeroots. Their neighbors of the first order are the roots of the big main branches of each tree - let them be branchroots. The problem is to find, for every vertex in the id column of the initial data, the corresponding branchroot.

treeroots <- dt[is.cond.met == T]$next.up %>% unique
lapply(V(dt_graph)[names(V(dt_graph)) %in% treeroots], 
       function(vrtx) neighbors(dt_graph, vrtx, mode = "in")) -> branchroots

We can find all vertices lying below each branchroot with the help of the ego function from the igraph package.

lapply(seq_along(branchroots), function(i) {
  data.table(tree_root = names(branchroots[i]), branch_root = branchroots[[i]]$name)
}) %>% rbindlist() -> branch_dt

branch_dt[, trg_vertices := ego(dt_graph, order = 1e9, 
                                V(dt_graph)[names(V(dt_graph)) %in% branch_dt$branch_root], 
                                mode = "in", mindist = 1) %>% lapply(names)]

branch_dt
#    tree_root branch_root    trg_vertices
# 1:     20090      961980  141726,1041190
# 2:    760112      692268 40368996,728912
# 3:    760112      555555          777777

After that we can create the origin column.

sapply(seq_along(branch_dt$branch_root), 
       function(i) rep(branch_dt$branch_root[i], 
                       length(branch_dt$trg_vertices[[i]]))) %>% unlist -> map_vertices
branch_dt$trg_vertices %>% unlist() -> map_names
names(map_vertices) <- map_names

dt[, origin := NA_character_]
dt[id %in% map_names, origin := map_vertices[id]]
dt
#           id  next.up is.cond.met origin
#  1:   961980    20090        TRUE     NA
#  2:    14788   655036       FALSE     NA
#  3:   902460 40375164       FALSE     NA
#  4:   900748 40031850       FALSE     NA
#  5:   728912 40368996       FALSE 692268
#  6:   141726   961980       FALSE 961980
#  7:  1041190   141726       FALSE 961980
#  8:   692268   760112        TRUE     NA
#  9: 40368996   692268       FALSE 692268
# 10:   555555   760112       FALSE     NA
# 11:   777777   555555       FALSE 555555

For the sake of convenience, I've arranged the resulting code into a function.

add_origin <- function(dt) {
  require(data.table)
  require(magrittr)
  require(igraph)

  setDT(dt)
  graph_from_edgelist(as.matrix(dt[, .(id, next.up)])) -> dt_graph

  treeroots <- dt[is.cond.met == T]$next.up %>% unique

  lapply(V(dt_graph)[names(V(dt_graph)) %in% treeroots], 
         function(vrtx) neighbors(dt_graph, vrtx, mode = "in")) -> branchroots

  lapply(seq_along(branchroots), function(i) {
    data.table(tree_root = names(branchroots[i]), branch_root = branchroots[[i]]$name)
  }) %>% rbindlist() -> branch_dt

  branch_dt[, trg_vertices := rep(list(NA), nrow(branch_dt))][]
  vertices_on_branch <- ego(dt_graph, order = 1e9, 
                            V(dt_graph)[names(V(dt_graph)) %in% branch_dt$branch_root], 
                            mode = "in", mindist = 1) %>% lapply(names)
  set(branch_dt, j = "trg_vertices", value = list(vertices_on_branch))

  sapply(seq_along(branch_dt$branch_root),
         function(i) rep(branch_dt$branch_root[i], 
                         length(branch_dt$trg_vertices[[i]]))) %>% unlist -> map_vertices
  branch_dt$trg_vertices %>% unlist() -> map_names
  names(map_vertices) <- map_names

  dt[, origin := NA_character_]
  dt[id %in% map_names, origin := map_vertices[id]]
  dt[]
}

For your MRE it produces the desired output.

df0 <- data.frame(id = c("961980", "14788", "902460", "900748", "728912", "141726", "1041190", "692268"),
                  next.up = c("20090", "655036", "40375164", "40031850", "40368996", "961980", "141726", "760112"),
                  is.cond.met = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), stringsAsFactors = FALSE)

df0 %>% add_origin

#         id  next.up is.cond.met origin
# 1:  961980    20090        TRUE     NA
# 2:   14788   655036       FALSE     NA
# 3:  902460 40375164       FALSE     NA
# 4:  900748 40031850       FALSE     NA
# 5:  728912 40368996       FALSE     NA
# 6:  141726   961980       FALSE 961980
# 7: 1041190   141726       FALSE 961980
# 8:  692268   760112       FALSE     NA

The described approach should be significantly faster than an iterative update of a data.frame inside a loop.

inscaven