2

My original data.table consists of three columns.
site, observation_number and id.

E.g. the following which is all the observations for id = z

|site|observation_number|id
|a   |                 1| z                 
|b   |                 2| z
|c   |                 3| z

Which means that ID z has traveled from a to b to c.

There is no fixed number of sites per id.

I wish to transform the data to an edge list like this

|from |to||id|
|a    | b| z |
|b    | c| z |

mock data

sox <- data.table(site =  c('a','b','c','a','c','c','a','d','e'),
       obsnum =c(1,2,3,1,2,1,2,3,4),
       id     =c('z','z','z','y','y','k','k','k','k'))

The way I am currently doing this, feels convoluted and is very slow (sox has 1.5 mio rows and dt_out has ca. 7.5 mio. rows). I basically use a for loop over observation_number to split the data in to chunks where each ID is only present once (that is - only one journey, to - from). Then I cast data, and rind all the chunks to a new data.table.

dt_out <- data.table()
maksimum = sox[,max(observation_number)]
for (i in 1:maksimum-1) {
  i=1
  mini = i
  maxi = i+1
  sox_t <- sox[observation_number ==maxi | observation_number ==mini, ]
  temp_dt <- dcast(sox_t[id %in% sox_t[, .N, by = id][N>=2]$id,
                             .SD[, list(site, observation_number, a=rep(c('from', 'to')))] ,by=id],
                       id='id', value.var='site', formula=id~a)
  dt_out <- rbind(dt_out, temp_dt)
  i=max
  }

I hope someone can help me optimize this, and preferable create a function where I can input the data.table, the site id, observationnumber id, and the id. For some reason I can't create a function regardless that works.

UPDATE

Using sytem time (and running system time a few times):

                             User - System - Elapsed
make_edgelist (data.table):  5.38     0.00      5.38
Data.table. with shift:     13.96     0.06     14.08 
dplyr, with arrange:         6.06     0.36      6.44

p.s. make_edgelist was updated to order the data.table

make_edgelist <- function(DT, site_var = "site", id_var = "id", obsnum_var   = "rn1") {
    DT[order(get(obsnum_var)),
    list(from = get(site_var)[-.N], to = get(site_var)[-1]), by = id_var]
}

I was surprised that dplyr (with lead) was almost as fast as make_edgelist and much faster than data.table with shift. I guess this means that dplyr will actually be faster with more complex lead/lags/shift.

Also I find it puzzling - but don't know enough to know if it has any significance, that dplyr used more 'system' time than any of the two data.table solutions.

Input data: 1.5 million rows. Result: 0.6 million rows.

Andreas
  • 6,612
  • 14
  • 59
  • 69

3 Answers3

4

With dplyr, you can try:

sox %>%
 group_by(id) %>%
 transmute(from = site,
           to = lead(from)) %>%
 na.omit()

  id    from  to   
  <chr> <chr> <chr>
1 z     a     b    
2 z     b     c    
3 y     a     c    
4 k     c     a    
5 k     a     d    
6 k     d     e    

As @Sotos noted, it could be useful to arrange the data first:

sox %>%
 arrange(id, obsnum) %>%
 group_by(id) %>%
 transmute(from = site,
           to = lead(from)) %>%
 na.omit()
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
3

Is this what you are looking for?

sox[, .(from = site[-.N], to = site[-1]), by = id]

#    id from to
# 1:  z    a  b
# 2:  z    b  c
# 3:  y    a  c
# 4:  k    c  a
# 5:  k    a  d
# 6:  k    d  e

Wrapped in a function:

make_edgelist <- function(DT, site_var = "site", id_var = "id") {
  DT[, .(from = get(site_var)[-.N], to = get(site_var)[-1]), by = id_var]
}

Note: This solution assumes the data is already ordered by observation number. To avoid this assumptions add order(obsnum) before the first comma.

s_baldur
  • 29,441
  • 4
  • 36
  • 69
  • Thank you. This was both the fastes solution, and the one I asked for (function - you showed me how to use "get"). – Andreas Oct 21 '19 at 08:30
2

Using data.table, in case it's faster than the dplyr solution above, you have:

sox <- sox[order(id, obsnum)]
sox[, from := shift(site), by = "id"]
sox <- sox[!is.na(from)]
setnames(sox, "site", "to")
sox[, obsnum := NULL]
setcolorder(sox, c("id", "from", "to"))
sox
#>    id from to
#> 1:  k    c  a
#> 2:  k    a  d
#> 3:  k    d  e
#> 4:  y    a  c
#> 5:  z    a  b
#> 6:  z    b  c
caldwellst
  • 5,719
  • 6
  • 22
  • very cool - both answers use shift/lead... I feel so stupid... Will accept the data.table question when I get a chance to time it. How could i make this in to a function, which accepts the data.table and the variables to use? – Andreas Oct 14 '19 at 09:06
  • Hey Andreas, I actually have never used `data.table` before, so not sure how functionalize the arguments, but I imagine there is a vignette or other resource available to you. – caldwellst Oct 14 '19 at 09:14
  • hahaha - love it - you still gave me a correct answer for a library you've never used :-) Great – Andreas Oct 14 '19 at 09:19
  • follow up: "id" is actually an identifier I prefer to hide, when passing data along. Is there a way to create a unique non-meaningfull ID variable I can use instead? thinking of something like this factor(sox$id, labels=1:length(unique(sox$id))) – Andreas Oct 14 '19 at 09:21
  • Yeah, something like that should work. I would recommend using `rbenchmark` to determine which solution is faster though, I can't guarantee that mine is. – caldwellst Oct 14 '19 at 09:28