I have a list of marked individuals (column Mark) that have been captured in various years (column Year) within a range of the river (LocStart to LocEnd). Locations on the river are in meters.

I would like to know whether a marked individual has used overlapping ranges between years, i.e. whether the individual has returned to the same segment of the river from year to year.

Here is an example of the original data set:

ID  Mark  Year  LocStart  LocEnd
1   1081  1992  21,729    22,229
2   1081  1992  21,203    21,703
3   1081  2005  21,508    22,008
4   1126  1994  19,222    19,522
5   1126  1994  18,811    19,311
6   1283  2005  21,754    22,254
7   1283  2007  22,025    22,525

Here is what I would like the final answer to look like:

Mark  Year1  Year2  IDs
1081  1992   2005   1, 3
1081  1992   2005   2, 3
1283  2005   2007   6, 7

In this case, individual 1126 would not be in the final output, as its only two ranges are from the same year. I realize it would be easy to remove all the records where Year1 = Year2.

I would like to do this in R and have looked into the IRanges package, but I have not been able to group by Mark or to extract the Year1 and Year2 information.

ekad
JoeBird

1 Answer

Using the foverlaps() function from the data.table package:

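require(data.table)    ## foverlaps() is available from v1.9.4
## dt holds the original data (columns ID, Mark, Year, LocStart, LocEnd)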
setkey(setDT(dt), Mark, LocStart, LocEnd)               ## (1)
olaps = foverlaps(dt, dt, type="any", which=TRUE)       ## (2)
olaps = olaps[dt$Year[xid] != dt$Year[yid]]             ## (3)
olaps[, `:=`(Mark  = dt$Mark[xid], 
             Year1 = dt$Year[xid],
             Year2 = dt$Year[yid],
             xid   = dt$ID[xid], 
             yid   = dt$ID[yid])]                       ## (4)
olaps = olaps[xid < yid]                                ## (5)
#    xid yid Mark Year1 Year2
# 1:   2   3 1081  1992  2005
# 2:   1   3 1081  1992  2005
# 3:   6   7 1283  2005  2007

  1. We first convert the data.frame to a data.table by reference using setDT. Then we key the data.table on the columns Mark, LocStart and LocEnd, which allows us to perform overlapping range joins within each Mark.

  2. We compute self-overlaps (dt with itself) for any type of overlap, returning only the matching row indices by using which = TRUE.

  3. Remove all index pairs where the Year values corresponding to xid and yid are identical.

  4. Add all the other columns and replace xid and yid with the corresponding ID values, by reference.

  5. Remove all rows where xid >= yid. If row 1 overlaps with row 3, then row 3 also overlaps with row 1, and we don't need both. foverlaps() doesn't have a way to remove these by default yet.
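
For anyone who cannot install a data.table version with foverlaps(), the same logic can be cross-checked in base R. This is only a sketch and not part of the solution above: it assumes the locations are stored as plain numbers (thousands separators dropped), that ranges touching at an endpoint count as overlapping, and the names dat, pairs, keep and res are made up for illustration. It pairs every record with every other record of the same Mark, so it is quadratic per individual and only reasonable for small groups.

```r
# Sample data from the question (locations in meters).
dat <- data.frame(
  ID       = 1:7,
  Mark     = c(1081, 1081, 1081, 1126, 1126, 1283, 1283),
  Year     = c(1992, 1992, 2005, 1994, 1994, 2005, 2007),
  LocStart = c(21729, 21203, 21508, 19222, 18811, 21754, 22025),
  LocEnd   = c(22229, 21703, 22008, 19522, 19311, 22254, 22525)
)

# Pair every record with every record of the same individual.
# The non-key columns get the suffixes 1 and 2 (ID1, ID2, Year1, ...).
pairs <- merge(dat, dat, by = "Mark", suffixes = c("1", "2"))

# Keep each unordered pair once (ID1 < ID2), drop same-year pairs, and
# keep only pairs whose intervals [LocStart, LocEnd] intersect.
keep <- with(pairs,
  ID1 < ID2 &
  Year1 != Year2 &
  LocStart1 <= LocEnd2 & LocStart2 <= LocEnd1
)

res <- pairs[keep, c("Mark", "Year1", "Year2", "ID1", "ID2")]
res <- res[order(res$Mark, res$ID1, res$ID2), ]
rownames(res) <- NULL
print(res)
```

On the sample data this keeps the ID pairs (1, 3), (2, 3) and (6, 7), matching the desired output. Note that Year1 here is simply the year of the lower ID, so it is not guaranteed to be the earlier year in other data sets.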

Arun
  • This looks really cool and I can't wait to test it on the large data set. The function foverlaps however does not seem to be in the latest version of the package data.table (installed tonight)? I get an error that the function does not exist. It is listed in the reference manual but not when I request the list of functions i.e. ls("package:data.table"). Weird. I'll look into this some more tomorrow. – JoeBird Jan 21 '15 at 06:42
  • Thanks. It is available from 1.9.4 (probably you've to install from source). But there were some bugs quashed in 1.9.5. Probably [this](https://github.com/Rdatatable/data.table/wiki/Installation) helps. – Arun Jan 21 '15 at 06:43
  • I was able to install and use data.table version 1.9.4 after updating my version of R (which needed to be updated badly). Once that was done, your solution Arun worked like a charm! Thank you very much!! – JoeBird Jan 22 '15 at 08:01