2

There are three main columns to my data shown below (squirrel_id = unique individual id, byear = year of birth, and dyear = year of death):

> summary(complete)
  squirrel_id        byear          dyear     
 Min.   :  416   Min.   :1989   Min.   :1989  
 1st Qu.: 4152   1st Qu.:1997   1st Qu.:1998  
 Median : 7870   Median :2003   Median :2004  
 Mean   :10419   Mean   :2004   Mean   :2004  
 3rd Qu.:16126   3rd Qu.:2011   3rd Qu.:2012  
 Max.   :23327   Max.   :2017   Max.   :2017 

I have a second piece of data (shown below) that I am trying to incorporate into the above dataset.

mast.yr<-c("1993", "1998", "2005", "2010", "2014")

I am trying to do two things:

  1. Add a column that says if the individual (squirrel_id) was alive during any of the mast.yr years (dyear- byear = range of years alive (including byear and dyear).
  2. Add another column that counts how many mast.yr years each individual (squirrel_id) experienced during lifetime (dyear- byear = range of years alive (including byear and dyear).

To generate the first column, I've been using the mutate function in the dplyr package, but I can only get it to work for the byear and dyear separately, like so:

complete <- complete %>%
    mutate (mast = ifelse (byear %in% c("1993", "1998", "2005", "2010", "2014"), 1, 0), 
    mast = ifelse (dyear %in% c("1993", "1998", "2005", "2010", "2014"), 1, 0))) 

But it doesn't give the desired output since it is considering the byear and dyear on their own, rather than as a continuous time period. I have tried the solutions posted here and here, but have had no luck.

Any suggestions would be appreciated!

A copy of my data can be found here. For reproducibility in the future, here is a sample:

> head(DF)

> squirrel_id  byear dyear
  <dbl>        <int> <dbl>
  6715         2006  2006
  22274        2016  2017
  20445        2014  2017
  19528        2013  2013
  2674         1995  1995
  1419         1992  1993
Blundering Ecologist
  • 1,199
  • 2
  • 14
  • 38

2 Answers2

2
# put target years in a table
mastDF = data_frame(year = as.integer(mast.yr))

# count based on conditions    
dat %>% 
  mutate(in_mast = count_matches(., mastDF, year >= byear, year <= dyear) > 0) %>%
  as.tbl

# A tibble: 100 x 4
   squirrel_id byear dyear in_mast
         <int> <int> <int>   <lgl>
 1        6715  2006  2006   FALSE
 2       22274  2016  2017   FALSE
 3       20445  2014  2017    TRUE
 4       19528  2013  2013   FALSE
 5        2674  1995  1995   FALSE
 6        1419  1992  1992   FALSE
 7       15014  2004  2004   FALSE
 8       10946  2009  2012    TRUE
 9        4369  1998  1999    TRUE
10        4344  1992  1999    TRUE
# ... with 90 more rows

where count_matches is a helper function:

library(data.table)
count_matches = function(DF, targetDF, ...){
  onexpr = substitute(list(...))
  data.table(targetDF)[data.table(DF), on=eval(onexpr), allow.cart=TRUE, .N, by=.EACHI]$N
}

If you want both the count and whether the count is non-zero, this can be done by breaking it into a sequence of mutate arguments:

dat %>% 
  mutate(
    n_mast = count_matches(., mastDF, year >= byear, year <= dyear),
    in_mast = n_mast > 0
) %>% as.tbl

# A tibble: 6 x 5
  squirrel_id byear dyear n_mast in_mast
        <int> <int> <int>  <int>   <lgl>
1        6715  2006  2006      0   FALSE
2       22274  2016  2017      0   FALSE
3       20445  2014  2017      1    TRUE
4       19528  2013  2013      0   FALSE
5        2674  1995  1995      0   FALSE
6        1419  1992  1993      1    TRUE
Frank
  • 66,179
  • 8
  • 96
  • 180
  • Thanks for your help with this. Do you have any suggestions for how to add another column that counts how many `mast.yr` years each individual (`squirrel_id`) experienced during lifetime. – Blundering Ecologist May 05 '18 at 01:13
  • @BlunderingEcologist Sure. The count_matches function returns that count initially. I've edited the post to demonstrate. If you're interested, I wrote some details for how the function works here: https://stackoverflow.com/a/50140399/ – Frank May 05 '18 at 03:24
  • 1
    Thanks for explaining that a bit more. I appreciate your help! – Blundering Ecologist May 05 '18 at 15:42
1

Though @Frank has already provided an elegant solution but sqldf provides easier ways for non-equii joins. Using sqldf solution can be as:

mast.yr<-c("1993", "1998", "2005", "2010", "2014")

mastDf <- data.frame(year = as.integer(mast.yr))


library(sqldf)


sqldf("select dat.*, IFNULL(Mast.inMast,0) as n_Mast, IFNULL(Mast.inMast,0) >0 as inMast
       from dat left outer join 
      (select *, count(squirrel_id) as inMast 
      from dat, mastDf 
      where mastDf.year between dat.byear AND dat.dyear
      group by squirrel_id) Mast on
      dat.squirrel_id = Mast.squirrel_id")

#     squirrel_id byear dyear n_Mast inMast
# 1          6715  2006  2006      0      0
# 2         22274  2016  2017      0      0
# 3         20445  2014  2017      1      1
# 4         19528  2013  2013      0      0
# 5          2674  1995  1995      0      0
# 6          1419  1992  1992      0      0
# 7         15014  2004  2004      0      0
# 8         10946  2009  2012      1      1
# 9          4369  1998  1999      1      1
# 10         4344  1992  1999      2      1
#....90 more rows
MKR
  • 19,739
  • 4
  • 23
  • 33
  • Thanks for your help with this. Do you have any suggestions for how to add another column that counts how many `mast.yr` years each individual (`squirrel_id`) experienced during lifetime. – Blundering Ecologist May 05 '18 at 01:13
  • @BlunderingEcologist Yes, you can easily do that. Just add `group by squirrel_id` and use `count`. I have updated solution to cover that. – MKR May 05 '18 at 06:11