1

I have a large dataset (> 9 million rows) of times and locations when individual animals were detected at stations. I would like to calculate the distance between each station along each animal's path as it travelled between stations, as well as the time it took to travel between stations. And then I would like to summarize the total distance and time across all sections of the path.

For each individual, the dataset contains one record for each time it was detected at a station. If the individual stayed at a station for a long, consecutive period, there are multiple records (each ~30 s apart) for that period.

I can summarize the data to get one row per individual per station (see below). However, the output doesn't recognize when an individual travels to the same station more than once.

E.g.

id <- c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B")
site <- c("a", "a", "b", "a", "c", "c", "c", "d", "a", "b")
time <- seq(1:10)
lat <- c(1, 1, 2, 1, 3, 3, 3, 4, 1, 2)
lon <- c(1, 1, 2, 1, 3, 3, 3, 4, 1, 2)

df <- data.frame(id, site, time, lat, lon)

library(dplyr)

df %>% group_by(id, site, lat, lon) %>%
  summarize(timeStart = min(time), 
            timeEnd = max(time))

# A tibble: 6 x 6
# Groups:   id, site, lat [?]
  id    site    lat   lon timeStart timeEnd
  <fct> <fct> <dbl> <dbl>     <dbl>   <dbl>
1 A     a         1     1         1       4
2 A     b         2     2         3       3
3 A     c         3     3         5       7
4 A     d         4     4         8       8
5 B     a         1     1         9       9
6 B     b         2     2        10      10

I need an approach to group the data so that multiple visits to the same station (with trips to other stations in between) are each recognized as a separate "leg" of the trip.

Then, I need to calculate the great circle distance between consecutive stations, as well as the time difference between timeEnd (1st station) and timeStart (2nd station).

tnt
  • Can you provide your desired results based on your sample data? It seems you are grouping by too many parameters, but it is not clear what the expected output should be. In the meantime, the `geosphere` package will calculate the great-circle distances. – Dave2e Jan 04 '19 at 19:47
  • @Dave2e I can't provide the results, because I don't know how to do it. But for the example above, the end result should be 7 rows of data (instead of the 6 above). For individual A, there should be 2 rows for site a (before and after the individual was at site b); the start and end times for these two rows should be 1 & 2, and 4 & 4, respectively. There should also be columns for lat and lon. – tnt Jan 04 '19 at 19:57

2 Answers

3

First, the data.table function rleid is used to create a grouping variable: for each individual, each change of site represents a new group. Within each group, calculate the desired stats:

library(data.table)
library(geosphere)
setDT(df)
df2 <- df[ , .(id = id[1],
               site = site[1],
               lat = lat[1],
               lon = lon[1],
               first_time = min(time),
               last_time = max(time)),
           by = .(id_site = rleid(id, site))]
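
To make the grouping step concrete, here is a minimal base R sketch of what a run-length id does. The helper name `run_id` is hypothetical (it is not part of data.table), and pasting `id` and `site` into one key is an assumption made only to keep the example short; `rleid(id, site)` handles several columns directly.

```r
# Hypothetical helper: start a new group each time the value changes
# from the previous element (a rough base R analogue of a run-length id).
run_id <- function(x) cumsum(c(TRUE, x[-1] != x[-length(x)]))

id   <- c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "C")
site <- c("a", "a", "b", "a", "c", "c", "c", "d", "a", "b", "a")

# Combine id and site so a change in either one starts a new run
id_site <- run_id(paste(id, site))
id_site
# [1] 1 1 2 3 4 4 4 5 6 7 8
```

The second visit of A to site a gets its own group (3) instead of being merged with the first visit (1), which is the behavior asked for in the question.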

Then, for each individual, the great-circle distance between consecutive sites is calculated with geosphere::distHaversine. To avoid problems when an individual has only one or two records*, some checks are added:

df2[ , dist := if (.N == 1) {
  0
} else if (.N == 2) {
  c(0, distHaversine(c(lon[1], lat[1]), c(lon[2], lat[2])))
} else {
  c(0, distHaversine(as.matrix(.SD[ , .(lon, lat)])))
}, by = id]

#    id_site id site lat lon first_time last_time     dist
# 1:       1  A    a   1   1          1         2      0.0
# 2:       2  A    b   2   2          3         3 157401.6
# 3:       3  A    a   1   1          4         4 157401.6
# 4:       4  A    c   3   3          5         7 314755.2
# 5:       5  A    d   4   4          8         8 157281.8
# 6:       6  B    a   1   1          9         9      0.0
# 7:       7  B    b   2   2         10        10 157401.6
# 8:       8  C    a   1   1         11        11      0.0

Thus, for each individual, distance is calculated only once per new site. This contrasts with the other answer where distance calculations are performed between each time step (possibly many, it seems).
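
The question also asks for the travel time between stations and for totals per individual. A possible sketch of that last step follows. To stay self-contained it rebuilds the legs table by hand from the output printed above (including the `dist` values) rather than recomputing it; the names `legs`, `travel_time`, `totals`, `total_dist` and `total_time` are illustrative assumptions, and times are in whatever unit the `time` column uses.

```r
library(data.table)

# Legs table copied by hand from the printed output above (assumption:
# in practice this would be df2 after the dist step).
legs <- data.table(
  id         = c("A", "A", "A", "A", "A", "B", "B", "C"),
  site       = c("a", "b", "a", "c", "d", "a", "b", "a"),
  first_time = c(1, 3, 4, 5, 8, 9, 10, 11),
  last_time  = c(2, 3, 4, 7, 8, 9, 10, 11),
  dist       = c(0, 157401.6, 157401.6, 314755.2, 157281.8, 0, 157401.6, 0)
)

# Travel time: arrival at this site minus departure from the previous site,
# within each individual (NA for an individual's first leg).
legs[, travel_time := first_time - shift(last_time), by = id]

# Totals per individual across all legs
totals <- legs[, .(total_dist = sum(dist),
                   total_time = sum(travel_time, na.rm = TRUE)),
               by = id]
totals
```

With the sample data, individual C has a single record, so both totals are 0 for it.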


*Try e.g. distHaversine(cbind(1, 1)) (distGeo(cbind(1, 1))), or distHaversine(cbind(c(1, 1), c(1, 2))) (distGeo(cbind(c(1, 1), c(1, 2))))


Data

I added an individual with only one record as a test case.

id <- c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "C")
site <- c("a", "a", "b", "a", "c", "c", "c", "d", "a", "b", "a")
time <- seq(1:11)
lat <- c(1, 1, 2, 1, 3, 3, 3, 4, 1, 2, 1)
lon <- c(1, 1, 2, 1, 3, 3, 3, 4, 1, 2, 1)

df <- data.frame(id, site, time, lat, lon)
Henrik
  • Thanks @Henrik. I can see how your solution would work just as well as the one above. However, with the case_when statement included in the time calculation, the above solution only calculates time differences when the site changes. Also, that answer uses dplyr, which helps me keep my script similar throughout. – tnt Jan 08 '19 at 00:09
  • @user3220999 Thanks for your feedback. Sorry, but I don't quite understand which results you miss from code. Can you please clarify - I'm happy to add it! Cheers. – Henrik Jan 08 '19 at 00:14
  • @Henrik, my fault. Your answer isn't missing anything. I just prefer the dplyr answer above. Both work fine, but I can only accept one. – tnt Jan 08 '19 at 00:20
  • OK! No problem, which of the answers you accept doesn't matter to me. I was just curious if I could improve my answer, for you and not the least for future visitors of your post. Cheers – Henrik Jan 08 '19 at 00:23
2

This may not be your complete solution, but it is a good start. It finds the distance and time difference between consecutive rows of data, and sets the values to NA for the first row of each id and for rows where the site did not change.

df <- data.frame(id, site, time, lat, lon)

library(geosphere)
library(dplyr)

# sort data by id and time
df <- df[order(df$id, df$time), ]
# find the distance between consecutive rows
# note: longitude is the first column
df$distance <- c(NA, distGeo(df[, c("lon", "lat")]))
# find the time difference between rows for each id, only where the site changes
df <- df %>%
  group_by(id) %>%
  mutate(dtime = case_when(site != lag(site) ~ time - lag(time),
                           TRUE ~ NA_integer_))
# remove distances where there is no delta time
# (first row of each id, or rows where the site did not change)
df$distance[is.na(df$dtime)] <- NA

# summary per id (df is still grouped by id)
df %>% summarize(disttraveled = sum(distance, na.rm = TRUE),
                 totaltime = sum(dtime, na.rm = TRUE))
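
For readers who want the one-row-per-visit table from the question while staying in dplyr, the following is a rough sketch and not part of the original answer. The column name `leg` is a hypothetical choice, and the run counter is built with `cumsum()` and `lag()`; recent dplyr versions (1.1.0 and later) also offer `consecutive_id()` for the same purpose.

```r
library(dplyr)

df <- data.frame(
  id   = c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B"),
  site = c("a", "a", "b", "a", "c", "c", "c", "d", "a", "b"),
  time = 1:10,
  lat  = c(1, 1, 2, 1, 3, 3, 3, 4, 1, 2),
  lon  = c(1, 1, 2, 1, 3, 3, 3, 4, 1, 2),
  stringsAsFactors = FALSE
)

legs <- df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  # leg counter: increases by one each time the site differs from the previous row
  mutate(leg = cumsum(site != lag(site, default = first(site))) + 1L) %>%
  # one row per visit, keeping the station coordinates
  group_by(id, leg, site, lat, lon) %>%
  summarize(timeStart = min(time), timeEnd = max(time), .groups = "drop")

nrow(legs)
# [1] 7
```

Individual A now has two separate rows for site a, with start and end times 1 & 2 and 4 & 4, as described in the question's comments.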
Dave2e
  • Thanks @Dave2e! I did tweak the solution so that times were only calculated when the individual moved between sites (not when it was stationary). – tnt Jan 04 '19 at 20:20