3

I have a series of coordinates for fish caught from a boat at different datetimes and different trips. How do I determine whether the coordinates of a fish are likely to be incorrect (e.g. due to transcription error) based on time since last fish caught within that same trip and an assumed boat speed (say 10km/hour).

Here is a simple example dataset with 2 trips and two fish per trip.

library(sf)
library(ggplot2)
library(dplyr)
library(lubridate)

datetime <- ymd_hms('2017-05-13 14:00:00', tz = "Etc/GMT+8")
df <- data_frame(DateTimeCapture = c(datetime, datetime + minutes(35), datetime + days(2), 
                                     datetime + days(2) + minutes(20)),
                 Trip = c('1', '1', '2', '2'),
                 Order = c(1, 2, 1, 2),
                 X = c(648635, 648700, 647778, 658889),
                 Y = c(5853151, 5853200, 5854292, 5870000))

# if you prefer to work in sf
df_sf <-  st_as_sf(df, coords = c('X', 'Y'), crs = 32610)

# quick plot
ggplot() + 
  geom_point(data = df, aes(x = X, y = Y, color = Trip)) 

enter image description here

The distance between the two fish in the second trip is 19km:

st_distance(df_sf[3:4, ])
Units: m
         [,1]     [,2]
[1,]     0.00 19240.47
[2,] 19240.47     0.00

It is unlikely that a boat could travel 19km in 20 minutes. Thus this should be flagged as a possible error.

My preference is for solutions using sf, but may also accept solutions using sp. It has to be r-based solution.

sebdalgarno
  • 2,929
  • 12
  • 28

2 Answers2

2

This might solve your problem:

fun1 <- function(k){
  dat <- st_as_sf(df[which(df$Trip == k),], coords = c('X', 'Y'), crs = 32610)
  times <- as.numeric(diff(dat$DateTimeCapture))
  distances <- st_distance(dat)
  distances <- diag(distances[-1,])

  tresh <- 10000/60 # 10km/h is our treshold here
  problematic <- as.numeric(distances/times) > tresh

  if(length(which(problematic)) >= 1){
    v <- matrix(F, nrow = length(dat$Trip))
    v[which(problematic)+1] <- T
    return(v)
  }
  if(length(which(problematic)) == 0){
    v <- matrix(F, nrow = length(dat$Trip))
    return(v)
  }
} # brief explanations below

My output

unlist(sapply(unique(df$Trip), fun1, simplify = F))
   11    12    21    22 
FALSE FALSE FALSE  TRUE 

# and now cbinding it into the data frame:
> newcol <- unlist(sapply(unique(df$Trip), fun1, simplify = F))
> df <- cbind(df, newcol)
> df
       DateTimeCapture Trip Order      X       Y newcol
11 2017-05-14 00:00:00    1     1 648635 5853151  FALSE
12 2017-05-14 00:35:00    1     2 648700 5853200  FALSE
21 2017-05-16 00:00:00    2     1 647778 5854292  FALSE
22 2017-05-16 00:20:00    2     2 658889 5870000   TRUE

Brief explanation

The above function checks if a given trip contains anomalies.

  1. It computes the time differences (times) and the distance matrix (distances).
  2. Now it suffices to consider the sub- or super-diagonal of distances. Indeed, for a given trip, these diagonals both contain all the distances between two consecutive captures.
  3. Then, all it remains to do is to check whether distance/time > tresh (here 10km/h).

Now, that function can be adapted, polished, etc. E.g. you might want to pass tresh as an argument into the function and give it a default value using missing().

Disclaimer I slightly edited your data (added a third point in trip 2 to have a more interesting test case):

df <- data.frame(DateTimeCapture = c(datetime, datetime + minutes(35), datetime + days(2), 
                                 datetime + days(2) + minutes(20), datetime + days(2) + minutes(45)),
             Trip = c('1', '1', '2', '2', '2'),
             Order = c(1, 2, 1, 2, 3),
             X = c(648635, 648700, 647778, 658889, 658999),
             Y = c(5853151, 5853200, 5854292, 5870000, 5890978))
niko
  • 5,253
  • 1
  • 12
  • 32
  • I would prefer the output to be a logical vector (e.g. that I can add as a column to original data.frame) indicating whether an individual fish is potentially erroneous. – sebdalgarno Mar 30 '18 at 00:00
  • @sebdalgarno PS: the function now returns `TRUE` if the given fish is potentially erroneous and `FALSE` else. – niko Mar 30 '18 at 00:22
  • see comment above: the first fish in a trip should never be TRUE – sebdalgarno Mar 30 '18 at 00:33
  • @sebdalgarno using `fun1`, the first fish will never be `TRUE`. (`Order` corresponds to the different fish right?) – niko Mar 30 '18 at 00:36
2

The sf::st_distance() generates a distance matrix between all geometries.

From this matrix we can extract just the distances we care about, then use those distances to calculate the speed traveled, and add a flag if it's over a certain threshold

library(dplyr)

max_speed <- 10 ## km/h


df_sf %>%
    mutate(distance = {
        dist_mat <- sf::st_distance(.)
        distances <- dist_mat[ upper.tri(dist_mat) ]
        idx <- cumsum(2:ncol(dist_mat) - 1)
        distances <- c(0, distances[ idx ] )
        distances[.$Order == 1] <- 0         ## first trip gets 0 distance
        distances
    }) %>%
    mutate( time = as.numeric(difftime(DateTimeCapture, lag(DateTimeCapture))),
                    speed = distance / time) %>%
    mutate( error_flag = speed > max_speed ) 


# 
# Simple feature collection with 4 features and 7 fields
# geometry type:  POINT
# dimension:      XY
# bbox:           xmin: 647778 ymin: 5853151 xmax: 658889 ymax: 5870000
# epsg (SRID):    32610
# proj4string:    +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs
# # A tibble: 4 x 8
#    DateTimeCapture     Trip  Order distance   time   speed error_flag               geometry
#    <dttm>              <chr> <dbl>    <dbl>  <dbl>   <dbl> <lgl>           <sf_geometry [m]>
# 1  2017-05-14 08:00:00 1      1.00      0     NA    NA     NA         POINT (648635 5853151)
# 2  2017-05-14 08:35:00 1      2.00     81.4   35.0   2.33  F          POINT (648700 5853200)
# 3  2017-05-16 08:00:00 2      1.00      0   2845     0     F          POINT (647778 5854292)
# 4  2017-05-16 08:20:00 2      2.00  19240     20.0 962     T          POINT (658889 5870000)

Detail

A little bit of detail about what's going on in the first mutate call to get the distances.

The st_distance() function gives a distance matrix from each geometry to each other.

dist_mat <- sf::st_distance(df_sf)
dist_mat
# Units: m
#             [,1]        [,2]      [,3]     [,4]
# [1,]     0.00000    81.40025  1427.000 19723.93
# [2,]    81.40025     0.00000  1429.177 19648.30
# [3,]  1427.00035  1429.17739     0.000 19240.47
# [4,] 19723.92752 19648.30072 19240.467     0.00

From this matrix we want the values at [1, 2], [2, 3] and [3, 4]

So to start we can take the upper-triangle

distances <- dist_mat[ upper.tri(dist_mat) ]
distances
# Units: m
# [1]    81.40025  1427.00035  1429.17739 19723.92752 19648.30072 19240.46738

Then grab the 1st, 3rd and 6th indeces of this vector

idx <- c(cumsum(2:ncol(dist_mat) - 1))
idx
# [1] 1 3 6

To give us the distances

c(0, distances[ idx ] )
# [1]     0.00000    81.40025  1429.17739 19240.46738
SymbolixAU
  • 25,502
  • 4
  • 67
  • 139
  • can you provide a solution that treats each trip separately? perhaps I didn't make that clear. For example, the distance value for the third fish should be 0, as it doesn't make sense to compare the distance of one fish to a fish in a separate trip. – sebdalgarno Mar 29 '18 at 23:41
  • @sebdalgarno - good point. Would it be enough just to remove the first trip in each trip, using `%>% filter(Order > 1)` ? – SymbolixAU Mar 29 '18 at 23:53
  • @sebdalgarno - I've added an extra line into the `mutate` call to set first-trip distances to 0. There are a few other ways of doing this too. – SymbolixAU Mar 30 '18 at 00:15
  • I like the answer. On my real dataset when distance difference and time difference is 0, error_flag = NA. There is an easy fix though to change all NA to FALSE – sebdalgarno Mar 30 '18 at 00:39
  • @sebdalgarno It would be interesting to see where the limit (if you hit it) is of the number of points `st_distance()` can handle. With a lot of observations you'll end up with a large matrix. – SymbolixAU Mar 30 '18 at 00:51