0

I have two data frames, trips which are unique trips taken by bikes that have a unique id, and intervals, which show the locations of bikes ids at every 10 minutes. My objective is to remove records from intervals if time falls in between start and finish and bike_ids are the same. Time is of posixCT class and the original dataframes have hundreds of thousands of records

For example, the outcome of these two datasets below should be:

> trips
  bike_id               start              finish
1       1 2017-11-22 15:52:36 2017-11-22 17:47:53
2       2 2017-11-22 16:05:44 2017-11-22 16:23:25
3       3 2017-11-22 16:31:06 2017-11-22 17:11:20


  > intervals
                      time bike_id
    3  2017-11-22 16:00:03       1
    4  2017-11-22 16:10:03       1
    5  2017-11-22 16:20:02       1
    6  2017-11-22 16:30:02       1
    7  2017-11-22 16:40:03       1
    8  2017-11-22 16:50:02       1
    9  2017-11-22 17:00:02       1
    10 2017-11-22 17:10:02       1
    11 2017-11-22 17:20:03       1
    12 2017-11-22 17:30:03       1
    13 2017-11-22 16:00:03       2
    14 2017-11-22 16:10:03       2
    15 2017-11-22 16:20:02       2
    16 2017-11-22 16:30:02       2
    17 2017-11-22 16:40:03       2
    18 2017-11-22 16:50:02       2
    19 2017-11-22 17:00:02       2
    20 2017-11-22 17:10:02       2
    21 2017-11-22 17:20:03       2
    22 2017-11-22 17:30:03       2
    23 2017-11-22 16:30:02       3
    24 2017-11-22 16:40:03       3
    25 2017-11-22 16:50:02       3
    26 2017-11-22 17:00:02       3
    27 2017-11-22 17:10:02       3
    28 2017-11-22 17:20:03       3
    29 2017-11-22 17:30:03       3

And the outcome

  > outcome
                      time bike_id
    13 2017-11-22 16:00:03       2
    16 2017-11-22 16:30:02       2
    17 2017-11-22 16:40:03       2
    18 2017-11-22 16:50:02       2
    19 2017-11-22 17:00:02       2
    20 2017-11-22 17:10:02       2
    21 2017-11-22 17:20:03       2
    22 2017-11-22 17:30:03       2
    23 2017-11-22 16:30:02       3
    28 2017-11-22 17:20:03       3
    29 2017-11-22 17:30:03       3

Not exactly sure where to start. Any suggestions on where to start with dplyr or an apply function would be appreciated!

Here is the sample data:

> dput(intervals)
structure(list(time = structure(c(1511384403.94561, 1511385003.17654, 
1511385602.47887, 1511386202.99895, 1511386803.18361, 1511387402.98233, 
1511388002.69461, 1511388602.5818, 1511389203.52712, 1511389803.652, 
1511384403.94561, 1511385003.17654, 1511385602.47887, 1511386202.99895, 
1511386803.18361, 1511387402.98233, 1511388002.69461, 1511388602.5818, 
1511389203.52712, 1511389803.652, 1511386202.99895, 1511386803.18361, 
1511387402.98233, 1511388002.69461, 1511388602.5818, 1511389203.52712, 
1511389803.652), class = c("POSIXct", "POSIXt"), tzone = ""), 
    bike_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 
    2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3)), .Names = c("time", 
"bike_id"), row.names = 3:29, class = "data.frame")

> dput(trips)
structure(list(bike_id = c(1, 2, 3), start = structure(c(1511383956, 
1511384744, 1511386266), class = c("POSIXct", "POSIXt"), tzone = ""), 
    finish = structure(c(1511390873, 1511385805, 1511388680), class = c("POSIXct", 
    "POSIXt"), tzone = "")), .Names = c("bike_id", "start", "finish"
), row.names = c(NA, 3L), class = "data.frame")
iskandarblue
  • 7,208
  • 15
  • 60
  • 130

3 Answers3

3

I am very new to the package, so please test the following approach carefully.

The reason I selected not is because this task requires join by range, which is not able to perform for now. Here is a solution using the foverlaps function.

library(data.table)

# Convert the data frame to data.table
setDT(intervals)
setDT(trips)

# Create a second column time2, which is the same as time
intervals[, time2 := time]

# Set keys in trips
setkey(trips, bike_id, start, finish)

# Conduct join by bike_id and time
# The columns in intervals used for join by time 
# needs to be in the last two in by.x
intervals2 <- foverlaps(intervals, trips, 
                        by.x = c("bike_id", "time", "time2"))

# Filter the ones with NA in start, which means no match
# And then selct the time and bike_id column
outcome <- intervals2[is.na(start)][, .(time, bike_id)] 
outcome

#                   time bike_id
#  1: 2017-11-22 16:00:03       2
#  2: 2017-11-22 16:30:02       2
#  3: 2017-11-22 16:40:03       2
#  4: 2017-11-22 16:50:02       2
#  5: 2017-11-22 17:00:02       2
#  6: 2017-11-22 17:10:02       2
#  7: 2017-11-22 17:20:03       2
#  8: 2017-11-22 17:30:03       2
#  9: 2017-11-22 16:30:02       3
# 10: 2017-11-22 17:20:03       3
# 11: 2017-11-22 17:30:03       3 
www
  • 38,575
  • 12
  • 48
  • 84
1

This can be solved with a kind of non-equi anti-join.

Non-equi joins are available in data.table since version 1.9.8 (on CRAN 25 Nov 2016) and can be used as a handy replacement for foverlaps() in many cases. In particular, foverlaps() requires the second parameter to be keyed while non-equi join works with unkeyed and keyed data.tables likewise.

First, a non-equi join is used to identify the indices of the rows of intervals which lie within the start and finish times of trips. Then, these rows are removed from intervals

library(data.table)
tmp <- setDT(intervals)[setDT(trips), on = .(bike_id, time >= start, time <= finish), 
                        which = TRUE]
intervals[!tmp]

                   time bike_id
 1: 2017-11-22 16:00:03       2
 2: 2017-11-22 16:30:02       2
 3: 2017-11-22 16:40:03       2
 4: 2017-11-22 16:50:02       2
 5: 2017-11-22 17:00:02       2
 6: 2017-11-22 17:10:02       2
 7: 2017-11-22 17:20:03       2
 8: 2017-11-22 17:30:03       2
 9: 2017-11-22 16:30:02       3
10: 2017-11-22 17:20:03       3
11: 2017-11-22 17:30:03       3

tmp contains the indices of the rows to be removed:

tmp
 [1]  1  2  3  4  5  6  7  8  9 10 12 13 22 23 24 25
Uwe
  • 41,420
  • 11
  • 90
  • 134
0

Here's my answer. trips is the reference dataset.

matched() is a function to match on start and finish from trips onto intervals.

Answer

trips <- data.frame(bike_id = 1:3, 
                    start = as.POSIXct(c("2017-11-22 15:52:36", "2017-11-22 16:05:44", "2017-11-22 16:31:06")),
                    finish = as.POSIXct(c("2017-11-22 17:47:53","2017-11-22 16:23:25","2017-11-22 17:11:20")))%>%
         mutate(start = as.numeric(start),
                finish = as.numeric(finish))


matched <- function(var1, var2, df1, df2){
return(df2[,var1][match(df1[,var2],df2[,var2])])
}



intervals%>%
mutate(time_num = as.numeric(time),
       start = matched("start", "bike_id", intervals , trips),
       finish = matched("finish", "bike_id", intervals , trips))%>%
filter(time_num < start | time_num > finish)%>%
select(time, bike_id)


                  time bike_id
1  2017-11-22 16:00:03       2
2  2017-11-22 16:30:02       2
3  2017-11-22 16:40:03       2
4  2017-11-22 16:50:02       2
5  2017-11-22 17:00:02       2
6  2017-11-22 17:10:02       2
7  2017-11-22 17:20:03       2
8  2017-11-22 17:30:03       2
9  2017-11-22 16:30:02       3
10 2017-11-22 17:20:03       3
11 2017-11-22 17:30:03       3

I couldn't get between() to work for some odd reason. I'll have to see about that later.

InfiniteFlash
  • 1,038
  • 1
  • 10
  • 22