-1

I am currently facing a dataset of taxi trips by a driver in NYC. I got the driver ID as well as the pickup date and time and dropoff date and time for every trip. Now I want to calculate the waiting time between the dropoff time of the last trip and the pickup time of the new trip. Therefore I have to calculate the time difference between two columns with one Lag (because dropoff time refers to the last trip and pickup time to the next trip (next column)) grouped by driver ID (to make sure I am not calculating the time difference between trips of two different drivers).

A possible data set looks like this:

hack_license = c("303F79923DA5DA7A10DF15E2D91CDCF7","697ABFCDF7E7C77A01183C857132F2A4","697ABFCDF7E7C77A01183C857132F2A4","697ABFCDF7E7C77A01183C857132F2A4","ABE23CA71E2DE84972281BA1C70B6EBB","ABE23CA71E2DE84972281BA1C70B6EBB","BA83D7C383EAA4F9D78A1A8B83CB3E92","BA83D7C383EAA4F9D78A1A8B83CB3E92","D476A1872F1F6594BD638C274483ED06","D476A1872F1F6594BD638C274483ED06")

pickup_datetime = c("2013-12-31 23:01:07","2013-12-31 23:04:00","2013-12-31 23:31:00","2013-12-31 23:40:00","2013-12-31 23:16:39","2013-12-31 23:24:05","2013-12-31 23:09:10","2013-12-31 23:26:26","2013-12-31 23:13:00","2013-12-31 23:22:00")

dropoff_datetime = c("2013-12-31 23:20:33","2013-12-31 23:28:00","2013-12-31 23:33:00","2013-12-31 23:48:00","2013-12-31 23:22:29","2013-12-31 23:28:37","23:21:24","2013-12-31 23:36:54","2013-12-31 23:20:00","2013-12-31 23:27:00")

data <- data.frame(hack_license,pickup_datetime,dropoff_datetime)

I tried to use dplyr and lubridate like this, but it doesn't work.

data %>%
group_by(data$hack_license) %>%
  group_by(hack_license) %>%
  mutate(waiting_time_in_secs = difftime(pickup_datetime,                                       
lag(dropoff_datetime), units = 'secs'))

Maybe some of you can help me out here. Would be great!

tamtam
  • 3,541
  • 1
  • 7
  • 21
Maximilian
  • 89
  • 1
  • 7

1 Answers1

0

You can create a datetime column for both pickup and dropoff and for each hack_license calculate the difference in time between the current pickup time and previous drop off time.

library(dplyr)
library(lubridate)

data <- data %>%
          mutate(pickup_datetime = ymd_hms(pickup_datetime), 
                 dropoff_datetime = ymd_hms(dropoff_datetime)) %>%
           group_by(hack_license) %>%
           mutate(waiting_time_in_secs = as.numeric(difftime(pickup_datetime, 
                                lag(dropoff_datetime), units = 'secs')))
data
#   hack_license                     pickup_datetime     dropoff_datetime    waiting_time_in_secs
#   <chr>                            <dttm>              <dttm>                             <dbl>
# 1 303F79923DA5DA7A10DF15E2D91CDCF7 2013-12-31 23:01:07 2013-12-31 23:20:33                   NA
# 2 697ABFCDF7E7C77A01183C857132F2A4 2013-12-31 23:04:00 2013-12-31 23:28:00                   NA
# 3 697ABFCDF7E7C77A01183C857132F2A4 2013-12-31 23:31:00 2013-12-31 23:33:00                  180
# 4 697ABFCDF7E7C77A01183C857132F2A4 2013-12-31 23:40:00 2013-12-31 23:48:00                  420
# 5 ABE23CA71E2DE84972281BA1C70B6EBB 2013-12-31 23:16:39 2013-12-31 23:22:29                   NA
# 6 ABE23CA71E2DE84972281BA1C70B6EBB 2013-12-31 23:24:05 2013-12-31 23:28:37                   96
# 7 BA83D7C383EAA4F9D78A1A8B83CB3E92 2013-12-31 23:09:10 2013-12-31 23:21:24                   NA
# 8 BA83D7C383EAA4F9D78A1A8B83CB3E92 2013-12-31 23:26:26 2013-12-31 23:36:54                  302
# 9 D476A1872F1F6594BD638C274483ED06 2013-12-31 23:13:00 2013-12-31 23:20:00                   NA
#10 D476A1872F1F6594BD638C274483ED06 2013-12-31 23:22:00 2013-12-31 23:27:00                  120
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Hey Ronak, thanks for you reply. I tried your idea and it says: Error in get(genname, envir = envir) : Object 'vec_ptype2' not found Do you know what it means and what I have to do now? – Maximilian Oct 18 '20 at 09:23
  • Ronak, do you also do gsub to convert auto selected "date time" by R when it reads a time column from excel? For ex, when I do read_excel it picks up time "10:30:10" as "1899-12-31 10:30:10" – Karthik S Oct 18 '20 at 09:31
  • @MaxS It means you are calling columns which is not present in your data. What is `vec_ptype2` ? I used column names based on image of data. If you still have issues applying answer to the data I would suggest to add part of your data using `dput`. Edit your post with `dput(head(data, 10))`. – Ronak Shah Oct 18 '20 at 09:43
  • @KarthikS I have not faced such issue till now when reading times into R from excel. – Ronak Shah Oct 18 '20 at 09:44
  • @Ronak, I am trying but it seems like my data frame is too big to use `dput` . I already tried to break it down with `head()` and `top_n` but the problem stays the same. Is there another way to create a sufficient example of the data? – Maximilian Oct 18 '20 at 10:33
  • You can select only relevant columns for `dput` which are needed for this post. For example, `dput(head(data[1:5], 10))` for first 5 column and 10 rows. – Ronak Shah Oct 18 '20 at 10:34
  • Hey @RonakShah, I edited the questions so now there's a code that you can use to create a sample dataset. I'm sorry for the inconvenience but `dput` didn't work... as you can see: There is no `vec_ptype2`in the data. But I'm still getting the error, that R can't find the object. – Maximilian Oct 18 '20 at 12:08
  • @MaxS you have missing date for one `dropoff_datetime`. If we correct that then the above answer works for me. See updated answer. – Ronak Shah Oct 18 '20 at 12:15
  • @RonakShah It worked! I got some weird resuls sometimes but I will look into it later. :-) Thanks a lot! – Maximilian Oct 18 '20 at 12:39