1

I have a dataset x_output that looks like this:

          timestamp   city wait_time weekday
2015-07-14 09:00:00 Boston       1.4 Tuesday
2015-07-14 09:01:00 Boston       2.5 Tuesday
2015-07-14 09:02:00 Boston       2.8 Tuesday
2015-07-14 09:03:00 Boston       1.6 Tuesday
2015-07-14 09:04:00 Boston       1.5 Tuesday
2015-07-14 09:05:00 Boston       1.4 Wednesday

I would like to find the mean wait_time, grouped by city, weekday, and time. Basically, given your city, what is the average wait time for Monday, for example? Then Tuesday?

I'm having difficulty creating the time column given x_output$timestamp; I'm currently using:

x_output$time <- strsplit(as.character(x_output$timestamp), split = " ")[[1]][2]

However, that simply puts "09:00" in every row, not the correct time for each individual row.

Secondly, I need to have a 3-way grouping to find the mean wait_time given city, weekday and time. This is something that's fairly straightforward to do in python pandas, but I can find very little documentation on it in R (and unfortunately I need to do it in R, not python).

I've looked into using data.table but that hasn't seemed to work. Is there a simple function like there would be in python pandas (eg. df.groupby(['col1', 'col2', 'col3']).mean())?

Alex Petralia
  • 1,730
  • 1
  • 22
  • 39
  • 1
    Something like `df %>% group_by(city, weekday) %>% mutate(MeaD = mean(wait_time))` with `dplyr` but please post useful code. Your code has only `NA`s in wait time and only one city and only one day. With this date is impossible to help you. – SabDeM Jul 16 '15 at 00:56
  • Sorry about that! Fixed now. I will test this out shortly. – Alex Petralia Jul 16 '15 at 01:03
  • Should be more like `sapply(strsplit(as.character(x_output$timestamp), split = " "),'[',2)` to extract the second element from each vector in the list. – MrFlick Jul 16 '15 at 01:05
  • What is the criteria to group by `timestamp`? every day? every 12 hours? every week? – SabDeM Jul 16 '15 at 01:12
  • One more thing: you are asking 3 things (slightly unrelated), while the title is only about grouping data. Maybe it is better for you open another question (or do a search) about how to parse the `timestamp` column as a date and time. – SabDeM Jul 16 '15 at 01:22

1 Answers1

0

Mean wait_time grouped by city, weekday, time:

library(plyr)
ddply(x_output, .(city, weekday, time), summarize, avg=mean(wait_time))

If you wanted data.table

x_output[, list(avg=mean(wait_time)), .(city, weekday, time)]

I'm having difficulty creating the time column given x_output$timestamp

Well, what is the time column supposed to have in it? Just the time component of timestamp? Is timestamp a POSIXct or a string?

If it is a POSIXct, then you can just convert to character, specifying the time format:

x_output$time <- as.character(x_output$timestamp, '%H:%M')
# or as.factor(as.character(...)) if you need it to be a factor.
# in data.table: x[, time:=as.character(timestamp, '%H:%M')]

This will make the time column a string with the hour and minutes. See ?strptime for more options on converting that datetime to a string (e.g. if you want to include seconds).

If it is a string, you could strsplit and extract the second component:

vapply(strsplit(x_output$timestamp, ' '), '[', i=2, 'template')

which will give you "HH:MM:SS" as your time format. If you want to do a custom time format, probably best to convert your timestamp string into a POSIXct and back out to the specific format like already mentioned.

mathematical.coffee
  • 55,977
  • 11
  • 154
  • 194
  • The timestamp -> time worked perfectly. I tried both methods for the multiple groupby. Using `ddplyr`, I received `Error in attributes(out) <- attributes(col) : 'names' attribute [11] must be the same length as the vector [1]`. Using the data.table, I received `Error in drop && !has.j : invalid 'x' type in 'x && y'`. My colclasses are (POSIXlt, factor, numeric, character, character). – Alex Petralia Jul 16 '15 at 01:46
  • In that case, you must provide a reproducible example. The data you have provided in your question works fine on the code I have provided, but only has one record for each city and time. – mathematical.coffee Jul 16 '15 at 01:51
  • I feel a bit iffy about randomly connecting to a mystery IP from my work computer (I tried anyway and it timed out - probably firewall). In any case, the problem appears to be associated with your specific data, so you will have to narrow down if something's malformed in there. – mathematical.coffee Jul 16 '15 at 03:17
  • Ok I will look into this shortly or post the .csv elsewhere if I can't get it to work. – Alex Petralia Jul 16 '15 at 14:28
  • While I'm trying to figure this out, here's the data set and code so far: https://github.com/alexpetralia/RMV_wait_times – Alex Petralia Jul 16 '15 at 17:16
  • Your `wait_times.csv` does not have data in the format of your question, and I do not have time to wade through your shiny code to try grab out the bits that reconstruct it. Please provide a reproducible example. – mathematical.coffee Jul 17 '15 at 03:18
  • See also http://stackoverflow.com/questions/14153092/meaning-of-ddply-error-names-attribute-9-must-be-the-same-length-as-the-vec - looks like your timestamp is `POSIXlt`. Make it a `POSIXct` instead. – mathematical.coffee Jul 17 '15 at 03:22
  • This worked. I have one final question which I'll open up a new question for. Thanks a lot for your help so far. – Alex Petralia Jul 17 '15 at 12:12