3

The below is an example of the data I have.

     date       time    size    filename            day.of.week
1   2015-01-16  5:36:12 1577    01162015053400.xml  Friday
2   2015-01-16  5:38:09 2900    01162015053600.xml  Friday
3   2015-01-16  5:40:09 3130    01162015053800.xml  Friday

What I would like to do is sum up the size of the files for each hour.

I would like a resulting data table that looks like:

date        hour   size
2015-01-16  5      7607
2015-01-16  6      10000

So forth and so on.

But I can't quite seem to get the output I need.

I've tried ddply and aggregate, but I'm summing up the entire day, I'm not sure how to break it down by the hour in the time column.

And I've got multiple days worth of data. So it's not only for that one day. It's from that day, almost every day until yesterday.

Thanks!

smci
  • 32,567
  • 20
  • 113
  • 146
James White
  • 535
  • 10
  • 24
  • Either a) you **[use a compound group_by expression](http://stackoverflow.com/questions/24573485/calculate-hourly-means-in-r-based-on-multiple-conditions?rq=1)** (in which case I recommend dplyr) or else b) you **[convert the separate, date+time fields to a datetime object](http://stackoverflow.com/questions/8985124/how-to-create-a-datetime-object-from-separate-date-fields)**. Either way, this is a duplicate. Once you clarify which you prefer, this should be closed as duplicate. My recommendation is dplyr with compound group_by expression, it's performant and clean. – smci Mar 06 '15 at 01:56

4 Answers4

2

The following should do the trick, assuming your example data are stored in a data frame called "test":

library(lubridate) # for hms and hour functions
test$time <- hms(test$time) 
test$hour <- factor(hour(test$time))
library(dplyr)
test %>%
   select(-time) %>% # dplyr doesn't like this column for some reason
   group_by(date, hour) %>%
   summarise(size=sum(size))
goctlr
  • 41
  • 2
  • My data frame is called s3_files. When I change all instances of test to s3_files and run this, I get one number back, 1220046076. Which I believe is the sum of all the files. Not by hour by day. – James White Mar 05 '15 at 18:28
1

You can use data.table

library(data.table)
# Define a time stamp column.
dt[, timestamp=as.POSIXct(strptime(paste(df$date, df$time), format = "%Y-%m-%d %H:%M:%S"))]
# Aggregate by hours
dt[, size = .N, by = as.POSIXct(round(timestamp, "hour"))]

Benefit is that data.table is blazing fast!

Stereo
  • 1,148
  • 13
  • 36
0

Use a compound group_by(day,hour)

That will do it.

smci
  • 32,567
  • 20
  • 113
  • 146
  • I think that will do it, but the time column is a factor and I can't find a way to convert a factor to a time, only a date time. Do I need to combine date and time columns together and turn them into a DateTime class? – James White Mar 05 '15 at 14:46
  • You don't necessarily need to combine them, but that's one way that could work. See all the many duplicates on that issue, it has been asked so many times it's not funny. – smci Mar 05 '15 at 14:59
  • Please point out and flag as a duplicate. This will be more useful to the site than answering the same question again. – Benjamin Mar 06 '15 at 00:50
  • @Benjamin: it most certainly **does** provide an answer to the question. The key part is a **compound group_by expression**. Read it again. If the OP wants to ask a follow-on question ("But I don't want to use a compound group_by expression... how to convert to one unified datetime field"), that's their issue. You got sidetracked on that. That is not the question asked. I answered the question asked. We're not getting into duplicates of the question that was not asked. – smci Mar 06 '15 at 01:38
  • @smci: That last comment is the auto comment from the review queue. I meant that since you are aware that this is a duplicate question, then your answer is also a duplicate answer, and belongs as a comment, along with a link to the duplicate, rather than as an answer, in my opinion. – Benjamin Mar 06 '15 at 01:42
  • @Benjamin: no it's not, you've gone off on a tangent. You're not understanding any of what's going on here. I gave the answer to the original question ("use a **compound group_by expression**"). You've gone off on a tangent about how to convert date,time to a unified datetime. That was never the question that was asked. I'm not playing moving targets with the OP. End of story. – smci Mar 06 '15 at 01:45
  • 1
    I think this answer (as it is now) is not helpful, because it's extremely incomplete: The original data frame neither has `day`, nor `hour`. – U. Windl May 16 '17 at 14:28
0

If you convert your date and time columns into a single POSIX date when (similar to a previous answer, i.e. df$when <- as.POSIXct(strptime(paste(df$date, df$time), format = "%Y-%m-%d %H:%M:%S"))), you could use:

aggregate(df[c("size")], FUN=sum, by=list(d=as.POSIXct(trunc(df$when, "hour"))))
U. Windl
  • 3,480
  • 26
  • 54