
I have a CSV file that looks like this, where "time" is a UNIX timestamp:

time,count
1300162432,5
1299849832,0
1300006132,1
1300245532,4
1299932932,1
1300089232,1
1299776632,9
1299703432,14
... and so on

I am reading it into R and converting the time column into POSIXct like so:

data <- read.csv(file="data.csv", header=TRUE, sep=",")
data[,1] <- as.POSIXct(data[,1], origin="1970-01-01")

Great so far, but now I would like to build a histogram with each bin corresponding to the average hourly count. I'm stuck on selecting by hour and then counting. I've looked through ?POSIXt and ?cut.POSIXt, but if the answer is in there, I am not seeing it.

Any help would be appreciated.

Dan Barowy
  • Have a look at http://stackoverflow.com/questions/1256347/plot-time-data-in-r-to-various-resolutions-to-the-minute-to-the-hour-to-the-se – Benjamin Mar 18 '11 at 18:25

3 Answers


Here is one way:

R> lines <- "time,count
1300162432,5
1299849832,0
1300006132,1
1300245532,4
1299932932,1
1300089232,1
1299776632,9
1299703432,14"
R> con <- textConnection(lines); df <- read.csv(con); close(con)
R> df$time <- as.POSIXct(df$time, origin="1970-01-01")
R> df$hour <- as.POSIXlt(df$time)$hour
R> df
                 time count hour
1 2011-03-15 05:13:52     5    5
2 2011-03-11 13:23:52     0   13
3 2011-03-13 09:48:52     1    9
4 2011-03-16 04:18:52     4    4
5 2011-03-12 12:28:52     1   12
6 2011-03-14 08:53:52     1    8
7 2011-03-10 17:03:52     9   17
8 2011-03-09 20:43:52    14   20
R> tapply(df$count, df$hour, FUN=mean)
 4  5  8  9 12 13 17 20 
 4  5  1  1  1  0  9 14 
R> 

Your data doesn't yet have multiple entries per hour of the day, but this would average over them, with the hours properly parsed from the POSIX timestamps. You can adjust the TZ info as needed.
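To turn those hourly means into the bar chart the question asks for, `barplot` on the `tapply` result works. This is a sketch; the data frame is rebuilt here with an explicit `tz = "UTC"` so the extracted hours don't depend on the local timezone (with the default local timezone the hour labels can differ):

```r
# Rebuild the sample data, compute mean count per hour-of-day,
# and draw it as a bar chart. tz = "UTC" is assumed here so the
# extracted hours are reproducible across machines.
df <- read.csv(text = "time,count
1300162432,5
1299849832,0
1300006132,1
1300245532,4
1299932932,1
1300089232,1
1299776632,9
1299703432,14")
df$time <- as.POSIXct(df$time, origin = "1970-01-01", tz = "UTC")
df$hour <- as.POSIXlt(df$time)$hour
hourly <- tapply(df$count, df$hour, FUN = mean)
barplot(hourly, xlab = "Hour of day", ylab = "Mean count")
```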

Dirk Eddelbuettel
  • Beautiful! After posting, I came up with a way using c, subset, and mean, but I had to have a subset and mean call for every 'bin'. This is much easier to understand. Thank you! – Dan Barowy Mar 21 '11 at 13:35

There's a good post on this topic on Mages' blog. To get the bucketed data:

aggregate(. ~ cut(time, 'hours'), data, mean)
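Applied to the sample data from the question (a quick sketch; each of these samples falls in a different clock hour, so every bucket here happens to contain a single row):

```r
# Bucket the question's sample data by clock hour with the
# aggregate() one-liner above. tz = "UTC" keeps the output
# independent of the local timezone.
data <- read.csv(text = "time,count
1300162432,5
1299849832,0
1300006132,1")
data$time <- as.POSIXct(data$time, origin = "1970-01-01", tz = "UTC")
hourly <- aggregate(. ~ cut(time, "hours"), data, mean)
print(hourly)
```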

If you just want a quick graph, ggplot2 is your friend:

qplot(cut(time, "hours"), count, data=data, stat='summary', fun.y='mean')

Unfortunately, because cut returns a factor, the x axis won't work properly. You may want to write your own, less awkward bucketing function for time, e.g.

timebucket = function(x, bucketsize = 1,
                      units = c("secs", "mins", "hours", "days", "weeks")) {
  units = match.arg(units)
  # bucket width in seconds
  secs = as.numeric(as.difftime(bucketsize, units=units), units="secs")
  # floor each timestamp to the start of its bucket, keeping the
  # standard POSIXct class so axes format as date-times
  structure(floor(as.numeric(x) / secs) * secs, class=c('POSIXct','POSIXt'))
}
qplot(timebucket(time, units="hours"), ...)
Alice Purcell

You can calculate the hour "bin" for each time by converting to a POSIXlt and subtracting off the minute and second components. Then you can add a new column to your data frame containing the hour bin marker, like so:

date.to.hour <- function (vec)
{
    as.POSIXct(
        sapply(
            vec,
            function (x)
            {
                # Zero out the minute and second components so every
                # timestamp lands on the top of its hour
                lt <- as.POSIXlt(x)
                x - 60*lt$min - lt$sec
            }),
        tz="GMT",
        origin="1970-01-01")
}

data$hour <- date.to.hour(as.POSIXct(data[,1], origin="1970-01-01"))
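As a side note (not part of the original answer), base R's `trunc` method for date-times does the same flooring in one vectorised call, avoiding the `sapply` loop:

```r
# trunc() on a POSIXlt drops everything below the requested unit,
# so this floors each timestamp to the top of its hour without a loop.
v <- as.POSIXct(c(1300162432, 1299849832), origin = "1970-01-01", tz = "GMT")
hour.bin <- as.POSIXct(trunc(as.POSIXlt(v), units = "hours"))
print(format(hour.bin, "%H:%M:%S"))   # minutes and seconds are zeroed
```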
Daniel Dickison