0

I'm getting started with R, so please bear with me.

For example, I have this data.table (or data.frame) object:

Time               Station      count_starts    count_ends
01/01/2015 00:30       A            2               3
01/01/2015 00:40       A            2               1
01/01/2015 00:55       B            1               1
01/01/2015 01:17       A            3               1
01/01/2015 01:37       A            1               1

My end goal is to round the "Time" column up to the hour and sum count_starts and count_ends for each hour and station:

Time               Station      sum(count_starts)   sum(count_ends)
01/01/2015 01:00       A            4                      4
01/01/2015 01:00       B            1                      1
01/01/2015 02:00       A            4                      2

I did some research and found out that I should use the xts library.

Thanks for helping me out

UPDATE:

I converted transactions$Time to POSIXct, so the xts package should be able to use the time series directly.

Blaze Tama
  • https://stackoverflow.com/questions/16444242/rounding-times-to-the-nearest-hour-in-r – BENY Nov 28 '17 at 03:43
  • Thanks for the link, I can "group" the datetime by hour, but the hard part is how to sum the other columns by hourly rate? – Blaze Tama Nov 28 '17 at 04:13

3 Answers

1

A solution using dplyr and lubridate. The key is to use ceiling_date to convert the date time column to hourly time-step, and then group and summarize the data.

library(dplyr)
library(lubridate)

dt2 <- dt %>%
  mutate(Time = mdy_hm(Time)) %>%
  mutate(Time = ceiling_date(Time, unit = "hour")) %>%
  group_by(Time, Station) %>%
  summarise(`sum(count_starts)` = sum(count_starts),
            `sum(count_ends)` = sum(count_ends)) %>%
  ungroup()
dt2
# # A tibble: 3 x 4
#                  Time Station `sum(count_starts)` `sum(count_ends)`
#                <dttm>   <chr>               <int>             <int>
# 1 2015-01-01 01:00:00       A                   4                 4
# 2 2015-01-01 01:00:00       B                   1                 1
# 3 2015-01-01 02:00:00       A                   4                 2

DATA

dt <- read.table(text = "Time               Station      count_starts    count_ends
'01/01/2015 00:30'       A            2               3
'01/01/2015 00:40'       A            2               1
'01/01/2015 00:55'       B            1               1
'01/01/2015 01:17'       A            3               1
'01/01/2015 01:37'       A            1               1",
                 header = TRUE, stringsAsFactors = FALSE)

Explanation

mdy_hm converts a string to a date-time class. The name stands for "month-day-year hour-minute", so it has to match the order of the fields in the string. ceiling_date rounds a date-time up to the next boundary of the specified unit (here, the hour). group_by groups the rows by the given variables (Time and Station), and summarise computes one summary row per group.
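As a small illustration of the rounding step (a sketch, not part of the original answer; the three timestamps are made up for the example, and dmy_hm is used on the assumption that the day comes first), ceiling_date and floor_date can be compared on a few values:

```r
library(lubridate)

# Illustrative timestamps (assumed day-first); dmy_hm parses them as UTC
x <- dmy_hm(c("01/01/2015 00:30", "01/01/2015 01:00", "01/01/2015 01:17"))

# Round up to the next full hour; a value already on the hour is left unchanged
up <- ceiling_date(x, unit = "hour")

# Round down to the start of the hour
down <- floor_date(x, unit = "hour")

format(up, "%H:%M")    # "01:00" "01:00" "02:00"
format(down, "%H:%M")  # "00:00" "01:00" "01:00"
```

With ceiling_date, a row stamped exactly 01:00 lands in the same bin as the rows from 00:01 to 00:59. With floor_date it lands in the bin that starts at 01:00.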

www
  • Thanks for your help. I will try your solution, but I don't understand the code. Is it possible to solve the question with `data.table` and `xts` library instead? – Blaze Tama Nov 28 '17 at 03:52
  • @BlazeTama I am not familiar with `xts` package. Perhaps others can help you. – www Nov 28 '17 at 03:52
  • @BlazeTama I have added some explanation to the code. – www Nov 28 '17 at 04:02
  • Thanks for the explanation, I will try it as an alternative. A little question, my string format is `%d/%m/%H`, can I read it with pattern? – Blaze Tama Nov 28 '17 at 04:09
  • @Blaze Tama. The reason you have an error is because you are using `as.Date` on a factor instead of on a character. try doing `as.Date(as.character(...))` – Onyambu Nov 28 '17 at 04:12
  • @Onyambu I successfully convert the column to POSIXct (updated the question), now I need to perform the hourly rate calculation somehow – Blaze Tama Nov 28 '17 at 04:14
  • @BlazeTama You can replace `mdy_hm` with `dmy_hm` if the first pattern is day. – www Nov 28 '17 at 04:15
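To make the mdy_hm versus dmy_hm choice from the comments concrete, here is a minimal sketch. The input string is hypothetical, picked so that the two readings give different dates:

```r
library(lubridate)

# Hypothetical ambiguous string: February 1 or January 2, depending on the parser
s <- "02/01/2015 00:30"

as_month_first <- mdy_hm(s)  # read as month/day/year: 1 February 2015
as_day_first   <- dmy_hm(s)  # read as day/month/year: 2 January 2015

format(as_month_first, "%Y-%m-%d %H:%M")  # "2015-02-01 00:30"
format(as_day_first, "%Y-%m-%d %H:%M")    # "2015-01-02 00:30"
```

The sample data in the question only contains 01/01/2015, so both parsers give the same result there. The difference only shows up on other dates.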
1

Base R can do the same. The only difference is that cut labels each bin with the start of the hour, so every hour is one less than in the desired output:

 dat=read.table(text = "Time               Station      count_starts    count_ends
 '01/01/2015 00:30'       A            2               3
 '01/01/2015 00:40'       A            2               1
 '01/01/2015 00:55'       B            1               1
 '01/01/2015 01:17'       A            3               1
 '01/01/2015 01:37'       A            1               1",
             header = TRUE, stringsAsFactors = FALSE)

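 # Bin each timestamp into its hour; cut() labels a bin by its start time
 dat$Time <- cut(strptime(dat$Time, "%m/%d/%Y %H:%M"), "hour")
 # Sum every remaining column within each Time/Station group
 aggregate(. ~ Time + Station, dat, sum)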
                  Time Station count_starts count_ends
 1 2015-01-01 00:00:00       A            4          4
 2 2015-01-01 01:00:00       A            4          2
 3 2015-01-01 00:00:00       B            1          1

You can use the order function to sort the rows by Time (after cut, Time is a factor whose levels are in chronological order):

 m=aggregate(.~Time+Station,dat,sum)
 m[order(m[,1]),]
                  Time Station count_starts count_ends
 1 2015-01-01 00:00:00       A            4          4
 3 2015-01-01 00:00:00       B            1          1
 2 2015-01-01 01:00:00       A            4          2
Onyambu
  • Thanks. I tried your solution, but the cut method return an error : Error in seq.int(0, to0 - from, by) : 'to' must be a finite number. I'm still solving this error. – Blaze Tama Nov 28 '17 at 05:59
  • Ohh you need to be careful. Ensure your `dat` is as originally given before you replace the `dat$Time`. Just rerun the `dat` again Then run the two codes I gave – Onyambu Nov 28 '17 at 06:06
  • I have now given the data. Run the data then the code – Onyambu Nov 28 '17 at 06:08
  • Thanks a lot for the help. I'm doing timeseries analysis, will there be a difference of "one hour less"? i.e the 00:00:00 will sum between 00:00:00 to 01:00:00 – Blaze Tama Nov 28 '17 at 06:11
  • 1
    There is no difference. 00:00:00 will sum between 00:00:00 to 01:00:00 and 01:00:00 will sum between 01:00:00 to 02:00:00 and so on.. – Onyambu Nov 28 '17 at 06:13
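If the bins should be labeled by the end of the hour instead, as in the question's desired output, one option in base R is to truncate to the hour and add 3600 seconds. This is only a sketch: the variable names are illustrative, and it uses trunc rather than cut so that the result stays a date-time instead of a factor.

```r
# Illustrative timestamps, parsed in UTC to keep the example independent of the local time zone
times <- as.POSIXct(c("2015-01-01 00:30", "2015-01-01 00:55", "2015-01-01 01:17"),
                    format = "%Y-%m-%d %H:%M", tz = "UTC")

# Start of the hour each timestamp falls in (same bins as cut(times, "hour"))
bin_start <- as.POSIXct(trunc(times, units = "hours"))

# Label each bin by its end instead: one hour (3600 seconds) later
bin_end <- bin_start + 3600

format(bin_start, "%H:%M")  # "00:00" "00:00" "01:00"
format(bin_end, "%H:%M")    # "01:00" "01:00" "02:00"
```

The sums are the same either way; only the label changes. Note that a row stamped exactly on the hour, say 01:00, gets the label 02:00 here, whereas ceiling_date would leave it at 01:00.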
0

Two steps are required. 1) Round the Time down to the start of its 1-hour window:

library(data.table)
library(lubridate)

data = data.table(
  Time = c('01/01/2015 00:30', '01/01/2015 00:40', '01/01/2015 00:55',
           '01/01/2015 01:17', '01/01/2015 01:37'),
  Station = c('A', 'A', 'B', 'A', 'A'),
  count_starts = c(2, 2, 1, 3, 1),
  count_ends = c(3, 1, 1, 1, 1)
)

data[, Time_conv := as.POSIXct(strptime(Time, '%d/%m/%Y %H:%M'))]
data[, Time_round := floor_date(Time_conv, unit = "1 hour")]

2) Aggregate the data table by hour and station to get the desired result:

New_data=data[,list(count_starts_sum=sum(count_starts),count_ends_sum=sum(count_ends)),by=c('Time_round','Station')]
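For comparison, here is a sketch of the same two steps that labels each bin by the end of the hour, so the result matches the table in the question. It assumes the timestamps are day-first; Hour_end and hourly are illustrative names, and swapping in ceiling_date and keyby is one possible variation rather than the answer's original approach.

```r
library(data.table)
library(lubridate)

data <- data.table(
  Time = c("01/01/2015 00:30", "01/01/2015 00:40", "01/01/2015 00:55",
           "01/01/2015 01:17", "01/01/2015 01:37"),
  Station = c("A", "A", "B", "A", "A"),
  count_starts = c(2, 2, 1, 3, 1),
  count_ends = c(3, 1, 1, 1, 1)
)

# Parse the strings (assumed day-first) and round each timestamp up to the hour
data[, Hour_end := ceiling_date(dmy_hm(Time), unit = "hour")]

# Sum both count columns per hour and station; keyby also sorts by the grouping columns
hourly <- data[, .(count_starts_sum = sum(count_starts),
                   count_ends_sum = sum(count_ends)),
               keyby = .(Hour_end, Station)]
hourly
```

This gives three rows: station A and station B for the hour ending 01:00, and station A for the hour ending 02:00.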
www
Rohit Mishra