
I have 30-second granular data from a bunch of servers. I would like to roll this data up to 15-minute intervals for each server.

My data frame is like this:

dput(p)

structure(list(DATE = c("2013-04-15   02:47:32", "2013-04-15   02:48:02", 
"2013-04-15   02:48:32", "2013-04-15   02:49:02", "2013-04-15   02:49:32", 
"2013-04-15   02:50:02", "2013-04-15   02:50:32", "2013-04-15   02:51:02", 
"2013-04-15   02:51:32", "2013-04-15   02:52:02", "2013-04-15   02:52:32", 
"2013-04-15   02:53:02", "2013-04-15   02:53:32", "2013-04-15   02:54:02", 
"2013-04-15   02:54:32", "2013-04-15   02:55:02", "2013-04-15   02:55:32", 
"2013-04-15   02:56:02", "2013-04-15   02:56:32", "2013-04-15   02:57:02", 
"2013-04-29   17:33:07", "2013-04-29   17:33:37", "2013-04-29   17:34:07", 
"2013-04-29   17:34:37", "2013-04-29   17:35:07", "2013-04-29   17:35:37", 
"2013-04-29   17:36:07", "2013-04-29   17:36:37", "2013-04-29   17:37:07", 
"2013-04-29   17:37:37", "2013-04-29   17:38:07", "2013-04-29   17:38:37", 
"2013-04-29   17:39:07", "2013-04-29   17:39:37", "2013-04-29   17:40:07", 
"2013-04-29   17:40:37", "2013-04-29   17:41:07", "2013-04-29   17:41:37", 
"2013-04-29   17:42:07", "2013-04-29   17:42:37"), Server = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("ServerA", "ServerB"), class = "factor"), 
    CPU = c(70L, 71L, 72L, 72L, 72L, 73L, 73L, 74L, 73L, 73L, 
    73L, 73L, 71L, 74L, 72L, 72L, 70L, 72L, 71L, 70L, 78L, 79L, 
    79L, 78L, 79L, 77L, 78L, 80L, 81L, 80L, 80L, 79L, 79L, 79L, 
    81L, 79L, 78L, 79L, 79L, 79L)), .Names = c("DATE", "Server", 
"CPU"), class = "data.frame", row.names = c(NA, -40L))

Is there an easy way to roll the 30-second data up to 15-minute data for each server? I can have more than 2 servers in this data frame.

For example, if my data is as follows (30-second samples), I need to average the CPU data over every 15 minutes.

      DATE       SERVER CPU
1 2013-04-15 02:47:32 ServerA 70
2 2013-04-15 02:48:02 ServerA 71
3 2013-04-15 02:48:32 ServerA 72
4 2013-04-15 02:49:02 ServerA 72
5 2013-04-15 02:49:32 ServerA 72
6 2013-04-15 02:50:02 ServerA 73
   :
user1471980
  • What do you mean by "roll to 15 minute data"? Do you want the last row of each 15 minute period by server? Or do you want to change the timestamps? Please show the desired result. – GSee May 01 '13 at 00:34
  • maybe this question & answer will help you? http://stackoverflow.com/questions/1769365/how-to-remove-partial-duplicates-from-a-data-frame – mariotomo May 01 '13 at 10:27
  • @gsee, roll over means average 30 seconds data into 15 minute data for each server for the entire data frame – user1471980 May 01 '13 at 11:47
  • Is it rolling in the sense that every 30 seconds, you drop the first 30 seconds and add the most recent 30 seconds so that you have a 15 minute rolling average, or is it 15 minute buckets that do not overlap? – GSee May 01 '13 at 15:23

3 Answers


First, cast your string to class POSIXct:

as.POSIXct(strptime("2013-04-15 02:47:32", "%Y-%m-%d %H:%M:%S"))

Next, unclass it to get epoch (number of seconds since 1970-01-01):

unclass(as.POSIXct(strptime("2013-04-15 02:47:32", "%Y-%m-%d %H:%M:%S")))

Lastly, truncate the seconds beyond the last 15-minute interval (15*60 seconds):

floor(unclass(as.POSIXct(strptime("2013-04-15 02:47:32", 
                                  "%Y-%m-%d %H:%M:%S"))
             ) / (15*60)
     ) * (15*60)

All together on data frame:

as.POSIXct(floor(unclass(as.POSIXct(strptime("2013-04-15   02:47:32",
                                             "%Y-%m-%d %H:%M:%S"))) / (15*60)) * (15*60),
           origin='1970-01-01 00:00.00 UTC')
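Putting the steps above together on a data frame: a minimal sketch (using a small hypothetical subset of the question's data; column names follow the `dput` output) that floors each timestamp to its 15-minute bucket and then averages CPU per server per bucket with `aggregate`:

```r
# Hypothetical sample in the question's shape
p <- data.frame(
  DATE   = c("2013-04-15 02:47:32", "2013-04-15 02:48:02",
             "2013-04-29 17:33:07", "2013-04-29 17:33:37"),
  Server = c("ServerA", "ServerA", "ServerB", "ServerB"),
  CPU    = c(70, 71, 78, 79),
  stringsAsFactors = FALSE
)

# Epoch seconds, floored to the enclosing 15-minute boundary
secs     <- unclass(as.POSIXct(strptime(p$DATE, "%Y-%m-%d %H:%M:%S")))
p$bucket <- as.POSIXct(floor(secs / (15*60)) * (15*60),
                       origin = "1970-01-01")

# Mean CPU per server per 15-minute bucket
aggregate(CPU ~ bucket + Server, data = p, FUN = mean)
```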
topchef
  • @topchef, I am not following you. I need to aggregate my 30 seconds data to 15 minute data for each server. I am not sure you answer addresses that? – user1471980 May 01 '13 at 13:13
  • `floor(x / y)` can be written as `x %/% y` – Matthew Lundberg May 01 '13 at 14:27
  • @user1471980 - right, I didn't aggregate. It does everything you need before the actual aggregation by server name and new time column (truncated to 15-minute intervals). In this case you don't need the last cast to *as.POSIXct* - a numeric value suffices for aggregation. – topchef May 01 '13 at 20:44
0

What I would do:

As topchef suggests, work with POSIXct, not with strings. So, once I store your data in L, my structure would look like yours, but instead of your DATE column I would have ts, obtained as topchef suggests:

L$ts <- as.POSIXct(L$DATE)

You want to aggregate values, so it seems quite natural to me to add the aggregation key to the data:

baseSecond <- function(x, seconds) { 
  as.POSIXct(floor(unclass(x) / seconds) * seconds,
             origin='1970-01-01 00:00.00 UTC')
}

L$base <- baseSecond(L$ts, 15*60)

To complete the task, I'd use the standard aggregate function:

aggregate(L$CPU, by=list(base=L$base, Server=L$Server), FUN=mean)

The third parameter (FUN) lets you choose how to aggregate the data.
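An end-to-end sketch of this approach on a tiny hypothetical sample (column names follow the question's data frame):

```r
# Hypothetical sample: three readings for one server
L <- data.frame(
  DATE   = c("2013-04-15 02:47:32", "2013-04-15 02:48:02",
             "2013-04-15 03:01:00"),
  Server = c("ServerA", "ServerA", "ServerA"),
  CPU    = c(70, 72, 80),
  stringsAsFactors = FALSE
)

# Floor a POSIXct vector to the enclosing interval of `seconds` seconds
baseSecond <- function(x, seconds) {
  as.POSIXct(floor(unclass(x) / seconds) * seconds,
             origin = "1970-01-01 00:00.00 UTC")
}

L$ts   <- as.POSIXct(L$DATE)
L$base <- baseSecond(L$ts, 15 * 60)

# Mean CPU per server per 15-minute bucket: the first two readings share
# one bucket, the third falls in the next.
aggregate(CPU ~ base + Server, data = L, FUN = mean)
```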

mariotomo
  • I am not sure I am following your answer. I have a data frame that has 30-second data for a bunch of servers for CPU utilization. I would like to aggregate this 30-second data into 15-minute data for each server. – user1471980 May 01 '13 at 13:16
  • I'm not so sure which point you are missing. did you check the help for `aggregate`? you name it in your question title and do not use it in your own answer. – mariotomo May 01 '13 at 16:16
  • @mariotomo, `period.apply()` performs the aggregation. – Mark Rajcok Aug 16 '15 at 02:53
0

I came up with a solution like this; there may be better and faster solutions, but this works for now:

apply.periodly <- function (x, FUN, period, k=1, ...) 
{
  if (!require("xts")) {
    stop("Need 'xts'")
  }
  ep <- endpoints(x, on=period, k=k)
  period.apply(x, ep, FUN, ...)
}

total_df <- data.frame(DATE=as.POSIXct(character()), CPU=as.numeric(character()),  SERVER=character())


servers <- unique(x$SERVER)   # x$DATE must already be POSIXct

for (i in seq_along(servers)) {

    y <- subset(x, SERVER == servers[i])
    mydata.xts <- xts(y$CPU, order.by = y$DATE)
    mydata.15M <- apply.periodly(x = mydata.xts, FUN = mean, period = "minutes", k = 15)

    new_df <- data.frame(date = index(mydata.15M), coredata(mydata.15M))
    colnames(new_df) <- c("DATE", "CPU")
    new_df$SERVER <- as.character(servers[i])

    total_df <- rbind(total_df, new_df)

}
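The per-server loop can also be avoided entirely in base R (no xts): `cut()` assigns each timestamp to a 15-minute interval and `aggregate()` averages per server and interval. A sketch on hypothetical sample data (`dat` is an illustrative name); one caveat is that `cut.POSIXt` starts its intervals at the first observation's minute, so buckets need not align to clock :00/:15/:30/:45 boundaries:

```r
# Hypothetical sample with two servers
dat <- data.frame(
  DATE   = as.POSIXct(c("2013-04-15 02:47:32", "2013-04-15 02:48:02",
                        "2013-04-29 17:33:07", "2013-04-29 17:33:37")),
  SERVER = c("ServerA", "ServerA", "ServerB", "ServerB"),
  CPU    = c(70, 71, 78, 79)
)

# Bucket timestamps into 15-minute intervals, then average per server/bucket
dat$bucket <- cut(dat$DATE, breaks = "15 mins")
aggregate(CPU ~ bucket + SERVER, data = dat, FUN = mean)
```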

user1471980