0

I'd first like to describe my problem: What i want to do is to calculate the number of spikes on prices in a 24 hour window, while I possess half hourly data.

I have seen all Stackoverflow posts like e.g. this one: Rollapply for time series

(If there are more relevant ones, please let me know ;) )

As I cannot and probably also should not upload my data, here's a minimal example: I simulate a random variable, convert it to an xts object, and use a user defined function to detect "spikes" (of course pretty ridiculous in this case, but illustrates the error).

library(xts)
##########Simulate y as a random variable
y <- rnorm(n=100)
##########Add a date variable so i can convert it to a xts object later on
yDate <- as.Date(1:100)
##########bind both variables together and convert to a xts object
z <- cbind(yDate,y)
z <- xts(x=z, order.by=yDate)
##########use the rollapply function on the xts object:
x <- rollapply(z, width=10, FUN=mean)

The function works as it is supposed to: it takes the 10 preceding values and calculates the mean.

Then, I defined an own function to find peaks: A peak is a local maximum (higher than m points around it) AND is at least as big as the mean of the timeseries+h. This leads to:

find_peaks <- function (x, m,h){
  shape <- diff(sign(diff(x, na.pad = FALSE)))
  pks <- sapply(which(shape < 0), FUN = function(i){
    z <- i - m + 1
    z <- ifelse(z > 0, z, 1)
    w <- i + m + 1
    w <- ifelse(w < length(x), w, length(x))
    if(all(x[c(z : i, (i + 2) : w)] <= x[i + 1])&x[i+1]>mean(x)+h) return(i + 1) else return(numeric(0))
  })
  pks <- unlist(pks)
  pks
}

And works fine: Back to the example:

plot(yDate,y)
#Is supposed to find the points which are higher than 3 points around them
#and higher than the average:
#Does so, so works.
points(yDate[find_peaks(y,3,0)],y[find_peaks(y,3,0)],col="red")

However, using the rollapply() function leads to:

x <- rollapply(z,width = 10,FUN=function(x) find_peaks(x,3,0))
#Error in `[.xts`(x, c(z:i, (i + 2):w)) : subscript out of bounds 

I first thought, well, maybe the error occurs because for it might run int a negative index for the first points, because of the m parameter. Sadly, setting m to zero does not change the error.

I have tried to trace this error too, but do not find the source. Can anyone help me out here?

Edit: A picture of spikes:Spikes on the australian Electricity Market. find_peaks(20,50) determines the red points to be spikes, find_peaks(0,50) additionally finds the blue ones to be spikes (therefore, the second parameter h is important, because the blue points are clearly not what we want to analyse when we talk about spikes).

Community
  • 1
  • 1
fußballball
  • 329
  • 1
  • 14
  • 1
    I am confused as to what the goal is here. Are you trying to locate peaks based on the overall mean and then using that with the few points around a given value? Your code errors at your `if` statement. In your `xts` object you have two columns, therefore the indices that are calculated `c(z : i, (i + 2) : w)` are `> 100`. The subsetting operator `[.xts` tries to take rows based on index and there are `<100` rows. – jamieRowen Feb 04 '16 at 13:01
  • It would also appear that the relation operators are not doing as you might expect here with an `xts` object – jamieRowen Feb 04 '16 at 13:21
  • I am sorry, I'll try to express myself in a better way: The peak function is supposed to find peaks. Peaks are defined as points being bigger than m points in their surroundings, and (because in periods with low volatility these points may be very low) have to surpass a threshold. The overall goal is to determine the number of peaks in a 24 hour window or on a given day, which should in the end be given by length(find_peaks). – fußballball Feb 04 '16 at 14:19
  • I think my answer below addresses this question – jamieRowen Feb 04 '16 at 14:21

1 Answers1

0

I'm still not entirely sure what it is that you are after. On the assumption that given a window of data you want to identify whether its center is greater than the rest of the window at the same time as being greater than the mean of the window + h then you could do the following:

peakfinder = function(x,h = 0){
  xdat = as.numeric(x)
  meandat = mean(xdat)
  center = xdat[ceiling(length(xdat)/2)]
  ifelse(all(center >= xdat) & center >= (meandat + h),center,NA)
}

y <- rnorm(n=100)
z = xts(y, order.by = as.Date(1:100))
plot(z)
points(rollapply(z,width = 7, FUN = peakfinder, align = "center"), col = "red", pch = 19)

Although it would appear to me that if the center point is greater than it's neighbours it is necessarily greater than the local mean too so this part of the function would not be necessary if h >= 0. If you want to use the global mean of the time series, just substitute the calculation of meandat with the pre-calculated global mean passed as an argument to peakfinder.

jamieRowen
  • 1,509
  • 9
  • 14
  • I greatly appreciate your efforts, i think i expressed myself very poorly. I do not try to find out if the median is greater than the mean. The overall goal is to determine the number of spikes in a given intervall. This may be done in whichever way you might suggest, I thought I would simply use the find_spikes function i wrote and then use length(find_spikes) to determine the number of spikes. I included a picture of "spikes" in my old post. – fußballball Feb 04 '16 at 14:22
  • Does it solve your problem? If so do you mind accepting the answer, if not then let me know what the problem is and I will try to fix it. – jamieRowen Feb 04 '16 at 14:23
  • First of all, thank you very much. Still, as i am quite new to programming languages, i have a hard time understanding what your code does. It seems to determine the "center" (median of dates if you will) and mark it as a peak if its values is bigger than that of all other values in the interval, and bigger than the average+h. After this logic, there could be only one peak per interval. Do you have any idea regarding how to modify your approach so that there can be as many peaks per interval as wished (i.e. even infinitely many) – fußballball Feb 04 '16 at 14:33
  • I guess that would depend on how you define multiple peaks per interval. The code in the answer uses the intervals to check the 3 points either side to identify if a point is a peak or not. So the code returns values for the peaks and `NA` otherwise. The number of peaks in the full time interval is the number of values that aren't `NA`, you can use `sum(!is.na(x))` on the answer to find how many. How would you want to define 2 or more peaks in a given interval? – jamieRowen Feb 04 '16 at 14:49
  • Right now I'd say my question is answered. I'll have to sit back and think some time about the next steps, then I'll try to get it done myself, if it doesn't work until saturday I'll come back to this forum ;) Thank you very much – fußballball Feb 04 '16 at 15:03