5

I am analyzing data from a wind turbine, normally this is the sort of thing I would do in excel but the quantity of data requires something heavy-duty. I have never used R before and so I am just looking for some pointers.

The data consists of 2 columns WindSpeed and Power, so far I have arrived at importing the data from a CSV file and scatter-plotted the two against each other.

What I would like to do next is to sort the data into ranges; for example all data where WindSpeed is between x and y and then find the average of power generated for each range and graph the curve formed.

From this average I want recalculate the average based on data which falls within one of two standard deviations of the average (basically ignoring outliers).

Any pointers are appreciated.

For those who are interested I am trying to create a graph similar to this. Its a pretty standard type of graph but like I said the shear quantity of data requires something heavier than excel.

klonq
  • 3,535
  • 4
  • 36
  • 58

5 Answers5

5

Since you're no longer in Excel, why not use a modern statistical methodology that doesn't require crude binning of the data and ad hoc methods to remove outliers: locally smooth regression, as implemented by loess.

Using a slight modification of csgillespie's sample data:

w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)

plot(w_sp, power)

x_grid <- seq(0, 100, length = 100)
lines(x_grid, predict(loess(power ~ w_sp), x_grid), col = "red", lwd = 3)
hadley
  • 102,019
  • 32
  • 183
  • 245
  • Thanks, I have gone with this solution. As it has given me the correct results based on my test case. – klonq Jan 30 '11 at 15:20
  • I tried modelling this to real data and am not entirely happy with the result. Unfortunately I cannot publish the data, but I have made the graph available at http://www.myimagespace.com/public/view/full/5617. Although its the best solution so far it doesn't really relate closely to the data. How can I 'tweak' the code to get a better fitting curve? – klonq Feb 01 '11 at 09:05
  • @klonq my immediate guess would be that you probably can't, without introducing other problems. The easiest way to get these local models to fit the data better is to make them more local (decrease `span` in `loess()` or increase `k` in `gam()`. Quite often though, the increased complexity fits the data better in some areas, but overfits in others. Hence the adaptive smoother I tried in my example, where smoothness/roughness is varied over the range of the fit; the curve can be rough where the relationship is changing and smooth where there is no or little change. – Gavin Simpson Feb 01 '11 at 11:45
2

First we will create some example data to make the problem concrete:

w_sp = sample(seq(0, 100, 0.01), 1000)
power = 1/(1+exp(-(rnorm(1000, mean=w_sp, sd=5) -40)/5))

Suppose we want to bin the power values between [0,5), [5,10), etc. Then

bin_incr = 5
bins = seq(0, 95, bin_incr)
y_mean = sapply(bins, function(x) mean(power[w_sp >= x & w_sp < (x+bin_incr)]))

We have now created the mean values between the ranges of interest. Note, if you wanted the median values, just change mean to median. All that's left to do, is to plot them:

plot(w_sp, power)
points(seq(2.5, 97.5, 5), y_mean, col=3, pch=16)

To get the average based on data that falls within two standard deviations of the average, we need to create a slightly more complicated function:

noOutliers = function(x, power, w_sp, bin_incr) {
  d = power[w_sp >= x & w_sp < (x + bin_incr)]
  m_d = mean(d)
  d_trim = mean(d[d > (m_d - 2*sd(d)) & (d < m_d + 2*sd(d))])
  return(mean(d_trim))
}

y_no_outliers = sapply(bins, noOutliers, power, w_sp, bin_incr)
csgillespie
  • 59,189
  • 14
  • 150
  • 185
2

Throw this version, similar in motivation as @hadley's, into the mix using an additive model with an adaptive smoother using package mgcv:

Dummy data first, as used by @hadley

w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
df <- data.frame(power = power, w_sp = w_sp)

Fit the additive model using gam(), using an adaptive smoother and smoothness selection via REML

require(mgcv)
mod <- gam(power ~ s(w_sp, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)

Predict from our model and get standard errors of fit, use latter to generate an approximate 95% confidence interval

x_grid <- with(df, data.frame(w_sp = seq(min(w_sp), max(w_sp), length = 100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
x_grid <- within(x_grid, upr <- fit + 2 * pred$se.fit)
x_grid <- within(x_grid, lwr <- fit - 2 * pred$se.fit)

Plot everything and the Loess fit for comparison

plot(power ~ w_sp, data = df, col = "grey")
lines(fit ~ w_sp, data = x_grid, col = "red", lwd = 3)
## upper and lower confidence intervals ~95%
lines(upr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
lines(lwr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
## add loess fit from @hadley's answer
lines(x_grid$w_sp, predict(loess(power ~ w_sp, data = df), x_grid), col = "blue",
      lwd = 3)

adaptive smooth and loess fits

Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
  • Thanks Gavin this is a much nicer solution. However I can't get it working (1 Error, 1 Warning) – klonq Feb 01 '11 at 08:10
  • Error in eval(predvars, data, env) : numeric 'envir' arg not of length one – klonq Feb 01 '11 at 08:10
  • Caused by line pred <- predict(mod, x_grid, se.fit = TRUE) and followed by the Warning message : In predict.gam(mod, x_grid, se.fit = TRUE) : not all required variables have been supplied in newdata! (I am using real data, not dummy data) – klonq Feb 01 '11 at 08:16
  • @klonq apologies, there is a missing line before the one you quote. Will edit the answer to rectify this. – Gavin Simpson Feb 01 '11 at 09:31
  • Hi, I have been struggling with this error a few times today I wonder if you can help. I can't even find where things are going wrong the error is "Error in splineDesign(knots, x, ord, derivs, outer.ok = outer.ok) : the 'x' data must be in the range -0.0452226 to 22.6226 unless you set 'outer.ok = TRUE'" I thought I had it resolved at first by inserting a line into my csv where both columns have value of 0, but now the error is referring to a negative number and I have no negatives in my data – klonq Feb 01 '11 at 14:15
  • You are predicting outside the range of the data. Check your data and the predict range. Note that the `seq()` function can generate data that might not be exactly contained within the range of the observed data, due to floating point-ness. Just shrink the range a bit in the `seq(....)` line, e.g. `x_grid <- with(df, data.frame(w_sp = seq(min(w_sp) + 0.0001, max(w_sp) - 0.0001, length = 100)))`. – Gavin Simpson Feb 01 '11 at 14:21
1

Here are some examples of fitted curves (weibull analysis) for commercial turbines:

http://www.inl.gov/wind/software/

http://www.irec.cmerp.net/papers/WOE/Paper%20ID%20161.pdf

http://www.icaen.uiowa.edu/~ie_155/Lecture/Power_Curve.pdf

bill_080
  • 4,692
  • 1
  • 23
  • 30
0

I'd recommend also playing around with Hadley's own ggplot2. His website is a great resource: http://had.co.nz/ggplot2/ .

    # If you haven't already installed ggplot2:
    install.pacakges("ggplot2", dependencies = T)

    # Load the ggplot2 package
    require(ggplot2)

    # csgillespie's example data
    w_sp <- sample(seq(0, 100, 0.01), 1000)
    power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)

    # Bind the two variables into a data frame, which ggplot prefers
    wind <- data.frame(w_sp = w_sp, power = power)

    # Take a look at how the first few rows look, just for fun
    head(wind)


    # Create a simple plot
    ggplot(data = wind, aes(x = w_sp, y = power)) + geom_point() + geom_smooth()

    # Create a slightly more complicated plot as an example of how to fine tune
    # plots in ggplot
    p1 <- ggplot(data = wind, aes(x = w_sp, y = power))
    p2 <- p1 + geom_point(colour = "darkblue", size = 1, shape = "dot") 
    p3 <- p2 + geom_smooth(method = "loess", se = TRUE, colour = "purple")
    p3 + scale_x_continuous(name = "mph") + 
             scale_y_continuous(name = "power") +
             opts(title = "Wind speed and power")
jthetzel
  • 3,603
  • 3
  • 25
  • 38