I have developed a system in R for graphing large datasets obtained from wind turbines. I am now porting the process into Java. The results I get between the two systems are inconsistent.

As shown below:

  • The dataset is plotted first using R, and second using JFreeChart.
  • The red line in each graph corresponds to my calculation in the respective language (detailed below).
  • The brown dashed line in #1 corresponds to the blue line in #2; there are no discrepancies here, and they are provided for reference.
  • The shaded areas represent the data points, grey in #1 and red in #2.

[Figure #1: Dataset graphed using R] [Figure #2: Dataset graphed using JFreeChart]

I can explain the discrepancies between the (red) calculated lines: I am using a different calculation method in each language.

In R the data is processed as follows. I wrote this code with a little help and have no idea what is going on here (but hey, it works).

df <- data.frame(pwr = pwr, spd = spd)
require(mgcv)
mod <- gam(pwr ~ s(spd, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)
x_grid <- with(df, data.frame(spd = seq(min(spd) + 0.0001, maxi, length=100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
lines(fit ~ spd, data = x_grid, col = "red", lwd = thickLineWidth)

In Java (in SQL, in fact) I am using the method of bins to calculate the average at every 0.5 on the x-axis. The resulting data is plotted using an org.jfree.chart.renderer.xy.XYSplineRenderer; I do not know much about how the line is rendered.

SELECT 
    ROUND( ROUND( x_data * 2 ) / 2, 1)   AS x_axis, # See https://stackoverflow.com/questions/5230647/mysql-rounding-functions
    AVG( y_data )                        AS y_axis 
FROM 
    table 
GROUP BY 
    x_axis
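
For reference, the same binning step can be sketched in plain Java. This is only an illustration of the technique, not the code I actually run: the class name BinAverager and its methods are made up for the example, and it assumes non-negative x values (for which Math.round and MySQL's ROUND should agree, apart from floating-point edge cases at exact bin boundaries).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not part of JFreeChart or any library):
// mirrors the SQL GROUP BY above in plain Java.
class BinAverager {

    /** Snap x to the nearest multiple of 0.5, like ROUND(ROUND(x * 2) / 2, 1). */
    static double binCentre(double x) {
        return Math.round(x * 2.0) / 2.0;
    }

    /** Returns bin centre -> mean of y over the points in that bin, sorted by bin centre. */
    static Map<Double, Double> binAverages(double[] x, double[] y) {
        // Accumulate {sum, count} per bin.
        Map<Double, double[]> acc = new TreeMap<>();
        for (int i = 0; i < x.length; i++) {
            double[] sumCount = acc.computeIfAbsent(binCentre(x[i]), k -> new double[2]);
            sumCount[0] += y[i];
            sumCount[1] += 1;
        }
        // Convert each {sum, count} pair into a mean.
        Map<Double, Double> means = new TreeMap<>();
        acc.forEach((centre, sumCount) -> means.put(centre, sumCount[0] / sumCount[1]));
        return means;
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        double[] spd = {4.9, 5.1, 5.2, 5.6};
        double[] pwr = {10.0, 20.0, 30.0, 40.0};
        System.out.println(binAverages(spd, pwr)); // prints {5.0=20.0, 5.5=40.0}
    }
}
```

Each (bin centre, mean) pair would then become one point of the series handed to the renderer.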

My take on the variance between the two graphs:

  • Presence of a single outlier at 18 on the x_axis (most visible on the R graph) seems to have an enormous impact on the shape of the curve.
  • Even between 5 and 15 on the x-axis the line in the R graph seems more continuous; it doesn't change trajectory as readily as the one produced by Java.
  • The 'crater' evident at 18 on the Java x-axis has two 'mounds' on each side of it; I believe this is due to polynomial effects in the rendering system.

These are things that I would like to eliminate.
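
To illustrate the first point, here is a small Java sketch of how a single outlier can drag the average of one bin. The numbers are invented for the example and are not taken from my dataset; BinOutlierDemo is a made-up name. The median is shown only as a comparison, since it is one common statistic that is less sensitive to a lone extreme value; I am not claiming it is what the R code does.

```java
import java.util.Arrays;

// Hypothetical demo: effect of one outlier on a single bin's summary value.
class BinOutlierDemo {

    /** Arithmetic mean, as computed by AVG() in the SQL query. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Median of the values; the input array is left unmodified. */
    static double median(double[] values) {
        double[] sorted = Arrays.copyOf(values, values.length);
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        // Invented y values for one bin: four typical readings and one outlier.
        double[] bin = {1500, 1510, 1490, 1505, 200};
        System.out.println("mean   = " + mean(bin));   // prints mean   = 1241.0
        System.out.println("median = " + median(bin)); // prints median = 1500.0
    }
}
```

With few points in a bin, one stray reading moves the mean a long way, and a spline drawn through the binned means has to bend to reach it.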

So in an effort to understand the difference between the two graphs I have a few questions:

  • Exactly what is going on in my R script?
  • How can I port the same process to my Java code, and do I even want to?
  • Can anyone explain the spline system used by JFreeChart, and is there an alternative?
klonq
    If you want to know *exactly* what is going on in your R script you need to take a course on Generalized Additive Models, or Non-Parametric Smoothing. It's unlikely there's code to do this in Java - you'd have to write it yourself. – Spacedman Mar 08 '11 at 15:58

1 Answer

In the R code, you are (well, I was when I showed the example) fitting an additive model to the power and speed data, where the relationship between the variables is determined from the data themselves. These models use splines to estimate the response function. In particular, here I used an adaptive smoother, with k = 20 controlling the complexity of the smoother being fitted. The more complex the smoother, the more wiggly the fitted function can be. An adaptive smoother is one where the degree of smoothness varies across the fitted function.

Why is this important? Well, in your data there are ranges where the response does not vary with the speed variable, and ranges where the response changes rapidly with a change in speed. We have an "allowance" of wiggliness to use up over the curve. With ordinary splines the wiggliness (or smoothness) is the same across the entire function. With an adaptive smooth we can spend more of our wiggliness allowance in the parts of the function where the response is changing most, and none of it in the parts where the response isn't changing.

Below I annotate the code to explain what is being done at each step:

## here we create a data frame with the pwr and spd variables
df <- data.frame(pwr = pwr, spd = spd)

## we load the package containing the code to fit the additive model
require(mgcv)

## This is the model itself, saying pwr is modelled as a smooth function of spd
## and the smooth function of spd is generated using an adaptive smoother with
## an "allowance" of 20. This allowance is a starting point and the actual
## smoothness of the curve will be estimated as part of the model fitting,
## here using a REML criterion
mod <- gam(pwr ~ s(spd, bs = "ad", k = 20), data = df, method = "REML")

## This just summarises the model fit
summary(mod)

## In this line we are creating a new spd vector (in a data frame) that contains
## 100 equally spaced spd values over the entire range of the observed spd
x_grid <- with(df, data.frame(spd = seq(min(spd) + 0.0001, maxi, length=100)))

## we will use those data to get predictions of the response pwr at each
## of the 100 values of spd we just created
## I did this so we had enough data to plot a nice smooth curve, but without
## having to predict for all the observed values of spd
pred <- predict(mod, x_grid, se.fit = TRUE)

## This line stores the 100 predicted values in the prediction data object
x_grid <- within(x_grid, fit <- pred$fit)

## This line draws the fitted smooth on to a plot of the data
## this assumes there is already a plot on the active device.
lines(fit ~ spd, data = x_grid, col = "red", lwd = thickLineWidth)

If you are not familiar with additive models and smoothers/splines then I recommend Ruppert, Wand and Carroll (2003) Semiparametric Regression. Cambridge University Press.

Gavin Simpson