
I am looking for a way to score model fits of x,y data on a scale from 0 to 1, depending on which model the data fit best. If the data fit a linear model best, the score would come out close to 1; if they fit a log model best, close to 0; and quadratic would be intermediate (0.5, for example).

My problem is that the function below does not differentiate between linear and quadratic models. Any ideas on how to fix this? I might be overlooking something simple, or going about this the wrong way. I am using R; sample code and functions are below:

# Functions
# mix of linear and log terms: alpha = 1 is pure linear, alpha = 0 is pure log
sat <- function(x, alpha) alpha*x + (1-alpha)*log(x)

lik.sat <- function(alpha, x, y) as.numeric(logLik(lm(y ~ sat(x,alpha))))

est.sat <- function(x, y){
  fit <- optim(par=.5, fn=lik.sat, method="Brent", control=list(fnscale=-1), lower=0, upper=1, x=x, y=y)
  if(fit$convergence==0)
    return(fit$par)
  return(NA)
}

# Simulate data
#Linear model
x<- abs(rnorm(100))
lin.y <- x + rnorm(100, sd=.25)

#quad model
quad.y <- x + x^2 +  rnorm(100, sd=.25)

#log model
log.y <- log(x) + rnorm(100, sd=.25)

# Fit data
lik.sat(0, x, lin.y)
lik.sat(1, x, lin.y)
lik.sat(0, x, quad.y)   
lik.sat(1, x, quad.y)
lik.sat(0, x, log.y)
lik.sat(1, x, log.y)

# Estimate transformation
est.sat(x, lin.y)  ## close to 1
est.sat(x, quad.y)  #### DOESN'T WORK!
est.sat(x, log.y)  ## close to 0
Emma

1 Answer


This seems problematic because the quadratic model has 3 coefficients while the other two have 2, and sat(x, alpha) is a single regressor, so lm(y ~ sat(x, alpha)) can only ever fit 2 coefficients no matter what alpha is. Instead of trying to use a continuous parameter, just fit each of the 3 models and see which gives the lowest AIC. In each of the three examples below it chooses the correct form.

set.seed(123)
x<- abs(rnorm(100))
lin.y <- x + rnorm(100, sd=.25)
quad.y <- x + x^2 +  rnorm(100, sd=.25)
log.y <- log(x) + rnorm(100, sd=.25)
 
fos <- list(log = log.y ~ log(x), lin = log.y ~ x, quad = log.y ~ poly(x, 2))
sapply(fos, function(fo) AIC(lm(fo))) # log has lowest AIC
##       log       lin      quad 
##  17.15268 203.86473 136.72457 

fos <- list(log = lin.y ~ log(x), lin = lin.y ~ x, quad = lin.y ~ poly(x, 2))
sapply(fos, function(fo) AIC(lm(fo))) # lin has lowest AIC
##       log       lin      quad 
## 81.577530  3.408632  5.407171 

fos <- list(log = quad.y ~ log(x), lin = quad.y ~ x, quad = quad.y ~ poly(x, 2))
sapply(fos, function(fo) AIC(lm(fo))) # quad has lowest AIC
##      log      lin     quad 
## 313.2159 110.9217   1.0147 

Note that this can be made completely automatic, and it is preferable to using nls since lm is much more numerically stable.
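For comparison, here is a rough sketch of what the nls alternative might look like, estimating the mixing parameter and the regression coefficients in one nonlinear fit (the starting values and the port bounds are assumptions, not part of the original post); unlike the lm/AIC route above, it can fail to converge:

fit.nls <- try(nls(lin.y ~ b0 + b1 * (alpha * x + (1 - alpha) * log(x)),
                   start = list(b0 = 0, b1 = 1, alpha = 0.5),
                   algorithm = "port",
                   lower = c(-Inf, -Inf, 0), upper = c(Inf, Inf, 1)),
               silent = TRUE)
if (!inherits(fit.nls, "try-error")) coef(fit.nls)[["alpha"]]  # should be near 1 for linear data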

The function below returns the name of the chosen model, but it could readily be modified to return the formula or the lm object, depending on what is wanted. In those cases call it with lapply and Map instead of sapply and mapply (see the sketch after the anscombe example below).

opt <- function(x, y) {
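  # poly(x, 2) needs at least 3 distinct x values, so fall back to "lin" when there are fewer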
  nm <- if (length(unique(x)) < 3) {
   "lin"
  } else {
    fos <- list(log = y ~ log(x), lin = y ~ x, quad = y ~ poly(x, 2))
    s <- sapply(fos, function(fo) AIC(lm(fo)))
    names(s)[which.min(s)]
  }
  nm

  # optionally uncomment 1 of next 2 lines
  # fos[[nm]]
  # lm(fos[[nm]])

}

sapply(list(A = log.y, B = lin.y, C = quad.y), opt, x = x)
##      A      B      C 
##  "log"  "lin" "quad" 

Another example. This one uses the built-in anscombe data set.

mapply(opt, anscombe[1:4], anscombe[5:8])
##     x1     x2     x3     x4 
##  "log" "quad"  "lin"  "lin" 
G. Grothendieck
  • Thanks for your answer. I wanted to use a scale function because I am analysing many datasets (thousands) for best fit and wanted a simple way to visualise differences in trends across categories of datasets. AIC definitely works, but I would like to use a sliding 0-1 scale rather than grouping model data into log, linear, and quadratic categories (if possible). – Emma May 20 '23 at 09:06
  • I have added some code that shows that this can be made completely automatic. – G. Grothendieck May 20 '23 at 12:34