
I have a scatterplot in R with time on the x axis and cost on the y axis. I want to find a constant line (y=?) that will minimize the sum of variances from all these points to the constant line. The data isn't too important (example: mtcars data), but if you would like to reference something you can use the code below.

#mtcars
plot(mtcars$wt, mtcars$disp)

i=1
j=1

sum_df <- data.frame()

for(i in as.integer(min(mtcars$disp)):as.integer(max(mtcars$disp))){
  sum_var = list()
  for(j in 1:length(mtcars$disp)){
    sum_var[[j]] <- abs(i-mtcars$disp[j])
  }
  sum_var = do.call(rbind, sum_var)
  sum_var <- sum(sum_var[,1])

  new_sum <- rbind(sum_var,sum_df)

  sum_df <- new_sum
}
row.names(sum_df)=as.integer(min(mtcars$disp)):as.integer(max(mtcars$disp))
sum_df$best_line <- ifelse(min(sum_df[,1])==sum_df[,1], "Best Line", "")
colnames(sum_df) <- c("Disp", "Abs Sum of Var")

I know I could loop through different constant lines and find the sum of variances for each and then decide which line fits best. However, I have a lot of data points and I am already looping through a lot of graphs. Is there a better way to code this besides the brute force method?

Katie
  • Are you simply trying to do something along the lines of lm(mtcars$disp~1) or lm(mtcars$disp~mtcars$wt)? – greengrass62 Feb 24 '17 at 20:16
  • So, I am looking for a constant line that is the "best". This may not necessarily be the mean. So, I am trying to find something like lm(mtcars$disp~1), but that value (230.7, which is also the mean) is not the line that minimizes the sum of the variances for this data set. – Katie Feb 24 '17 at 20:58
  • What do you mean by "constant line"? Just a horizontal line? What do the `x` values even have to do with the result then? And why do you not think it's the mean? What exactly do you mean by the sum of variances? Does that just mean the absolute difference in y between the observation and the constant y? You're not squaring the distance or anything (which is far more common)? – MrFlick Feb 24 '17 at 21:33
  • Sorry, I thought I put more information in my original post. Yes, I want a horizontal line like y=3. I want the line to minimize the distance between each observation and the horizontal line, so really the x-values don't matter to me. I was asked to do it by the absolute difference between the y observation and the constant y line. That is the reason why I say it's not the mean, otherwise yes it would be the mean if I did square the distance. – Katie Feb 24 '17 at 21:57

1 Answer


Your use of the terms "variance" and "line" is a little strange to me. That said, I think what you're trying to do is find the value of x such that sum(abs(mtcars$disp-x)) is as small as possible. If that's the case, then you could try the following code:

myFunction <- function(x,a) {
  sum(abs(x-a))
}

optimize(myFunction, interval=c(range(mtcars$disp)) , tol = 1e-6, a=mtcars$disp)

This does not give the value that your code provides, but I hope it is along the lines of what you're looking for. (I got some inspiration from here.)
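As a cross-check, the sum of absolute deviations is minimized by the sample median, so median() gives an answer without any search. When the number of observations is even, as with the 32 rows of mtcars, every value between the two middle observations attains the same minimum, so the minimizer is not unique. The sketch below assumes the built-in mtcars data and uses a hypothetical helper name, abs_dev_sum, to compare median() with optimize():

```r
# Sum of absolute deviations of y from a horizontal line at height c0.
# (abs_dev_sum is a hypothetical helper name, not from the original post.)
abs_dev_sum <- function(c0, y) sum(abs(y - c0))

y <- mtcars$disp

# The sample median is one minimizer of the sum of absolute deviations
med <- median(y)

# Numerical search over the observed range, as in the answer above
fit <- optimize(abs_dev_sum, interval = range(y), y = y)

# With an even number of points, the two middle order statistics
# bound the interval on which the sum is minimal and constant
mid <- sort(y)[length(y) / 2 + 0:1]

print(c(median = med, optimize = fit$minimum))

# Compare the sums at the interval ends, the median, and the mean
candidates <- c(lower = mid[1], median = med, upper = mid[2], mean = mean(y))
print(sapply(candidates, abs_dev_sum, y = y))
```

If a single number is needed, median(y) is a convenient choice. The mean is included only for comparison: it lies outside the minimizing interval here, so its sum of absolute deviations is larger.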

greengrass62
  • Yes, I am looking for something like this, so thank you! I didn't know R had an optimize function. Do you know why your code is giving a different answer? It gives the same sum of the absolute values, but the chosen line value is different. – Katie Feb 27 '17 at 16:57
  • Although I am not certain, I believe the reason is that there might not be a unique solution to the problem. – greengrass62 Feb 27 '17 at 17:29
  • I just realized that my code adds the numbers in increasing order when it needs to be decreasing (since R is looping through it starting with i=71). The answers are the same! Thank you for your help! – Katie Feb 27 '17 at 20:27