DF

 times a   b   s  ex  
1   0 59 140 1e-4  1
2  20 59 140 1e-4  0 
3  40 59 140 1e-4  0
4  60 59 140 1e-4  2
5 120 59 140 1e-4 20
6 180 59 140 1e-4 30
7 240 59 140 1e-4 31
8 360 59 140 1e-4 37
9   0 60 140 1e-4  0
10 20 60 140 1e-4  0
11 40 60 140 1e-4  0
12 60 60 140 1e-4  0
13 120 60 140 1e-4 3300
14 180 60 140 1e-4 6600
15 240 60 140 1e-4 7700
16 360 60 140 1e-4 7700
# dput(DF) 
structure(list(times = c(0, 20, 40, 60, 120, 180, 240, 360, 0, 
20, 40, 60, 120, 180, 240, 360), a = c(59, 59, 59, 59, 59, 59, 
59, 59, 60, 60, 60, 60, 60, 60, 60, 60), b = c(140, 140, 140, 
140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140, 140
), s = c(1e-04, 1e-04, 1e-04, 1e-04, 1e-04, 1e-04, 1e-04, 1e-04, 
1e-04, 1e-04, 1e-04, 1e-04, 1e-04, 1e-04, 1e-04, 1e-04), ex = c(1, 
0, 0, 2, 20, 30, 31, 37, 0, 0, 0, 0, 3300, 6600, 7700, 7700)), .Names = c("times", 
"a", "b", "s", "ex"), row.names = c(NA, 16L), class = "data.frame")

DF2

prime    times       mean     
 g1          0  1.0000000 
 g1         20  0.7202642 
 g1         40  0.8000305 
 g1         60  1.7430986 
 g1        120 16.5172242 
 g1        180 25.6521268         
 g1        240 33.9140056 
 g1        360 34.5735984 
 #dput(DF2)
 structure(list(times = c(0, 20, 40, 60, 120, 180, 240, 360), 
mean = c(1, 0.7202642, 0.8000305, 1.7430986, 16.5172242, 
25.6521268, 33.9140056, 34.5735984)), .Names = c("times", 
"mean"), row.names = c(NA, -8L), class = "data.frame")

DF is an example of a larger data frame that actually has hundreds of combinations of 'a', 'b', and 's' values, each of which results in different 'ex' values. What I want to do is find the combination of 'a', 'b', and 's' whose 'ex' values (DF) best fit the 'mean' values (DF2) at equivalent 'times'. The fit compares 8 values at a time (i.e., times == c(0,20,40,60,120,180,240,360)).

In this example, I would want 59, 140, and 1e-4 for the 'a', 'b', and 's' values, because those 'ex' values (DF) best fit the 'mean' values (DF2).

In other words, I would like the 'a', 'b', and 's' values for which 'ex' (DF) best fits 'mean' (DF2).

Since I want a single combination of 'a', 'b', and 's' values, I assume a linear least squares fit would be best. I would be comparing 8 values at a time, where 'times' runs from 0 to 360. I don't want the 'a', 'b', and 's' values that work best for each individual time point; I want the ones for which all 8 'ex' values (DF) best fit all 8 'mean' values (DF2). This is where I need help.

I have never used linear least squares fitting, but I assume what I'm trying to do is possible.

      lm(DF2$mean ~ DF$ex,....) # I'm not sure whether I should combine the two
      # data frames first and use that as my data argument, or
      # where I would include 'times' as the point of comparison,
      # or whether that would go in subset?
PeeHaa
Doug

1 Answer

It sounds like a linear model is not what you need here. In the best case, a linear model will give you a linear combination of different a/b/s configurations, not the single best-matching combination; hence the term "linear" in the name.
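
To see what that means in practice, here is a minimal sketch using the example data from the question. The names `d`, `ex59`, `ex60`, and `fit` are made up for this illustration, and treating each a/b/s configuration as its own predictor is just one assumed way of setting up the regression:

```r
# Example data from the question: one ex series per a/b/s configuration
d <- data.frame(
  mean = c(1, 0.7202642, 0.8000305, 1.7430986,
           16.5172242, 25.6521268, 33.9140056, 34.5735984),
  ex59 = c(1, 0, 0, 2, 20, 30, 31, 37),          # a = 59
  ex60 = c(0, 0, 0, 0, 3300, 6600, 7700, 7700)   # a = 60
)

# Regress the target means on both configurations at once
fit <- lm(mean ~ ex59 + ex60, data = d)

# An intercept plus one weight per configuration
coef(fit)
```

The result is a blend of the configurations (an intercept and a weight for each), which does not directly say which single configuration is closest to the means.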

I take it that you have some guarantee that the times values of DF will match the times values of DF2. A first step might be turning DF into a data frame with only one row per a/b/s combination, with the different ex values stored as the columns of a matrix. Then, for each row, you'd subtract the ex values from the DF2$mean values, square those differences, and add them together to compute a single squared error for the row. Then simply select the row with the minimal value.
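
Written out as a formula (a sketch; the indices $j$ for the a/b/s combination and $i$ for the time point are notation introduced here, not taken from the question), the quantity computed per row and the selected row are:

```latex
\mathrm{SSE}_j = \sum_{i=1}^{8} \left( \mathit{ex}_{j,i} - \mathit{mean}_i \right)^2,
\qquad
j^{*} = \operatorname*{arg\,min}_{j} \; \mathrm{SSE}_j
```

Here $\mathit{ex}_{j,i}$ is the ex value of combination $j$ at the $i$-th time point and $\mathit{mean}_i$ is the DF2 mean at that same time point.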

The above solution is pretty vague. There are a million ways to actually implement this, and instead of copying my solution, you might be better off writing it yourself, in the way you understand it best. Some hints on how to achieve the individual steps:

  • matrix(DF$ex, byrow=TRUE, ncol=8) builds the matrix of ex values, one row per a/b/s combination (this relies on DF being ordered in blocks of 8 rows, one block per combination)
  • DF[seq(from=1, to=nrow(DF), by=8),2:4] will provide the a/b/s values corresponding to each of the matrix rows
  • cbind can be used to combine these two
  • matrix(DF2$mean, byrow=TRUE, ncol=8, nrow=nrow(DF)/8) will turn those means into a matrix which you can simply subtract
  • ^2 (or the equivalent **2) will square all components of a matrix
  • rowSums will add the elements of a row of a matrix
  • which.min will return the index of the minimal value
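
As a step-by-step sketch of these hints with intermediate variables: the names `n_times`, `ex_mat`, `params`, `mean_mat`, `sse`, and `best` are made up for illustration, and the ordering check is an added assumption, not something the question guarantees.

```r
# Example data from the question
times <- c(0, 20, 40, 60, 120, 180, 240, 360)
DF <- data.frame(
  times = rep(times, 2),
  a     = rep(c(59, 60), each = 8),
  b     = 140,
  s     = 1e-4,
  ex    = c(1, 0, 0, 2, 20, 30, 31, 37,
            0, 0, 0, 0, 3300, 6600, 7700, 7700)
)
DF2 <- data.frame(
  times = times,
  mean  = c(1, 0.7202642, 0.8000305, 1.7430986,
            16.5172242, 25.6521268, 33.9140056, 34.5735984)
)

n_times <- nrow(DF2)

# Assumption: DF is ordered in blocks, each block listing the DF2 times in order
stopifnot(nrow(DF) %% n_times == 0,
          all(DF$times == rep(DF2$times, length.out = nrow(DF))))

# One row per a/b/s combination, one column per time point
ex_mat <- matrix(DF$ex, byrow = TRUE, ncol = n_times)

# a/b/s values belonging to each matrix row (first row of every block)
params <- DF[seq(from = 1, to = nrow(DF), by = n_times), c("a", "b", "s")]

# The means repeated on every row, so the two matrices line up
mean_mat <- matrix(DF2$mean, byrow = TRUE, ncol = n_times, nrow = nrow(ex_mat))

# Sum of squared differences per combination
sse <- rowSums((ex_mat - mean_mat)^2)

# Combination with the smallest squared error
best <- params[which.min(sse), ]
best
```

Keeping `sse` around as its own vector also makes it easy to inspect the runners-up, for example with `cbind(params, sse)[order(sse), ]`.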

Putting it all together in one possible way, putting everything in a single expression without using intermediate variables (not the most readable solution):

# a/b/s of each 8-row block, indexed by the block with the smallest squared error
DF[seq(from=1, to=nrow(DF), by=8),2:4][which.min(
  rowSums((matrix(DF$ex, byrow=TRUE, ncol=8) -
           matrix(DF2$mean, byrow=TRUE, ncol=8, nrow=nrow(DF)/8)
          )^2
         )
),]

If you don't store the matrix as part of a data frame, you might want to transpose it to avoid those byrow=TRUE arguments and leverage the fact that a vector will be repeated for every column in a matrix-vector subtraction:

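# each column of the 8-row matrix is one a/b/s combination;
# DF2$mean is recycled down every column
DF[seq(from=1, to=nrow(DF), by=8),2:4][which.min(
  colSums((matrix(DF$ex, nrow=8) - DF2$mean)^2)),]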
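
Both expressions above rely on DF being sorted into blocks of 8 rows in the same time order as DF2. If that cannot be guaranteed, one possible alternative (a sketch, not part of the solution above; `merged`, `counts`, `sse`, and `best` are illustrative names) is to match rows on times with `merge` and sum the squared errors per combination with `aggregate`:

```r
# Example data from the question
times <- c(0, 20, 40, 60, 120, 180, 240, 360)
DF <- data.frame(
  times = rep(times, 2),
  a     = rep(c(59, 60), each = 8),
  b     = 140,
  s     = 1e-4,
  ex    = c(1, 0, 0, 2, 20, 30, 31, 37,
            0, 0, 0, 0, 3300, 6600, 7700, 7700)
)
DF2 <- data.frame(
  times = times,
  mean  = c(1, 0.7202642, 0.8000305, 1.7430986,
            16.5172242, 25.6521268, 33.9140056, 34.5735984)
)

# Reverse the rows to show that the row order no longer matters
DF <- DF[rev(seq_len(nrow(DF))), ]

# Attach the matching mean to every DF row via the shared times column
merged <- merge(DF, DF2, by = "times")
merged$sq_err <- (merged$ex - merged$mean)^2

# Sanity check: every a/b/s combination has one row per DF2 time point
counts <- aggregate(sq_err ~ a + b + s, data = merged, FUN = length)
stopifnot(all(counts$sq_err == nrow(DF2)))

# Sum of squared errors per a/b/s combination, then pick the smallest
sse  <- aggregate(sq_err ~ a + b + s, data = merged, FUN = sum)
best <- sse[which.min(sse$sq_err), ]
best
```

This trades the compact matrix arithmetic for a join, which is slower on large inputs but does not depend on how DF happens to be sorted.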
MvG
  • I need to work through this, but is this linear least squares fitting, as opposed to the linear regression I was doing before? I was under the impression that R had some built-in function that would actually do this. Guess not, but this looks good; I'll try it out. – Doug Sep 05 '12 at 01:06
  • @LucasPinto: Linear least squares fitting and linear regression sound pretty much the same, but this is neither. If you had to choose a term, I'd say this is *discrete* least squares fitting: from a set of distinct parameter combinations, you choose the one which results in the least square error. – MvG Sep 05 '12 at 05:50
  • I think the correct term is *nearest neighbor search*: http://en.wikipedia.org/wiki/Nearest_neighbor_search – flodel Sep 05 '12 at 10:24
  • @flodel, I hadn't thought about it in those terms, but you're right, this is a nearest neighbor search. Thanks for providing this term! There might even be some suitable R implementation to help with this task, but given the fact that 8 dimensions isn't that many, and most nearest neighbor algos apparently do k-nearest-neighbor, I guess that might be overkill. – MvG Sep 05 '12 at 11:23
  • @MvG you know, I wonder if there would be a way to fit to individual 'times'. This code fits all 8 time points at once, which wouldn't give the best fit in the case of an outlier, for example. Maybe we can continue this discussion in chat? – Doug Sep 05 '12 at 23:12
  • We can try. I created a [chat room](http://chat.stackoverflow.com/rooms/16345/lucas-pinto-and-mvg) for this. – MvG Sep 06 '12 at 07:38