0

So I have a data frame of residuals as such:

ID    A    B    C     D 
1    .2 23.3  2.3  4.32
2   2.3  2.4    0     1
3  23.3  1.3   23  3.44
2  34.2   33 56.5  76.5
1   0.3 76.4  3.2  78.5

*There are about 200 variables

How would I calculate Mean Squared Error (MSE) using this by ID?

So basically, the MSE for each individual ID is the goal.

Werner Hertzog
  • 2,002
  • 3
  • 24
  • 36
John Thomas
  • 1,075
  • 9
  • 32
  • Do you want to calculate the MSE for each colum per individual, so the MSE for column A, and another MSE for column B or do you want to calculate an overall MSE, so per ID, but across all columns? Please explain. Also explain how exactly you want to calculate the MSE as there are different formula in different contexts. – deschen Nov 11 '20 at 17:47
  • Overall MSE , per ID. The MSE formular of SUM((X - Xi)^2) @deschen – John Thomas Nov 11 '20 at 18:03

3 Answers3

1
library(tidyverse)
df_example %>%
  group_by(ID) %>%
  summarize(across(everything(), ~sum(.x^2)/n()))

which gives:

# A tibble: 3 x 5
     ID       A       B       C      D
  <int>   <dbl>   <dbl>   <dbl>  <dbl>
1     1   0.065 3190.      7.76 3090. 
2     2 587.     547.   1596.   2927. 
3     3 543.       1.69  529      11.8

Note that this gives different results compared to @Bruno's solution. It does give the same results, though, as Neeraj's solution.

I understand the TO in a way that his input already are the residuals in which case I only need to square each of them, create the sum per ID (and for each column) and divide by the observations per ID.

One example for column "A" and ID 2:

  • Residuals are 2.3 and 34.2
  • Squared residuals are 5.29 and 1169.64
  • Sum of squared residuals is 1174.93
  • MSE is sum of squared residuals divided by 2 = 587.465

Is that correct?

deschen
  • 10,012
  • 3
  • 27
  • 50
0

The mean squared error for residuals is simply defined as:

MSE for residuals

To get MSE for each column you can simply use apply function in R:

df <- matrix(runif(100), ncol = 10) #dummy data

#generating ID in first column
set.seed(123)
df <- cbind(sample(1:3, 10, replace = TRUE), df)

mse <- aggregate(df[, 2:ncol(df)], by = list(df[, 1]), FUN = function(x) 1/length(x) * sum(x ** 2))
Neeraj
  • 1,166
  • 9
  • 21
0

Something like this is what I would do

library(tidyverse)

df_example <- tibble::tribble(
  ~ID,   ~A,   ~B,   ~C,   ~D,
   1L,  0.2, 23.3,  2.3, 4.32,
   2L,  2.3,  2.4,    0,    1,
   3L, 23.3,  1.3,   23, 3.44,
   2L, 34.2,   33, 56.5, 76.5,
   1L,  0.3, 76.4,  3.2, 78.5
  )

df_example %>% 
  group_by(ID) %>% 
  summarise(lenght_vector = c_across(cols = c(A:D)) %>% length(),
            sum_vector = c_across(cols = c(A:D)) %>% sum(),
            mean_error = sum_vector/lenght_vector,
            MSE = mean_error %>% sqrt())
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 5
#>      ID lenght_vector sum_vector mean_error   MSE
#>   <int>         <int>      <dbl>      <dbl> <dbl>
#> 1     1             8      189.        23.6  4.85
#> 2     2             8      206.        25.7  5.07
#> 3     3             4       51.0       12.8  3.57

Created on 2020-11-11 by the reprex package (v0.3.0)

Bruno
  • 4,109
  • 1
  • 9
  • 27