Calculated Mean Squared Error with dataframe of residuals with grouping in dplyr

Question

So I have a data frame of residuals as such:

ID    A    B    C     D 
1    .2 23.3  2.3  4.32
2   2.3  2.4    0     1
3  23.3  1.3   23  3.44
2  34.2   33 56.5  76.5
1   0.3 76.4  3.2  78.5

*There are about 200 variables

How would I calculate Mean Squared Error (MSE) using this by ID?

So basically, the MSE for each individual ID is the goal.

Do you want to calculate the MSE for each colum per individual, so the MSE for column A, and another MSE for column B or do you want to calculate an overall MSE, so per ID, but across all columns? Please explain. Also explain how exactly you want to calculate the MSE as there are different formula in different contexts. — deschen, Nov 11 '20 at 17:47
Overall MSE , per ID. The MSE formular of SUM((X - Xi)^2) @deschen — John Thomas, Nov 11 '20 at 18:03

deschen · Accepted Answer · 2020-11-11T18:33:42.253

library(tidyverse)
df_example %>%
  group_by(ID) %>%
  summarize(across(everything(), ~sum(.x^2)/n()))

which gives:

# A tibble: 3 x 5
     ID       A       B       C      D
  <int>   <dbl>   <dbl>   <dbl>  <dbl>
1     1   0.065 3190.      7.76 3090. 
2     2 587.     547.   1596.   2927. 
3     3 543.       1.69  529      11.8

Note that this gives different results compared to @Bruno's solution. It does give the same results, though, as Neeraj's solution.

I understand the TO in a way that his input already are the residuals in which case I only need to square each of them, create the sum per ID (and for each column) and divide by the observations per ID.

One example for column "A" and ID 2:

Residuals are 2.3 and 34.2
Squared residuals are 5.29 and 1169.64
Sum of squared residuals is 1174.93
MSE is sum of squared residuals divided by 2 = 587.465

Is that correct?

So is the result above correct or not? If so it woudl be great if you could mark one of tehe correct answer as teh accepted one. — deschen, Nov 11 '20 at 22:58

Neeraj · Answer 2 · 2020-11-11T18:14:12.987

0

The mean squared error for residuals is simply defined as:

To get MSE for each column you can simply use apply function in R:

df <- matrix(runif(100), ncol = 10) #dummy data

#generating ID in first column
set.seed(123)
df <- cbind(sample(1:3, 10, replace = TRUE), df)

mse <- aggregate(df[, 2:ncol(df)], by = list(df[, 1]), FUN = function(x) 1/length(x) * sum(x ** 2))

edited Nov 11 '20 at 18:14

answered Nov 11 '20 at 18:00

Neeraj

1,166
9
21

Ok, I understood your problem. – Neeraj Nov 11 '20 at 18:04
You need to do `aggregate(.~ID, df, function(x) mean(x^2))` – Onyambu Nov 11 '20 at 18:22

Bruno · Answer 3 · 2020-11-11T18:13:25.683

Something like this is what I would do

library(tidyverse)

df_example <- tibble::tribble(
  ~ID,   ~A,   ~B,   ~C,   ~D,
   1L,  0.2, 23.3,  2.3, 4.32,
   2L,  2.3,  2.4,    0,    1,
   3L, 23.3,  1.3,   23, 3.44,
   2L, 34.2,   33, 56.5, 76.5,
   1L,  0.3, 76.4,  3.2, 78.5
  )

df_example %>% 
  group_by(ID) %>% 
  summarise(lenght_vector = c_across(cols = c(A:D)) %>% length(),
            sum_vector = c_across(cols = c(A:D)) %>% sum(),
            mean_error = sum_vector/lenght_vector,
            MSE = mean_error %>% sqrt())
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 5
#>      ID lenght_vector sum_vector mean_error   MSE
#>   <int>         <int>      <dbl>      <dbl> <dbl>
#> 1     1             8      189.        23.6  4.85
#> 2     2             8      206.        25.7  5.07
#> 3     3             4       51.0       12.8  3.57

^{Created on 2020-11-11 by the reprex package (v0.3.0)}

Calculated Mean Squared Error with dataframe of residuals with grouping in dplyr

3 Answers3