I want to compute the external studentized residuals of a dataset {x,y} of size n in R given the following constraints:
- (very) high precision
- high performance (avoiding loops where possible)
- R language (including RCPP)
The R code should be fast because it will be used extensively (10^9 times minimum) on multiple data sets with n in [10^3, 10^6]. This question is part of a larger work for estimating a custom statistic that requires the studentized residuals. The most computational part is the one presented here. Thus, solving this would dramatically improve the overall efficiency.
On the lm() regression
To gather the studentized external residuals, one typically runs lm()
then rstudent()
. The R function uses an aproach that avoid running n regressions for estimating the studentized residuals and that saves a lot of execution time. However, I prefer not to use lm()
because I only need the residuals without all that fancy additional stuff that comes with it (thus saving some more execution time).
When trying to decipher the R source code for the external residuals in the lm()
I found it somewhat obscur, as it seems to call sample code from other external files (an example is the influence()
function). Thus, at this time I failed at collecting enough information to replicate the code section using the source code only.
Relevant topic(s) on Stack
The following relevant topic has been found in Stack: How to compute Studentized Residuals in Python?
A R implementation of the Python procedure including a minimal example is given (corrected by @Stéphane Laurent, see answers):
n = 10
set.seed(1)
x = rnorm(n)
y = rnorm(n)
m = 2
mean_y = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_y) %*% (y - mean_y)
beta_1 = ((y - mean_y) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_y
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_y) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- lm.fit(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
reg = rstudent(lm(x ~ y))
res = cbind(reg, studentized_residuals)
Produce the following differences:
index reg studentized_residuals
1 -0,595911898846465 -0,581348373714385
2 0,116208945967327 0,116097011762269
3 -2,04779452591111 -1,61939642040734
4 2,26350621688535 1,71995630000724
5 0,603322309518977 0,588222428131761
6 -1,5460639774285 -1,33486217871738
7 0,367900050364855 0,364393996552621
8 1,14745971090533 1,05271762293388
9 0,823888320713653 0,786630743176311
10 -0,449839343257121 -0,443475039943641
Minimal example
The following R attemp has been tested using arbitrary datasets, just for illustration purposes.
It uses lm()
/ rstudent()
and is way too slow for our practical application. The two parameters n1
and n2
correspond to the number of iterations and the size of the vector (denoted n in the above) respectively. To match our problem, we would typically pick n1
in [10^6, 10^9] and n2
in [10^3, 10^6] :
Stud = function(n1, n2){
res = data.frame(matrix(vector(), n2, n1))
for(i in 1 : n1){
x = rnorm(n2)
y = rnorm(n2)
reg = lm(x ~ y)
res[, i] = rstudent(reg)
}
}
Update and additional (full) minimal example for benchmark:
Here we show a complete benchmark where various functions of Stack are tested against lm()
in the objective of gathering the studentized externals residuals. For gathering these residuals we need to run 'n' regressions. Results are given after the code for 100 and 500 replications.
#Packages
install.packages("Rcpp")
library(Rcpp)
install.packages("RcppArmadillo")
library(RcppArmadillo)
install.packages("RcppEigen")
library(RcppEigen)
install.packages("stats")
library(stats)
install.packages("speedglm")
library(speedglm)
install.packages("Rfast")
library(Rfast)
install.packages("rbenchmark")
library(rbenchmark)
## start from SEXP, most conversions, longest code
src <- '
Rcpp::List fLmSEXP(SEXP Xs, SEXP ys) {
Rcpp::NumericMatrix Xr(Xs);
Rcpp::NumericVector yr(ys);
int n = Xr.nrow(), k = Xr.ncol();
arma::mat X(Xr.begin(), n, k, false);
arma::colvec y(yr.begin(), yr.size(), false);
int df = n - k;
// fit model y ~ X, extract residuals
arma::colvec coef = arma::solve(X, y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(), res.end(),
res.begin(), 0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,
Rcpp::Named("stderr") =sderr,
Rcpp::Named("df") =df,
Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src, depends="RcppArmadillo")
## start from Rcpp types are early RcppArmadillo examples did
src <- '
Rcpp::List fLmTwoCasts(Rcpp::NumericMatrix Xr, Rcpp::NumericVector yr) {
int n = Xr.nrow(), k = Xr.ncol();
arma::mat X(Xr.begin(), n, k, false);
arma::colvec y(yr.begin(), yr.size(), false);
int df = n - k;
// fit model y ~ X, extract residuals
arma::colvec coef = arma::solve(X, y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(), res.end(),
res.begin(), 0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,
Rcpp::Named("stderr") =sderr,
Rcpp::Named("df") =df,
Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src, depends="RcppArmadillo")
## start from Armadillo types
src <- '
Rcpp::List fLmOneCast(arma::mat X, arma::colvec y) {
int df = X.n_rows - X.n_cols;
// fit model y ~ X, extract residuals
arma::colvec coef = arma::solve(X, y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(), res.end(),
res.begin(), 0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,
Rcpp::Named("stderr") =sderr,
Rcpp::Named("df") =df,
Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src, depends="RcppArmadillo")
## start from Armadillo types passed as constant references
src <- '
Rcpp::List fLmConstRef(const arma::mat & X, const arma::colvec & y) {
int df = X.n_rows - X.n_cols;
// fit model y ~ X, extract residuals
arma::colvec coef = arma::solve(X, y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(), res.end(),
res.begin(), 0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,
Rcpp::Named("stderr") =sderr,
Rcpp::Named("df") =df,
Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src, depends="RcppArmadillo")
#Benchmark
data = benchmark("OneCast" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- fLmOneCast(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"TwoCast" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- fLmTwoCasts(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Const" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- fLmConstRef(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Sexp" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- fLmSEXP(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Fast" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- fastLm(x[-i] ~ y[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Speed" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- speedlm(x[-i] ~ y[-i], fitted = T)
sum((x[-i] - fit$fitted.values)^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
".Fit" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- lm.fit(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Fit" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- lmfit(cbind(1, y[-i]), x[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Lm" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n, function(i){
fit <- lm(x[-i] ~ y[-i])
sum(fit$residuals^2)
}, numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},
"Basic" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
reg <- lm(x ~ y)
reg_stud <- rstudent(reg)
},
replications = 500,
columns = c("test", "elapsed", "replications"))
Results:
On this single benchmark, the rstudent(lm())
is much faster than everything else:
test elapsed replications
7 .Fit 13.84 100
10 Basic 0.25 100
3 Const 7.37 100
5 Fast 99.84 100
8 Fit 7.06 100
9 Lm 105.25 100
1 OneCast 7.61 100
4 Sexp 7.66 100
6 Speed 184.76 100
2 TwoCast 7.17 100
7 .Fit 63.63 500
10 Basic 0.93 500
3 Const 34.44 500
5 Fast 438.95 500
8 Fit 31.11 500
9 Lm 471.37 500
1 OneCast 34.29 500
4 Sexp 33.48 500
6 Speed 794.73 500
2 TwoCast 33.51 500
Interpretation
It seems that R uses an analytical alternative that avoid using 'n' regressions, resulting in a much faster computation.
Thus, the question still remains: How to be competitive in regards to rstudent(lm())
, and how to reverse-engeering the original source code (that is difficult to gather) ?
Final results
We compared the solutions of @Onyambu, @tester and @Stéphane Laurent. We found the solution of @Onyambu to be the fastest one for different vector sizes, while providing results exactly equal to those of rstudent()
.