9

I've tried searching for an answer to this problem but so far I haven't found any. I used statsmodel to implement an Ordinary Least Squares regression model on a mean-imputed dataset. I can access the list of residuals in the OLS results, but not studentized residuals. How can I calculate/get studentized residuals? I know the formula for calculating studentized residuals but I'm not exactly sure how to code this formula in Python.

Thanks in advance.

UPDATE: I've found the answer. I can get a dataframe containing the studentized residuals from the outlier_test() function from OLS reults.

Hanazono Sakura
  • 651
  • 2
  • 8
  • 12
  • 1
    Welcome to Stack Overflow! It looks like you want us to write some code for you. While many users are willing to produce code for a coder in distress, they usually only help when the poster has already tried to solve the problem on their own. A good way to demonstrate this effort is to include the code you've written so far, example input (if there is any), the expected output, and the output you actually get (console output, stack traces, compiler errors - whatever is applicable). The more detail you provide, the more answers you are likely to receive. – iFlo Aug 03 '17 at 13:11
  • Looking into numpy and scipy as well as the statistics package are a good place to start – Simon Hobbs Aug 03 '17 at 13:16
  • after OLS: there is a method in the Results instance, more info in the "see also" link. http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLSResults.get_influence.html – Josef Aug 03 '17 at 19:17
  • For anyone landing here in the future, statsmodels computes *Studentized Pearson Residuals*. – karlphillip May 07 '18 at 02:06

4 Answers4

5

I was dealing with the same issue. The solution is to use the statsmodels library:

from statsmodels.stats.outliers_influence import OLSInfluence

It has a resid_studentized_internal method included.

philosodad
  • 1,808
  • 14
  • 24
nimi1234
  • 178
  • 1
  • 11
3

Nodar's implementation is incorrect here is the corrected formula from https://newonlinecourses.science.psu.edu/stat501/node/339/ as well as the deleted studentized residual in case people don't want to use statsmodels package. Both formulas return the same result as the examples in the link above

def internally_studentized_residual(X,Y):
    X = np.array(X, dtype=float)
    Y = np.array(Y, dtype=float)
    mean_X = np.mean(X)
    mean_Y = np.mean(Y)
    n = len(X)
    diff_mean_sqr = np.dot((X - mean_X), (X - mean_X))
    beta1 = np.dot((X - mean_X), (Y - mean_Y)) / diff_mean_sqr
    beta0 = mean_Y - beta1 * mean_X
    y_hat = beta0 + beta1 * X
    residuals = Y - y_hat
    h_ii = (X - mean_X) ** 2 / diff_mean_sqr + (1 / n)
    Var_e = math.sqrt(sum((Y - y_hat) ** 2)/(n-2))
    SE_regression = Var_e*((1-h_ii) ** 0.5)
    studentized_residuals = residuals/SE_regression
    return studentized_residuals

def deleted_studentized_residual(X,Y):
    #formula from https://newonlinecourses.science.psu.edu/stat501/node/401/
    r = internally_studentized_residual(X,Y)
    n = len(r)
    return [r_i*math.sqrt((n-2-1)/(n-2-r_i**2)) for r_i in r]
kkawabat
  • 1,530
  • 1
  • 14
  • 37
2

Use the OLSRresults.outlier_test() function to produce a dataset that contains the studentized residual for each observation.

For example:

#import necessary packages and functions
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

#create dataset
df = pd.DataFrame({'rating': [90, 85, 82, 88, 94, 90, 76, 75, 87, 86],
                   'points': [25, 20, 14, 16, 27, 20, 12, 15, 14, 19]})

#fit simple linear regression model
model = ols('rating ~ points', data=df).fit()

#calculate studentized residuals
stud_res = model.outlier_test()

#display studentized residuals
print(stud_res)

student_resid    unadj_p     bonf(p)
0   -0.486471   0.641494    1.000000
1   -0.491937   0.637814    1.000000
2    0.172006   0.868300    1.000000
3    1.287711   0.238781    1.000000
4    0.106923   0.917850    1.000000
5    0.748842   0.478355    1.000000
6   -0.968124   0.365234    1.000000
7   -2.409911   0.046780    0.467801
8    1.688046   0.135258    1.000000
9   -0.014163   0.989095    1.000000

This tutorial provides a full explanation: https://www.statology.org/studentized-residuals-in-python/

Statology
  • 21
  • 2
  • I highly recommend using `model.get_influence().get_resid_studentized_external()`. It's much faster (much meaning 1000~10000 times faster than `outlier_test`). – Dane Lee Sep 15 '22 at 12:31
1

For a simple linear regression, you can calculate studentized residuals using following

define mean of X and Y as :

mean_X = sum(X) / len(X) 
mean_Y = sum(Y) / len(Y) 

Now you have to estimate coefficients beta_0 and beta_1

beta1 = sum([(X[i] - mean_X)*(Y[i] - mean_Y) for i in range(len(X))]) / sum([(X[i] - mean_X)**2 for i in range(len(X))]) 
beta0 = mean_Y - beta1 * mean_X

Now you need to find fitted values, by using this

y_hat = [beta0 + beta1*X[i] for i in range(len(X))]

Now compute Residuals, which is Y - Y_hat

residuals = [Y[i] - y_hat[i] for i in range(len(Y))]

We need to find H matrix which is enter image description here where X is the matrix of our independent variables.

To find leverage, we have to take the diagonal elements of H matrix, in the following way:

leverage = numpy.diagonal(H)

Find Standard Error if regression as

Var_e = sum([(Y[i] - y_hat[i])**2 for i in range(len(Y)) ]) / (len(Y) -2)
SE_regression = math.sqrt(Var_e*[(1-leverage[i]) for i in range len(leverage)])

Now you can compute Studentized Residuals

studentized_residuals = [residuals[i]/SE_regression for i in range(len(residuals))] 

Note that we have two types of studentized residuals. One is Internally Studentized Residuals and second is Externally Studentized Residuals

My solution finds Internally Studentized Residuals.

I made corrections in my calculation. For externally studentized residuals, refer @kkawabat's answer

Okroshiashvili
  • 3,677
  • 2
  • 26
  • 40
  • I believe there's a mistake in your implementation since the result from this did not match the toy example here https://newonlinecourses.science.psu.edu/stat501/node/339/ – kkawabat Jul 23 '19 at 00:43
  • you are missing the leverage term sqrt(1-h_ii) with SE_regression – kkawabat Jul 23 '19 at 01:23
  • 1
    Seems, I missed the leverage term in the calculation. Thanks, @kkawabat for correction. – Okroshiashvili Jul 23 '19 at 07:19