Questions tagged [linear-regression]

For questions about the linear regression modelling approach

Linear regression is a formalization of relationships between variables in the form of mathematical equations. It describes how one or more random variables are related to one or more other variables. Here the variables are related stochastically rather than deterministically.

Example

Height and age are probabilistically distributed over humans, and they are stochastically related: knowing that a person is aged 30 influences the chance of that person being 4 feet tall, and knowing that a person is aged 13 influences the chance of that person being 6 feet tall.

Model 1

height_i = b0 + b1 * age_i + ε_i, where b0 is the intercept, b1 is the coefficient by which age is multiplied to predict height, ε_i is the error term, and i indexes the subject

Model 2

height_i = b0 + b1 * age_i + b2 * sex_i + ε_i, where the variable sex is dichotomous

In linear regression, user data X is modelled using a linear function Y, and the unknown model parameters W are estimated or learned from the data. For example, a linear regression model for k-dimensional data can be represented as:

Y = w1 x1 + w2 x2 + ... + wk xk

Further reading: Leo Breiman, Statistical Modeling: The Two Cultures — http://projecteuclid.org/download/pdf_1/euclid.ss/1009213726

In R, free software for statistical computing and graphics, the function lm implements linear regression.
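The model above can be sketched in a few lines. This is a minimal illustration with synthetic data (the weights and sample sizes are made up for the demo), estimating the unknown parameters W by ordinary least squares with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 200
X = rng.normal(size=(n, k))                 # n observations of k-dimensional data
w_true = np.array([1.5, -2.0, 0.5])         # "unknown" parameters W, chosen for the demo
y = X @ w_true + 0.1 * rng.normal(size=n)   # noisy linear response Y

# least-squares estimate of W from the data
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With enough data and small noise, `w_hat` recovers the generating weights closely.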

6517 questions
14
votes
2 answers

Use "colon" between two characters as a regressor in lm()

What does it mean when we put a colon : between two characters? I'm sure it's not saying from character A to character B. Here is the code: fit9=lm(Sales~.+Income:Advertising+Price:Age,data=Carseats) Coefficients: Estimate …
Sheryl
  • 721
  • 1
  • 9
  • 17
14
votes
3 answers

How to use formula in R to exclude main effect but retain interaction

I do not want the main effect because it is collinear with a finer factor fixed effect, so it is annoying to have these NAs. In this example: lm(y ~ x * z) I want the interaction of x (numeric) and z (factor), but not the main effect of z.
wolfsatthedoor
  • 7,163
  • 18
  • 46
  • 90
14
votes
3 answers

Can we use the Normal Equation for Logistic Regression?

Just like we use the Normal Equation to find the optimum theta values in Linear Regression, can/can't we use a similar formula for Logistic Regression? If not, why? I'd be grateful if someone could explain the reasoning behind it. Thank…
user2125722
  • 1,289
  • 3
  • 18
  • 29
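There is no single closed form for logistic regression because the log-likelihood is nonlinear in the parameters; the standard fix is Newton's method (IRLS), where every iteration solves a *weighted* normal equation. A minimal NumPy sketch on synthetic data (the data, coefficients, and function name here are illustrative, not from any library):

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """Newton's method (IRLS) for logistic regression: each step solves
    a weighted normal equation; no one-shot closed form exists."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted probabilities
        W = p * (1.0 - p)                   # IRLS weights (diagonal)
        grad = X.T @ (y - p)                # gradient of the log-likelihood
        H = X.T @ (X * W[:, None])          # (negative) Hessian
        w = w + np.linalg.solve(H, grad)    # one Newton step
    return w

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
w_true = np.array([-0.5, 2.0])
y = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)
w_hat = logistic_irls(X, y)
```

Contrast with linear regression, where `np.linalg.solve(X.T @ X, X.T @ y)` finishes in one step.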
14
votes
3 answers

OLS using statsmodel.formula.api versus statsmodel.api

Can anyone explain to me the difference between ols in statsmodel.formula.api versus ols in statsmodel.api? Using the Advertising data from the ISLR text, I ran an ols using both, and got different results. I then compared with scikit-learn's…
Chetan Prabhu
  • 580
  • 3
  • 6
  • 10
14
votes
1 answer

Converting Numpy Lstsq residual value to R^2

I am performing a least squares regression as below (univariate). I would like to express the significance of the result in terms of R^2. Numpy returns an unscaled residual value; what would be a sensible way of normalizing…
whatnick
  • 5,400
  • 3
  • 19
  • 35
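One sensible normalization, under the standard definition, is R^2 = 1 − SS_res / SS_tot: the `residuals` array that `np.linalg.lstsq` returns is exactly SS_res, so it only needs to be divided by the total sum of squares about the mean. A small sketch with made-up data points:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1, 5.0])

A = np.column_stack([x, np.ones_like(x)])        # design matrix [x, 1]
coef, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)

ss_res = residuals[0]                            # unscaled residual from lstsq
ss_tot = np.sum((y - y.mean()) ** 2)             # total sum of squares
r_squared = 1.0 - ss_res / ss_tot
```

Note that `residuals` is non-empty only when the system is overdetermined and the design matrix has full column rank; for a univariate fit with an intercept, R^2 equals the squared correlation of x and y, which makes a handy cross-check.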
14
votes
1 answer

sklearn LinearRegression, why only one coefficient returned by the model?

I'm trying out the scikit-learn LinearRegression model on a simple dataset (it comes from an Andrew Ng Coursera course; it doesn't really matter, see the plot for reference) this is my script import numpy as np import matplotlib.pyplot as plt from…
JackNova
  • 3,911
  • 5
  • 31
  • 49
14
votes
4 answers

R-squared on test data

I fit a linear regression model on 75% of my data set that includes ~11000 observations and 143 variables: gl.fit <- lm(y[1:ceiling(length(y)*(3/4))] ~ ., data= x[1:ceiling(length(y)*(3/4)),]) #3/4 for training and I got an R^2 of 0.43. I then…
H_A
  • 667
  • 2
  • 6
  • 13
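A point worth keeping in mind for questions like this: R^2 on held-out data uses the *test* set's mean in SS_tot, and unlike training R^2 it can legitimately be negative when the model predicts worse than the test mean. A sketch of the computation on synthetic data (the 75/25 split mirrors the question; everything else is made up):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
beta = rng.normal(size=5)
y = X @ beta + rng.normal(size=1000)

n_train = 750                                    # 75% train / 25% test
Xtr, Xte = X[:n_train], X[n_train:]
ytr, yte = y[:n_train], y[n_train:]

coef, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
pred = Xte @ coef
r2_test = 1 - np.sum((yte - pred) ** 2) / np.sum((yte - yte.mean()) ** 2)
```

A large gap between training and test R^2 (as in the question) usually points to overfitting, here made unlikely by having far more observations than predictors.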
14
votes
1 answer

"weighted" regression in R

I have created a script like the one below to do something I call "weighted" regression: library(plyr) set.seed(100) temp.df <- data.frame(uid=1:200, bp=sample(x=c(100:200),size=200,replace=TRUE), …
lokheart
  • 23,743
  • 39
  • 98
  • 169
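The generic trick behind weighted least squares, in any language, is to multiply both the design matrix and the response by the square root of each observation's weight and then fit by ordinary least squares. A minimal NumPy sketch with synthetic data and weights (all values invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
w = rng.uniform(0.5, 2.0, size=200)              # per-observation weights
y = 2.0 + 3.0 * x + rng.normal(size=200) / np.sqrt(w)

A = np.column_stack([np.ones_like(x), x])        # design matrix [1, x]
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
# coef[0] is the intercept, coef[1] the slope
```

In R the same thing is `lm(y ~ x, weights = w)`, which handles the rescaling internally.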
13
votes
1 answer

Fast pairwise simple linear regression between variables in a data frame

I have seen pairwise or general paired simple linear regression many times on Stack Overflow. Here is a toy dataset for this kind of problem. set.seed(0) X <- matrix(runif(100), 100, 5, dimnames = list(1:100, LETTERS[1:5])) b <- c(1, 0.7, 1.3, 2.9,…
Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
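The fast route for this kind of problem, in any language, is to avoid one model-fitting call per column and use the closed form for a simple regression slope, cov(x_j, y) / var(x_j), vectorized over all columns at once. A NumPy sketch mirroring the question's toy setup (coefficients truncated in the excerpt are replaced by invented ones):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 5))
y = X @ np.array([1.0, 0.7, 1.3, 2.9, -2.0]) + rng.normal(size=100)

Xc = X - X.mean(axis=0)                 # center each regressor
yc = y - y.mean()
slopes = (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)
intercepts = y.mean() - slopes * X.mean(axis=0)
```

Each (slope, intercept) pair matches a separate simple regression of y on that single column, but the whole batch costs one pass over the matrix.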
13
votes
2 answers

Multiple Linear Regression with specific constraint on each coefficients on Python

I am currently running a multiple linear regression on a dataset. At first, I didn't realize I needed to put constraints on my weights; as a matter of fact, I need to have specific positive & negative weights. To be more precise, I am doing a…
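Sign constraints on individual coefficients turn this into box-constrained least squares. In practice one would reach for a solver with built-in bounds (scipy's `lsq_linear`, for instance), but the idea can be sketched in plain NumPy with projected gradient descent; everything below (function name, data, bounds) is illustrative:

```python
import numpy as np

def constrained_lstsq(X, y, lower, upper, n_iter=5000):
    """Least squares with per-coefficient box constraints via
    projected gradient descent (a sketch, not a production solver)."""
    lr = 1.0 / np.linalg.norm(X, 2) ** 2        # safe step from the spectral norm
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = np.clip(w - lr * grad, lower, upper)  # project back into the box
    return w

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3))
y = X @ np.array([2.0, -1.5, 0.8]) + 0.1 * rng.normal(size=300)

# force coefficient 0 positive, coefficient 1 negative, coefficient 2 free
w = constrained_lstsq(X, y, lower=[0, -np.inf, -np.inf], upper=[np.inf, 0, np.inf])
```

When the unconstrained optimum already satisfies the bounds (as here), the result coincides with ordinary least squares; otherwise the active constraints pin the offending coefficients to zero.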
13
votes
1 answer

Why do `sklearn` and `statsmodels` implementations of OLS regression give different R^2?

I noticed by accident that OLS models implemented by sklearn and statsmodels yield different values of R^2 when not fitting an intercept. Otherwise they seem to work fine. The following code yields: import numpy as np import sklearn import…
abukaj
  • 2,582
  • 1
  • 22
  • 45
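The usual culprit in this discrepancy is the definition of SS_tot when there is no intercept: statsmodels documents that it switches to the *uncentered* total sum of squares (about 0) for models without a constant, while sklearn's score always centers about the mean. Both definitions can be computed directly in NumPy to see the gap (data invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(1, 2, size=50)
y = 3.0 * x + 0.3 * rng.normal(size=50)

slope = (x @ y) / (x @ x)               # OLS through the origin (no intercept)
resid = y - slope * x

ss_res = resid @ resid
r2_centered = 1 - ss_res / ((y - y.mean()) @ (y - y.mean()))  # sklearn-style
r2_uncentered = 1 - ss_res / (y @ y)                          # statsmodels-style
```

The uncentered version is always at least as large here, because SS_tot about zero exceeds SS_tot about the mean.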
13
votes
1 answer

How does `poly()` generate orthogonal polynomials? How to understand the "coefs" returned?

My understanding of orthogonal polynomials is that they take the form y(x) = a1 + a2(x - c1) + a3(x - c2)(x - c3) + a4(x - c4)(x - c5)(x - c6)... up to the number of terms desired, where a1, a2, etc. are the coefficients of each orthogonal term (vary…
pyg
  • 716
  • 6
  • 18
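The core construction (this sketch assumes it mirrors the QR approach R's poly() is documented to use) is: build the Vandermonde matrix of the centered x, take its QR decomposition, and keep the orthonormal columns beyond the constant one. In NumPy:

```python
import numpy as np

def ortho_poly(x, degree):
    """Orthonormal polynomial basis in x up to the given degree,
    built from a QR decomposition of the centered Vandermonde matrix
    (a sketch of the idea behind R's poly())."""
    x = np.asarray(x, dtype=float)
    V = np.vander(x - x.mean(), degree + 1, increasing=True)  # 1, x, x^2, ...
    Q, R = np.linalg.qr(V)
    return Q[:, 1:]                     # drop the constant column, as poly() does

x = np.arange(10, dtype=float)
P = ortho_poly(x, 3)
```

The columns of `P` are mutually orthogonal with unit norm; R's "coefs" attribute stores the centering and scaling constants so the same basis can be reproduced on new data at prediction time.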
13
votes
2 answers

Why is the built-in lm function so slow in R?

I always thought that the lm function was extremely fast in R, but as this example would suggest, the closed-form solution computed using the solve function is much faster. data<-data.frame(y=rnorm(1000),x1=rnorm(1000),x2=rnorm(1000)) X =…
adaien
  • 1,932
  • 1
  • 12
  • 26
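The gap the question observes is reproducible in any language: solving the normal equations X'X b = X'y directly is fast but numerically less stable than the QR/SVD route that lm (and, for example, np.linalg.lstsq) takes, and lm additionally builds the model frame and inference quantities. A NumPy sketch of the two routes on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 2))])
y = rng.normal(size=1000)

b_normal = np.linalg.solve(X.T @ X, X.T @ y)    # closed-form normal equations
b_qr, *_ = np.linalg.lstsq(X, y, rcond=None)    # QR/SVD-based, more stable
```

On a well-conditioned problem like this the two agree to machine precision; the stable route earns its overhead only when X is close to rank-deficient.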
13
votes
2 answers

Regression (logistic) in R: Finding x value (predictor) for a particular y value (outcome)

I've fitted a logistic regression model that predicts the binary outcome vs from mpg (mtcars dataset). The plot is shown below. How can I determine the mpg value for any particular vs value? For example, I'm interested in finding out what the mpg…
hsl
  • 670
  • 2
  • 10
  • 22
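Since a logistic fit is linear on the logit scale, the inverse prediction has a closed form: x = (logit(p) - b0) / b1. A tiny sketch (the coefficients below are hypothetical placeholders, not the actual mtcars fit):

```python
import numpy as np

def x_for_probability(p, b0, b1):
    """Invert logit(p) = b0 + b1 * x for x."""
    return (np.log(p / (1.0 - p)) - b0) / b1

# hypothetical coefficients for illustration only
b0, b1 = -8.0, 0.4
x50 = x_for_probability(0.5, b0, b1)    # x at which the predicted p is 0.5
```

At p = 0.5 the logit is zero, so x50 = -b0 / b1; plugging x50 back into the model returns the target probability.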
13
votes
4 answers

How to do linear regression, taking errorbars into account?

I am doing a computer simulation for some physical system of finite size, and afterwards I am extrapolating to infinity (the thermodynamic limit). Some theory says that the data should scale linearly with system size, so I am doing linear…
Vladimir
  • 369
  • 1
  • 3
  • 12
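When each data point comes with its own error bar sigma_i, the standard approach is weighted least squares with weight 1/sigma_i^2. With np.polyfit this means passing w = 1/sigma, since its documented `w` argument multiplies the residuals and therefore expects 1/sigma rather than 1/sigma^2. A sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 1, 30)
sigma = rng.uniform(0.05, 0.3, size=30)          # per-point error bars
y = 1.0 + 2.0 * x + sigma * rng.normal(size=30)  # noise scaled by each sigma

# weighted linear fit: polyfit's w is 1/sigma, not 1/sigma**2
slope, intercept = np.polyfit(x, y, 1, w=1.0 / sigma)
```

Points with small error bars then dominate the fit, which is exactly what an extrapolation to the thermodynamic limit should want.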