Questions tagged [linear-regression]

For issues related to the linear regression modelling approach.

Linear regression is a formalization of relationships between variables in the form of mathematical equations. It describes how one or more random variables are related to one or more other variables. The variables are related stochastically, not deterministically.

Example

Height and age are probabilistically distributed over humans. They are stochastically related: knowing that a person is 30 years old changes the probability that this person is 4 feet tall. Likewise, knowing that a person is 13 years old changes the probability that this person is 6 feet tall.

Model 1

$\text{height}_i = b_0 + b_1\,\text{age}_i + \varepsilon_i$, where $b_0$ is the intercept, $b_1$ is the parameter that age is multiplied by to get a prediction of height, $\varepsilon_i$ is the error term, and $i$ indexes the subject.
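As a rough illustration of Model 1, the sketch below estimates b0 and b1 by ordinary least squares using the closed-form expressions for a single predictor. The data are synthetic and the "true" parameter values (80.0 and 5.5, with heights in cm) are invented purely for the example; nothing here comes from a real dataset.

```python
import numpy as np

# Synthetic, made-up data (an assumption for illustration only):
# ages in years, heights in cm.
rng = np.random.default_rng(0)
age = rng.uniform(2, 18, size=200)
true_b0, true_b1 = 80.0, 5.5  # hypothetical "true" parameters
height = true_b0 + true_b1 * age + rng.normal(0, 4.0, size=200)

# Closed-form least-squares estimates for one predictor:
# slope = sample covariance(age, height) / sample variance(age)
b1 = np.cov(age, height, ddof=1)[0, 1] / np.var(age, ddof=1)
# intercept makes the fitted line pass through the point of means
b0 = height.mean() - b1 * age.mean()

# Residuals are the estimated error terms, one per subject i
residuals = height - (b0 + b1 * age)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

With an intercept in the model, the least-squares residuals average to zero (up to floating-point error), which is a quick sanity check on the fit.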

Model 2

$\text{height}_i = b_0 + b_1\,\text{age}_i + b_2\,\text{sex}_i + \varepsilon_i$, where the variable sex is dichotomous
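One possible way to fit Model 2 is to code the dichotomous variable as 0/1 and solve the least-squares problem on a design matrix with a column of ones for the intercept. This is a minimal sketch on synthetic data; the coefficient values used to generate the data and the choice of which level is coded as 1 are assumptions made for the example.

```python
import numpy as np

# Synthetic, made-up data (an assumption for illustration only)
rng = np.random.default_rng(1)
n = 300
age = rng.uniform(2, 18, size=n)
sex = rng.integers(0, 2, size=n)  # 0/1 dummy coding; which level is 1 is arbitrary
height = 80.0 + 5.5 * age + 3.0 * sex + rng.normal(0, 4.0, size=n)

# Design matrix: intercept column, age, and the 0/1 dummy for sex
X = np.column_stack([np.ones(n), age, sex])

# Least-squares solution; rcond=None selects NumPy's default cutoff
coef, *_ = np.linalg.lstsq(X, height, rcond=None)
b0, b1, b2 = coef
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```

Under this coding, b2 is the estimated difference in expected height between the two groups at the same age.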

In linear regression, the response Y is modelled as a linear function of the user data X, and the unknown model parameters W are estimated (learned) from the data. For example, a linear regression model for k-dimensional user data can be written as:

$Y = w_1 x_1 + w_2 x_2 + \dots + w_k x_k$
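To make the k-dimensional form concrete, the following sketch builds noise-free synthetic data from a known weight vector and recovers the weights by solving the normal equations (XᵀX) w = Xᵀ Y. The data, the dimensions, and the weight values are all invented for illustration. The model as written has no intercept; one could be included by appending a column of ones to X.

```python
import numpy as np

# Synthetic, made-up data (an assumption for illustration only)
rng = np.random.default_rng(2)
n, k = 50, 4
X = rng.normal(size=(n, k))               # n observations of k-dimensional data
w_true = np.array([1.5, -2.0, 0.0, 0.7])  # hypothetical weights
Y = X @ w_true                            # Y = w1*x1 + ... + wk*xk, no noise

# Estimate W from the data by solving the normal equations.
# This assumes X.T @ X is invertible (X has full column rank).
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.round(w_hat, 3))
```

Because the data here contain no noise, the estimate matches the generating weights up to floating-point error. For ill-conditioned X, np.linalg.lstsq is usually a numerically safer choice than forming XᵀX explicitly.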

Reading: Statistical Modeling: The Two Cultures, http://projecteuclid.org/download/pdf_1/euclid.ss/1009213726

In R, scientific software for statistical computing and graphics, the function lm implements linear regression.

6517 questions
13
votes
2 answers

Vector autoregressive model fitting with scikit-learn

I am trying to fit vector autoregressive (VAR) models using the generalized linear model fitting methods included in scikit-learn. The linear model has the form y = X w, but the system matrix X has a very peculiar structure: it is block-diagonal,…
MB-F
  • 22,770
  • 4
  • 61
  • 116
12
votes
4 answers

segmented linear regression in python

Is there a library in python to do segmented linear regression? I'd like to fit multiple lines to my data automatically to get something like this: Btw. I do know the number of segments.
P3trus
  • 6,747
  • 8
  • 40
  • 54
12
votes
3 answers

drop_First=true during dummy variable creation in pandas

I have months(Jan, Feb, Mar etc) data in my dataset and I am generating dummy variable using pandas library. pd.get_dummies(df['month'],drop_first=True) I want to understand whether I should use drop_first=True or not in this case? Why is it…
Snehal Gupta
  • 314
  • 1
  • 2
  • 13
12
votes
1 answer

Python: Fastest way to perform millions of simple linear regression with 1 exogenous variable only

I am performing component wise regression on a time series data. This is basically where instead of regressing y against x1, x2, ..., xN, we would regress y against x1 only, y against x2 only, ..., and take the regression that reduces the sum of…
Lim Kaizhuo
  • 714
  • 3
  • 7
  • 16
12
votes
1 answer

How to remove RunTimeWarning Errors from code?

I keep getting RuntimeWarning when I run the regression code at the very bottom. I am not sure how to fix them. I believe it may be the attencoef list because there is some nan values in it. Any suggestions? These are the errors I am…
Adam
  • 419
  • 1
  • 6
  • 14
12
votes
2 answers

statsmodels add_constant for OLS intercept, what is this actually doing?

Reviewing linear regressions via statsmodels OLS fit I see you have to use add_constant to add a constant '1' to all your points in the independent variable(s) before fitting. However my only understanding of intercepts in this context would be the…
Tim Lindsey
  • 727
  • 1
  • 7
  • 18
12
votes
1 answer

Multivariate Regression Neural Network Loss Function

I am doing multivariate regression with a fully connected multilayer neural network in Tensorflow. The network predicts 2 continuous float variables (y1,y2) given an input vector (x1,x2,...xN), i.e. the network has 2 output nodes. With 2 outputs the…
Ron Cohen
  • 2,815
  • 5
  • 30
  • 45
12
votes
2 answers

Print OLS regression summary to text file

I am running OLS regression using pandas.stats.api.ols using a groupby with the following code: from pandas.stats.api import ols df=pd.read_csv(r'F:\file.csv') result=df.groupby(['FID']).apply(lambda d: ols(y=d.loc[:, 'MEAN'], x=d.loc[:,…
Stefano Potter
  • 3,467
  • 10
  • 45
  • 82
12
votes
2 answers

Optimal two variable linear regression calculation

Problem Am looking to apply the y = mx + b equation (where m is SLOPE, b is INTERCEPT) to a data set, which is retrieved as shown in the SQL code. The values from the (MySQL) query are: SLOPE = 0.0276653965651912 INTERCEPT = -57.2338357550468 SQL…
Dave Jarvis
  • 30,436
  • 41
  • 178
  • 315
12
votes
2 answers

Create lm object from data/coefficients

Does anyone know of a function that can create an lm object given a dataset and coefficients? I'm interested in this because I started playing with Bayesian model averaging (BMA) and I'd like to be able to create an lm object out of the results of…
Bob Albright
  • 2,242
  • 2
  • 25
  • 32
11
votes
3 answers

scikit-learn & statsmodels - which R-squared is correct?

I'd like to choose the best algorithm for future. I found some solutions, but I didn't understand which R-Squared value is correct. For this, I divided my data into two as test and training, and I printed two different R squared values…
11
votes
3 answers

plot regression line in R

I want to plot a simple regression line in R. I've entered the data, but the regression line doesn't seem to be right. Can someone help? x <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120) y <- c(10, 18, 25, 29, 30, 28, 25, 22, 18, 15, 11,…
J.doe
  • 225
  • 1
  • 2
  • 9
11
votes
1 answer

how to create DataFrame from multiple arrays in Spark Scala?

val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278) val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5,…
Sam
  • 1,227
  • 3
  • 11
  • 13
11
votes
3 answers

Linear Regression with positive coefficients in Python

I'm trying to find a way to fit a linear regression model with positive coefficients. The only way I found is sklearn's Lasso model, which has a positive=True argument, but doesn't recommend using with alpha=0 (means no other constraints on the…
Oren
  • 258
  • 1
  • 2
  • 10
11
votes
1 answer

Multi Collinearity for Categorical Variables

For Numerical/Continuous data, to detect Collinearity between predictor variables we use the Pearson's Correlation Coefficient and make sure that predictors are not correlated among themselves but are correlated with the response variable. But How…
karthik subramanian
  • 153
  • 1
  • 2
  • 11