Questions tagged [linear-regression]

For issues related to the linear regression modelling approach.

Linear Regression is a formalization of relationships between variables in the form of mathematical equations. It describes how one or more random variables are related to one or more other variables. Here the variables are not deterministically but stochastically related.

Example

Height and age are probabilistically distributed over humans. They are stochastically related; when you know that a person is of age 30, this influences the chance of this person being 4 feet tall. When you know that a person is of age 13, this influences the chance of this person being 6 feet tall.

Model 1

$\text{height}_i = b_0 + b_1\,\text{age}_i + \varepsilon_i$, where $b_0$ is the intercept, $b_1$ is a parameter that age is multiplied by to get a prediction of height, $\varepsilon_i$ is the error term, and $i$ is the subject

Model 2

$\text{height}_i = b_0 + b_1\,\text{age}_i + b_2\,\text{sex}_i + \varepsilon_i$, where the variable sex is dichotomous
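
As a rough illustration of how the coefficients in Model 2 might be estimated, here is a minimal sketch using ordinary least squares via `numpy.linalg.lstsq`. The data are synthetic and the "true" coefficients (`b0_true`, `b1_true`, `b2_true`), the sample size, and the noise level are all made-up assumptions for illustration; none of them come from the text above.

```python
import numpy as np

# Synthetic, made-up data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(5, 18, size=n)      # age in years
sex = rng.integers(0, 2, size=n)      # dichotomous variable coded 0/1

# Hypothetical coefficients used only to generate the data.
b0_true, b1_true, b2_true = 80.0, 5.0, 4.0
noise = rng.normal(0.0, 3.0, size=n)  # plays the role of the error term
height = b0_true + b1_true * age + b2_true * sex + noise

# Design matrix: a column of ones for the intercept, then age and sex.
X = np.column_stack([np.ones(n), age, sex])

# Ordinary least squares estimate of (b0, b1, b2).
coef, *_ = np.linalg.lstsq(X, height, rcond=None)
b0_hat, b1_hat, b2_hat = coef

# Residuals: the part of height the fitted model does not explain.
residuals = height - X @ coef
print(b0_hat, b1_hat, b2_hat)
```

Dropping the `sex` column from `X` would give the corresponding fit for Model 1. Because the relationship is stochastic, the estimates are close to, but not exactly equal to, the values used to generate the data.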

In linear regression, the response Y is modelled as a linear function of the user data X, and the unknown model parameters W are estimated or learned from the data. E.g., a linear regression model for k-dimensional user data can be represented as:

$Y = w_1 x_1 + w_2 x_2 + \dots + w_k x_k$
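
The idea of learning the weights from data can be sketched as follows. This is a minimal example under stated assumptions: the helper names `fit_weights` and `predict` are hypothetical, the data matrix and `w_true` are made up, and the data are noise-free so that the least-squares solution recovers the weights. As in the formula above, there is no intercept term.

```python
import numpy as np

def fit_weights(X, y):
    """Least-squares estimate of w in y ≈ X @ w (no intercept term)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict(X, w):
    """Evaluate Y = w1*x1 + ... + wk*xk for each row of X."""
    return X @ w

# Made-up 3-dimensional data: each row is one observation (x1, x2, x3).
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
w_true = np.array([0.5, -1.0, 2.0])  # hypothetical weights
y = X @ w_true                       # noise-free responses

w_hat = fit_weights(X, y)
y_new = predict(np.array([[1.0, 1.0, 1.0]]), w_hat)
```

With real, noisy data the recovered weights would only approximate the underlying ones, and an intercept can be included by appending a column of ones to `X`.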

Further reading: Statistical Modeling: The Two Cultures, http://projecteuclid.org/download/pdf_1/euclid.ss/1009213726

In R, scientific software for statistical computing and graphics, the function lm implements linear regression.

6517 questions
21
votes
6 answers

AnalysisException: u"cannot resolve 'name' given input columns: [ list] in sqlContext in spark

I tried a simple example like: data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/databricks-datasets/samples/population-vs-price/data_geo.csv") data.cache() # Cache data for faster reuse data =…
Elm662
  • 663
  • 1
  • 5
  • 18
20
votes
1 answer

Pandas DataFrame - 'cannot astype a datetimelike from [datetime64[ns]] to [float64]' when using ols/linear regression

I have a DataFrame as follows: Ticker Date Close 0 ADBE 2016-02-16 78.88 1 ADBE 2016-02-17 81.85 2 ADBE 2016-02-18 80.53 3 ADBE 2016-02-19 80.87 4 ADBE 2016-02-22 83.80 5 ADBE 2016-02-23 83.07 ...and so on. The Date column is…
Cole Starbuck
  • 603
  • 3
  • 11
  • 21
20
votes
2 answers

How can I plot my R Squared value on my scatterplot using R?

This seems a simple question, so I hope it's a simple answer. I am plotting my points and fitting a linear model, which I can do OK. I then want to plot some summary statistics, for example the R Squared value, on the plot also. I can only seem to…
phrozenpenguin
  • 677
  • 2
  • 7
  • 7
20
votes
5 answers

How to add a line of best fit to scatter plot

I'm currently working with Pandas and matplotlib to perform some data visualization and I want to add a line of best fit to my scatter plot. Here is my code: import matplotlib import matplotlib.pyplot as plt import pandas as panda import numpy as…
JavascriptLoser
  • 1,853
  • 5
  • 34
  • 61
19
votes
3 answers

Specifying which category to treat as the base with 'statsmodels'

I understand that when I have a category variable in a model passed to a statsmodels fit, dummy variables will automatically be generated for the categories. For example, if I have a variable 'Location' with values 'IndianOcean', 'Thailand',…
orome
  • 45,163
  • 57
  • 202
  • 418
18
votes
2 answers

Is there a Java library for better linear regression? (E.g., iteratively reweighted least squares)

I am struggling to find a way to perform better linear regression. I have been using the Moore-Penrose pseudoinverse and QR decomposition with JAMA library, but the results are not satisfactory. Would ojAlgo be useful? I have been hitting…
18
votes
2 answers

How to compute AIC for linear regression model in Python?

I want to compute AIC for linear models to compare their complexity. I did it as follows: regr = linear_model.LinearRegression() regr.fit(X, y) aic_intercept_slope = aic(y, regr.coef_[0] * X.as_matrix() + regr.intercept_, k=1) def aic(y, y_pred,…
YNR
  • 867
  • 2
  • 13
  • 28
18
votes
3 answers

How to check for correlation among continuous and categorical variables?

I have a dataset including categorical variables (binary) and continuous variables. I'm trying to apply a linear regression model for predicting a continuous variable. Can someone please let me know how to check for correlation among the categorical…
funnyguy
  • 229
  • 1
  • 3
  • 12
17
votes
6 answers

AttributeError: module 'statsmodels.formula.api' has no attribute 'OLS'

I am trying to use Ordinary Least Squares for multivariable regression. But it says that there is no attribute 'OLS' from the statsmodels.formula.api library. I am following the code from a lecture on Udemy. The code is as follows: import…
17
votes
1 answer

What is the most accurate method in python for computing the minimum norm solution or the solution obtained from the pseudo-inverse?

My goal is to solve: Kc=y with the pseudo-inverse (i.e. minimum norm solution): c=K^{+}y such that the model is (hopefully) a high degree polynomial model f(x) = sum_i c_i x^i. I am especially interested in the underdetermined case where we have more…
Charlie Parker
  • 5,884
  • 57
  • 198
  • 323
17
votes
1 answer

plot.lm(): extracting numbers labelled in the diagnostic Q-Q plot

For the simple example below, you can see that there are certain points that are identified in the ensuing plots. How can I extract the row numbers identified in these plots, especially the Normal Q-Q plot? set.seed(2016) maya <-…
Reuben Mathew
  • 598
  • 4
  • 22
17
votes
4 answers

What is the BigO of linear regression?

How large a system is it reasonable to attempt to do a linear regression on? Specifically: I have a system with ~300K sample points and ~1200 linear terms. Is this computationally feasible?
BCS
  • 75,627
  • 68
  • 187
  • 294
16
votes
2 answers

Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample

While I am predicting one sample from my data, it gives a reshape error, but my model has an equal number of rows. Here is my code: import pandas as pd from sklearn.linear_model import LinearRegression import numpy as np x = np.array([2.0 , 2.4, 1.5,…
user11585758
16
votes
2 answers

Linear regression with dummy/categorical variables

I have a set of data. I have used pandas to convert them into dummy and categorical variables respectively. So, now I want to know how to run a multiple linear regression (I am using statsmodels) in Python. Are there some considerations or maybe I…
16
votes
1 answer

Using a smoother with the L Method to determine the number of K-Means clusters

Has anyone tried to apply a smoother to the evaluation metric before applying the L-method to determine the number of k-means clusters in a dataset? If so, did it improve the results? Or allow a lower number of k-means trials and hence much greater…
winwaed
  • 7,645
  • 6
  • 36
  • 81