Questions tagged [regression]

Regression analysis is a collection of statistical techniques for modeling and predicting one or more variables based on other data.

Wiki

Regression is a common applied statistical technique and a cornerstone of machine learning. Various algorithms and software packages can be used to fit and use regression models.

In other words, regression is a statistical technique that estimates the relationship between one dependent variable (usually denoted by Y) and one or more other variables (known as independent variables). Typically the dependent variable is modeled with a probability distribution whose parameters are assumed to vary (deterministically) with the independent variables.
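As a rough illustration of the description above, here is a minimal sketch of the simplest case: one independent variable, with the mean of Y assumed to vary linearly with x, fitted by ordinary least squares. The function names `fit_line` and `predict` and the data are hypothetical, chosen only for this example; real projects would normally use an established package instead.

```python
# Minimal sketch: simple linear regression y ≈ intercept + slope * x,
# fitted by ordinary least squares using only the standard library.
# fit_line and predict are illustrative names, not part of any library.

def fit_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predicted mean of Y for a given value of the independent variable."""
    return intercept + slope * x


# Made-up data lying exactly on y = 1 + 2x, used only to exercise the code.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(xs, ys)
```

With noisy data the fitted line no longer passes through every point; the same formulas then give the line that minimizes the squared vertical distances.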

Tag usage

Questions with this tag should be about implementation and programming problems, not about the statistical or theoretical properties of the technique. Consider whether your question might be better suited to Cross Validated, the Stack Exchange site for statistics and machine learning.

9532 questions
48 votes, 5 answers

Python natural smoothing splines

I am trying to find a Python package that would give an option to fit natural smoothing splines with a user-selectable smoothing factor. Is there an implementation for that? If not, how would you use what is available to implement it yourself? By…
Niko Föhr
  • 28,336
  • 10
  • 93
  • 96
47 votes, 3 answers

How to calculate the regularization parameter in linear regression

When we have a high degree linear polynomial that is used to fit a set of points in a linear regression setup, to prevent overfitting, we use regularization, and we include a lambda parameter in the cost function. This lambda is then used to update…
London guy
  • 27,522
  • 44
  • 121
  • 179
45 votes, 3 answers

Linear Regression with a known fixed intercept in R

I want to calculate a linear regression using the lm() function in R. Additionally I want to get the slope of a regression, where I explicitly give the intercept to lm(). I found an example on the internet and I tried to read the R-help "?lm"…
R_User
  • 10,682
  • 25
  • 79
  • 120
43 votes, 2 answers

What does the capital letter "I" in R linear regression formula mean?

I haven't been able to find an answer to this question, largely because googling anything with a standalone letter (like "I") causes issues. What does the "I" do in a model like this? data(rock) lm(area~I(peri - mean(peri)), data =…
Nancy
  • 3,989
  • 5
  • 31
  • 49
42 votes, 5 answers

Setting values for ntree and mtry for random forest regression model

I'm using the R package randomForest to do a regression on some biological data. My training data size is 38772 x 201. I just wondered: what would be a good value for the number of trees ntree and the number of variables per level mtry? Is there an…
DOSMarter
  • 1,485
  • 5
  • 21
  • 29
40 votes, 4 answers

How to use the Box-Cox power transformation in R

I need to transform some data into a 'normal shape' and I read that Box-Cox can identify the exponent to use to transform the data. From what I understand, car::boxCoxVariable(y) is used for response variables in linear models,…
dede
  • 1,129
  • 5
  • 15
  • 35
40 votes, 4 answers

What is the difference between Multiple R-squared and Adjusted R-squared in a single-variate least squares regression?

Could someone explain to the statistically naive what the difference between Multiple R-squared and Adjusted R-squared is? I am doing a single-variate regression analysis as follows: v.lm <- lm(epm ~ n_days, data=v) …
fmark
  • 57,259
  • 27
  • 100
  • 107
39 votes, 7 answers

predict.lm() with an unknown factor level in test data

I am fitting a model to factor data and predicting. If the newdata in predict.lm() contains a single factor level that is unknown to the model, all of predict.lm() fails and returns an error. Is there a good way to have predict.lm() return a…
Stephan Kolassa
  • 7,953
  • 2
  • 28
  • 48
37 votes, 3 answers

GridSearchCV - XGBoost - Early Stopping

I am trying to do a hyperparameter search using scikit-learn's GridSearchCV on XGBoost. During the grid search I'd like it to stop early, since that reduces search time drastically and (I expect) gives better results on my prediction/regression task.…
ayyayyekokojambo
  • 1,165
  • 3
  • 13
  • 33
36 votes, 3 answers

Difference between cross_val_score and cross_val_predict

I want to evaluate a regression model built with scikit-learn using cross-validation, and I am confused about which of the two functions, cross_val_score and cross_val_predict, I should use. One option would be: cvs = DecisionTreeRegressor(max_depth =…
35 votes, 4 answers

What is the difference between xgb.train and xgb.XGBRegressor (or xgb.XGBClassifier)?

I already know "xgboost.XGBRegressor is a Scikit-Learn Wrapper interface for XGBoost." But do they have any other difference?
Statham
  • 4,000
  • 2
  • 32
  • 45
35 votes, 4 answers

Show confidence limits and prediction limits in scatter plot

I have two arrays of data for height and weight: import numpy as np, matplotlib.pyplot as plt heights = np.array([50,52,53,54,58,60,62,64,66,67,68,70,72,74,76,55,50,45,65]) weights =…
Eric Bal
  • 1,115
  • 3
  • 12
  • 16
35 votes, 3 answers

Scikit-learn cross validation scoring for regression

How can one use cross_val_score for regression? The default scoring seems to be accuracy, which is not very meaningful for regression. Supposedly I would like to use mean squared error, is it possible to specify that in cross_val_score? Tried the…
clwen
  • 20,004
  • 31
  • 77
  • 94
33 votes, 12 answers

ValueError: feature_names mismatch: in xgboost in the predict() function

I have trained an XGBoostRegressor model. When I have to use this trained model for predicting for a new input, the predict() function throws a feature_names mismatch error, although the input feature vector has the same structure as the training…
Sujay S Kumar
  • 621
  • 1
  • 5
  • 10
33 votes, 7 answers

sklearn LogisticRegression and changing the default threshold for classification

I am using LogisticRegression from the sklearn package, and have a quick question about classification. I built a ROC curve for my classifier, and it turns out that the optimal threshold for my training data is around 0.25. I'm assuming that the…