
In classical statistics, the assumptions behind a model are usually stated explicitly (e.g. normality and linearity of the data, independence of observations). But when I read machine learning textbooks and tutorials, the underlying assumptions are not always stated explicitly or completely. What are the major assumptions of the following ML classifiers for binary classification, and which ones are not so important to uphold versus which ones must be upheld strictly?

  • Logistic regression
  • Support vector machine (linear and non-linear kernel)
  • Decision trees
KubiK888

2 Answers


IID (independent and identically distributed) data is the fundamental assumption of almost all statistical learning methods.

Logistic regression is a special case of the GLM (generalized linear model). So, aside from some technical requirements, the strictest restriction lies in the assumed distribution of the response: in the GLM framework the response must follow a distribution in the exponential family (for logistic regression, the Bernoulli/binomial distribution). You can dig deeper at https://en.wikipedia.org/wiki/Generalized_linear_model, and Stanford's CS229 lecture note 1 also has excellent coverage of this topic.
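
To make this concrete, here is a minimal sketch of my own (using statsmodels, which is not mentioned above, and entirely synthetic data) showing that logistic regression is just the GLM with a Binomial response and the default logit link: the two fits should give the same coefficients.

```python
# Sketch, not from the answer: logistic regression fit two ways.
# Dataset and variable names are made up for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # two arbitrary predictors
logit = 0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))   # Bernoulli (exponential-family) response

X_design = sm.add_constant(X)

# Explicit GLM: Binomial family, default logit link
glm_fit = sm.GLM(y, X_design, family=sm.families.Binomial()).fit()

# Dedicated logistic regression model -- same likelihood, so coefficients should match
logit_fit = sm.Logit(y, X_design).fit(disp=0)

print(glm_fit.params)
print(logit_fit.params)
```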

SVM is quite tolerant of the input data, especially the soft-margin version. I cannot recall any specific distributional assumption it makes about the data (please correct me if I am wrong).
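
As a rough illustration (my own sketch, assuming scikit-learn and synthetic data, not part of the original answer): the soft-margin SVM fits happily even when the classes overlap and are not separable; the penalty parameter C only controls how much slack it tolerates.

```python
# Sketch: soft-margin SVM on deliberately overlapping classes.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two overlapping Gaussian blobs -- deliberately not linearly separable
X = np.vstack([rng.normal(-1.0, 1.5, size=(100, 2)),
               rng.normal(+1.0, 1.5, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

for kernel in ("linear", "rbf"):
    for C in (0.1, 10.0):
        clf = SVC(kernel=kernel, C=C).fit(X, y)     # never errors out on overlap
        print(kernel, C, clf.score(X, y))           # training accuracy
```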

Decision trees tell the same story as SVM: they make essentially no distributional assumptions about the input data.
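
One way to see how few assumptions a tree makes (again my own sketch with scikit-learn and synthetic data): since splits are simple thresholds, a strictly monotone transformation of the features should leave the tree's training predictions unchanged, because the same partitions of the data remain available.

```python
# Sketch: tree predictions should be unaffected by a monotone rescaling of the features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_exp = DecisionTreeClassifier(random_state=0).fit(np.exp(X), y)  # strictly monotone transform

same = np.array_equal(tree_raw.predict(X), tree_exp.predict(np.exp(X)))
print("identical predictions after monotone transform:", same)       # expected: True
```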

RogerTR

Great question.

Logistic Regression also assumes the following:

  1. That there is no (or little) multicollinearity (high correlation) among the independent variables.

  2. Even though LR doesn't require the dependent and independent variables to be linearly related, it does require that the independent variables be linearly related to the log odds. The log odds function is simply log(p/(1-p)). A quick check of both assumptions is sketched after this list.
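
Here is a minimal sketch of how one might check both assumptions, assuming statsmodels and pandas and using synthetic data (none of this comes from the answer itself): variance inflation factors for multicollinearity, and an empirical look at whether a predictor is linear in the log odds.

```python
# Sketch: two quick diagnostics for the assumptions listed above.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)        # deliberately collinear with x1
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))
p = 1.0 / (1.0 + np.exp(-(0.3 + 1.0 * x1)))
y = rng.binomial(1, p)

# 1. Multicollinearity: a VIF well above ~5-10 is a common warning sign
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, "VIF =", variance_inflation_factor(X.values, i))

# 2. Linearity in the log odds: bin x1 and compute the empirical log odds per bin;
#    under the assumption these should rise roughly linearly with x1
df = pd.DataFrame({"x1": x1, "y": y})
df["bin"] = pd.qcut(df["x1"], 10)
rate = df.groupby("bin", observed=True)["y"].mean().clip(1e-3, 1 - 1e-3)
print(np.log(rate / (1 - rate)))
```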

msarafzadeh