I'm trying to understand the relationship between sklearn's .fit() method and its .predict() method; mainly, how exactly data is (typically) passed from one to the other. I haven't found another question on SO that addresses this directly, though some have danced around it (e.g. here)

I've written a custom estimator using the BaseEstimator and RegressorMixin classes, but have run into a NotFittedError a handful of times as I've begun running my data through it. Could someone walk me through a simple linear regression and how the data is passed through the fit and predict methods? No need to get into the math - I understand how regressions work and what the pieces of the puzzle do. Maybe I'm overlooking the obvious and making it more complicated than it should be? But the estimator methods are feeling like a bit of a black box.

alofgran
  • You have an object which keeps all the data inside. – furas Oct 25 '19 at 05:21
  • Right, but if I’m writing a custom estimator, how am I transferring the information gathered from the fit method to the predict method? The error I’m getting is telling me that the model has not yet been fitted, so there’s a disconnect present between these two methods in my custom class. – alofgran Oct 25 '19 at 05:23
  • If you have an error, then show it in the question, with code. – furas Oct 25 '19 at 05:27
  • You have an object - an instance of some class. And a class has variables which are available in all its methods. You use `self.` for this. – furas Oct 25 '19 at 05:27
  • The error is pretty narrow, but it's only a symptom of my broader question. I only mentioned the error to support the question itself. I'm more concerned with understanding how the information is transferred (a broader solution to the underlying problem). – alofgran Oct 25 '19 at 05:29
  • Do you know OOP (Object-Oriented Programming) and how classes work? There is no transferring - both methods have access to the same variables. And it has nothing to do with math or machine learning. – furas Oct 25 '19 at 05:34
  • Have you checked and experimented with, say, the linear regression example in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)? If yes, what particular issues/questions do you have? (If no, please do) – desertnaut Oct 25 '19 at 06:57
  • Yes, @desertnaut. It did not provide a sufficient answer. – alofgran Oct 25 '19 at 14:22
  • @furas, I understand the basics of OOP, but I can't say I'm 100% comfortable with classes (as you can probably tell from my question). I realize that this question has nothing to do with math, and that the explanation I'm seeking is a transferable concept outside of machine learning (though I utilized the 'machine-learning' tag because this example is specifically about machine-learning methods). – alofgran Oct 25 '19 at 14:25
  • So, what about the 2 answers below? – desertnaut Oct 25 '19 at 14:26
  • @desertnaut - checking them now – alofgran Oct 25 '19 at 14:26
  • With functions you would have to do `model = fit(train_data)` and later `predict(model, test_data)` to manually transfer the trained `model` from one function to another. Or you could use (not preferred) `global model` in both functions and run `fit(train_data)` and `predict(test_data)`; then you don't have to transfer `model` manually, because both functions use the same global variable for the model. In a class you have instance variables, which behave like global variables scoped to the object: they keep values outside the methods, so they can be used to transfer data from one method to another without doing it manually (see the sketch below). – furas Oct 25 '19 at 15:21
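
A minimal sketch of the contrast furas describes in that last comment (all names here are illustrative, not sklearn's API):

import numpy as np

# Function style: the fitted state must be handed around manually.
def fit(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares weights
    return w

def predict(w, X):
    return X @ w

# Class style: fit() stores the state on self, predict() reads it back.
class TinyModel:
    def fit(self, X, y):
        self.w_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, X):
        return X @ self.w_  # same object, same attribute - no manual hand-off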

2 Answers

NotFittedError happens when you try to use the .predict() method of your estimator before you have trained it with the .fit() method.
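
For instance, here is a quick way to reproduce the error (the exact message wording varies a bit between scikit-learn versions):

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> reg = LinearRegression()  # instantiated, but never fitted
>>> reg.predict(np.array([[3, 5]]))
Traceback (most recent call last):
...
sklearn.exceptions.NotFittedError: This LinearRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.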

Let's take, for example, the LinearRegression from scikit-learn.

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
>>> # y = 1 * x_0 + 2 * x_1 + 3
>>> y = np.dot(X, np.array([1, 2])) + 3
>>> reg = LinearRegression().fit(X, y)
>>> reg.score(X, y)
1.0
>>> reg.coef_
array([1., 2.])
>>> reg.intercept_ 
3.0000...
>>> reg.predict(np.array([[3, 5]]))
array([16.])

So with the line reg = LinearRegression().fit(X, y) you are instantiating the LinearRegression class and then fitting it to your data X and y, where X holds the independent variables and y your dependent variable. Once the model is trained, the beta coefficients for the linear regression are saved in the class attribute coef_, and you can access them using reg.coef_. That's how the class knows how to predict when you use the .predict() method: it accesses those coefficients, and then it's just simple algebra to produce a prediction.
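
To make this concrete, the prediction above can be reproduced by hand from the stored attributes - it is just the dot product of the inputs with the coefficients, plus the intercept:

>>> X_new = np.array([[3, 5]])
>>> X_new @ reg.coef_ + reg.intercept_  # 3*1 + 5*2 + 3 = 16
array([16.])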

So back to your error: if you aren't fitting the model to your training data, then the class doesn't have the attributes needed to make predictions. Hopefully that clears up some confusion about what's going on inside the class, at least with regard to how the fit() and predict() methods interact.

Ultimately, as commented above, this goes back to the fundamentals of Object-Oriented Programming, so if you want to learn further I would read about how Python handles classes, as scikit-learn models follow the same behavior.

Matthew Barlowe
  • Maybe a demonstration of a `NotFittedError` would be useful? – desertnaut Oct 25 '19 at 06:53
  • Ok. So, the connection between the two methods is through the coef_ attribute set by the fit method. This is what is used by the .predict() method. Therefore, if I'm writing a custom class, inside my .predict() method, do I need to refer to the `coef_` of the instance I fitted? I didn't think that simply calling reg.predict() was sufficient... – alofgran Oct 25 '19 at 14:29
  • @alofgran the connection between the two methods is that they belong to the LinearRegression class. The coef_ is an attribute of the LinearRegression class, not of the method fit(). The class itself is what ties all these things together: it allows the fit method to set the values in coef_ when it is run, and the predict method to access those values when it is run. – Matthew Barlowe Oct 25 '19 at 14:49
  • @MatthewBarlowe, that’s making sense to me. I understand that difference now. Thanks, but the coef_ attribute is what must be referenced in the predict method still to utilize the fitted coefficients stored in the instance of the LinearRegression class, correct? – alofgran Oct 25 '19 at 14:56
  • @alofgran I would assume so although I don’t have the source code for it right in front of me – Matthew Barlowe Oct 25 '19 at 15:07
  • Got it. I referenced the source code for the RegressorMixin in my question above because I figured that's what it might come down to. I haven't been able to figure out where .predict() pulls .coef_ from, though. Thanks for your explanation - I think I'm 90-95% of the way there. – alofgran Oct 25 '19 at 15:10
  • @alofgran every class method has access to the variables or attributes of the class; in this case it's coef_, but it could be anything. – Matthew Barlowe Oct 25 '19 at 15:13
  • Coupled with @furas note on `self.W` below, this correctly answers my question. It validates that the information gathered by the `.fit()` method is stored in the `LinearRegression` class instance under `self.coef_`, which can then be utilized in the `.predict()` method, by calling upon `self.coef_`. – alofgran Oct 25 '19 at 19:14

Let's look at a toy estimator that implements a linear regression:

from sklearn.base import BaseEstimator
import numpy as np

class ToyEstimator(BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y):
        # Append a column of ones so that the last weight acts as the intercept
        X = np.hstack((X, np.ones((len(X), 1))))
        # Solve the normal equation W = (X^T X)^-1 X^T y and store the
        # learned weights on the instance, in self.W
        self.W = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), y)
        self.coef_ = self.W[:-1]
        self.intercept_ = self.W[-1]
        return self

    def transform(self, X):
        # Read the weights that fit() stored on the instance
        X = np.hstack((X, np.ones((len(X), 1))))
        return np.dot(X, self.W)

X = np.random.randn(10, 3)
y = X[:, 0]*1.11 + X[:, 1]*2.22 + X[:, 2]*3.33 + 4.44

reg = ToyEstimator()
reg.fit(X, y)
y_ = reg.transform(X)
print(reg.coef_, reg.intercept_)

Output:

[1.11 2.22 3.33] 4.4399999999999995

So what did the above code do?

  1. In fit we fit/train the weights using the training data. These weights are stored as member variables of the class (this is standard OOP: they live on the instance, via self).
  2. The transform method makes a prediction on the data using the trained weights, which are read back from those same member variables.

So before calling transform you need to call fit, because transform uses the weights that are calculated during fit.

In sklearn modules, if you call transform (or predict) before fit you get a NotFittedError exception.
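
A custom estimator can raise that same exception explicitly. Here is a minimal sketch extending the toy class above with sklearn's check_is_fitted utility (some scikit-learn versions require the attribute names to be passed explicitly, as done here):

import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

class ToyEstimatorChecked(ToyEstimator):
    def transform(self, X):
        # Raises NotFittedError unless fit() has already set self.W
        check_is_fitted(self, 'W')
        return super().transform(X)

try:
    ToyEstimatorChecked().transform(np.random.randn(2, 3))
except NotFittedError as e:
    print(e)  # ... instance is not fitted yet. Call 'fit' ...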

mujjiga
  • Where are the fitted variables referenced in the .transform() method? – alofgran Oct 25 '19 at 14:41
  • @alofgran in `transform()` it should be `self.W` instead of `W` - this is the variable used by `fit()` to keep the data. So both methods use the same variable `self.W` to transfer data from one method to another. – furas Oct 25 '19 at 15:13