4

I am trying to train a binary classifier on a huge dataset. Previously I could train it with sklearn's fit method, but now I have more data than I can handle at once. I am trying to fit it in parts but cannot get rid of the errors. How can I train on my data incrementally? With my previous approach I get an error about the Pipeline object. I have gone through the examples from Incremental Learning, but running those code samples still gives an error. I would appreciate any help.

X,y = transform_to_dataset(training_data)

clf = Pipeline([
    ('vectorizer', DictVectorizer()),
    ('classifier', LogisticRegression())])

length=len(X)/2

clf.partial_fit(X[:length],y[:length],classes=np.array([0,1]))

clf.partial_fit(X[length:],y[length:],classes=np.array([0,1]))

ERROR

AttributeError: 'Pipeline' object has no attribute 'partial_fit'

TRYING GIVEN CODE SAMPLES:

clf=SGDClassifier(alpha=.0001, loss='log', penalty='l2', n_jobs=-1,
                      #shuffle=True, n_iter=10, 
                      verbose=1)
length=len(X)/2

clf.partial_fit(X[:length],y[:length],classes=np.array([0,1]))

clf.partial_fit(X[length:],y[length:],classes=np.array([0,1]))

ERROR

File "/home/kntgu/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 573, in check_X_y
ensure_min_features, warn_on_dtype, estimator)
File "/home/kntgu/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
TypeError: float() argument must be a string or a number

My dataset consists of some sentences with their part of speech tags and dependency relations.

Thanks  NN  0   root
to  IN  3   case
all DT  1   nmod
who WP  5   nsubj
volunteered VBD 3   acl:relcl
.   .   1   punct

You PRP 3   nsubj
will    MD  3   aux
remain  VB  0   root
as  IN  5   case
alternates  NNS 3   obl
.   .   3   punct
kntgu

3 Answers

5

A Pipeline object from scikit-learn does not have a partial_fit method, as seen in the docs.

The reason is that you can add any estimator you want to a Pipeline, and not all of them implement partial_fit. Here is a list of the supported estimators.
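A quick way to check this yourself is to ask each class whether it exposes the method. This is only a minimal sketch; the handful of classes below is an illustrative selection, not the full list of supported estimators:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Illustrative selection of classes (an assumption of this sketch).
candidates = [Pipeline, DictVectorizer, LogisticRegression,
              SGDClassifier, MultinomialNB]

# Map each class name to whether the class defines partial_fit.
supports_partial_fit = {cls.__name__: hasattr(cls, "partial_fit")
                        for cls in candidates}

for name, supported in supports_partial_fit.items():
    print(name, supported)
```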

As you see, using SGDClassifier (without Pipeline), you don't get this "no attribute" error, because this specific estimator is supported. The error message you get for this one is probably due to text data: most likely X still contains non-numeric values, since nothing vectorizes it here the way DictVectorizer did in your Pipeline. You can use the LabelEncoder to process the non-numeric columns.
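One way around both errors is to do the vectorizing step yourself and call partial_fit on the classifier only. The sketch below is an illustration under stated assumptions, not a drop-in fix: it swaps DictVectorizer for FeatureHasher (which is stateless, so it needs no pass over the full data), and the toy feature dicts, the `batches` helper and the chunk size are all made up for the example.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

# Hypothetical stand-in for the output of transform_to_dataset:
# one feature dict per token, with a made-up binary label.
X = [{"word": "Thanks", "pos": "NN"}, {"word": "to", "pos": "IN"},
     {"word": "all", "pos": "DT"}, {"word": "who", "pos": "WP"},
     {"word": "You", "pos": "PRP"}, {"word": "will", "pos": "MD"},
     {"word": "remain", "pos": "VB"}, {"word": "as", "pos": "IN"}]
y = [1, 0, 0, 0, 1, 0, 1, 0]

# FeatureHasher needs no fitting, so every chunk can be transformed on its own.
hasher = FeatureHasher(n_features=2**10, input_type="dict")
clf = SGDClassifier(alpha=1e-4, penalty="l2", random_state=0)

def batches(X, y, size):
    """Yield consecutive (X, y) chunks of at most `size` samples."""
    for start in range(0, len(X), size):
        yield X[start:start + size], y[start:start + size]

# partial_fit must be told about every class on its first call.
classes = np.array([0, 1])
for X_chunk, y_chunk in batches(X, y, size=4):
    X_num = hasher.transform(X_chunk)  # dicts -> sparse numeric matrix
    clf.partial_fit(X_num, y_chunk, classes=classes)

predictions = clf.predict(hasher.transform(X))
```

With real data the chunks would be read from disk one at a time rather than sliced from a list that is already in memory.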

BenjaVR
  • 1
    No, even if `partial_fit()` is supported by the last estimator, it may not be supported by intermediate steps and then an error will be thrown anyways on the new data. Pipeline will not handle `partial_fit()` even if SGDClassifier is present in it. – Vivek Kumar May 10 '18 at 09:02
  • @VivekKumar I am not saying "if SGDClassifier is in the Pipeline it will work", I'm saying using the "SGDClassifier ...", as you can see in his second code snippet (no `Pipeline` there). But because it was not clear, I've added "without Pipeline" to my answer. – BenjaVR May 10 '18 at 09:03
  • I just can't understand why it gives that error although it runs totally fine when I use fit method. – kntgu May 10 '18 at 09:07
  • 2
    @kntgu You can put steps in a `Pipeline` that have the `partial_fit()` method, but also steps that do **not** have it. When you call `fit()` on a `Pipeline` object, it calls `fit()` on every step you've added, and you can only add steps that have a `fit()` method. Now imagine the `Pipeline` did support `partial_fit()` while only a few of its steps had it: the `Pipeline` would try to call the method on each step, but some steps simply do not support it. That's why the `Pipeline` itself does not support this method, hope it's clear now. – BenjaVR May 10 '18 at 09:12
1

I ran into the same problem: SGDClassifier inside a pipeline doesn't support incremental learning (i.e. the partial_fit method). There is a way to do incremental learning with sklearn that does not use partial_fit: warm_start. Several estimators in sklearn support warm_start, among them LogisticRegression and RandomForestClassifier.

Warm start is another way of doing incremental learning. Read here.
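For reference, a minimal sketch of the warm_start idea with RandomForestClassifier, on synthetic data invented for the example. With warm_start=True, raising n_estimators and calling fit again keeps the existing trees and grows only the new ones on the batch just passed in. This is not equivalent to partial_fit: the earlier trees are never updated, and each batch should contain every class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data split into two batches (made up for this sketch).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X1, y1 = X[:100], y[:100]
X2, y2 = X[100:], y[100:]

forest = RandomForestClassifier(n_estimators=10, warm_start=True,
                                random_state=0)

# First batch: grows the initial 10 trees.
forest.fit(X1, y1)
trees_after_first = len(forest.estimators_)

# Second batch: ask for 10 more trees; the first 10 are kept as they are.
forest.n_estimators += 10
forest.fit(X2, y2)
trees_after_second = len(forest.estimators_)

print(trees_after_first, trees_after_second)
```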

manish Prasad
0

Pipeline has no partial_fit attribute because many of the models that can be put in a pipeline do not implement partial_fit. My solution is to keep the steps in a dictionary rather than a Pipeline and save it with joblib.

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

from sklearn.linear_model import SGDClassifier
model=SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42)

tosave={
    "model":model,
    "count":count_vect,
    "tfid":tfidf_transformer,
}

import joblib
filename = 'package.sav'
joblib.dump(tosave, filename)

Then use

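import joblib
import numpy as np
filename = 'package.sav'
pack = joblib.load(filename)

# X must already be numeric (vectorized) features. The first partial_fit
# call needs the full set of classes; [0, 1] are the labels from the question.
pack['model'].partial_fit(X, Y, classes=np.array([0, 1]))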
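Put together, the save/load cycle could look like the sketch below. It is only an illustration under a few assumptions: HashingVectorizer is swapped in for CountVectorizer and TfidfTransformer because it is stateless (CountVectorizer has to be fitted to learn a vocabulary before it can transform anything), and the sentences, labels and temporary file location are invented for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Dictionary of steps instead of a Pipeline, as in the answer above.
pack = {
    "vectorizer": HashingVectorizer(n_features=2**12),  # stateless, no fit needed
    "model": SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                           random_state=42),
}
classes = np.array([0, 1])

# First batch (made-up sentences and labels).
texts_1 = ["thanks to all who volunteered", "you will remain as alternates"]
labels_1 = [1, 0]
pack["model"].partial_fit(pack["vectorizer"].transform(texts_1),
                          labels_1, classes=classes)

# Persist the partially trained package.
filename = os.path.join(tempfile.mkdtemp(), "package.sav")
joblib.dump(pack, filename)

# Later, possibly in another process: reload and continue training.
pack = joblib.load(filename)
texts_2 = ["thanks to the volunteers", "the alternates will remain"]
labels_2 = [1, 0]
pack["model"].partial_fit(pack["vectorizer"].transform(texts_2), labels_2)

predictions = pack["model"].predict(
    pack["vectorizer"].transform(texts_1 + texts_2))
```

Only the first partial_fit call needs the classes argument; the reloaded model already knows them.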

Imran Khan