Thanks in advance. I'm not sure whether I should open an issue, so I wanted to check first whether anyone has faced this before.

I'm running into the following problem when using a CalibratedClassifierCV for text classification. My estimator is a pipeline created this way (simple example):

# Import libraries first
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

# Now create the estimators: pipeline -> calibratedclassifier(pipeline)
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
calibrated_pipeline = CalibratedClassifierCV(pipeline, cv=2)

Now we can create a simple train set to check if the classifier works:

# Create text and labels arrays
text_array = np.array(['Why', 'is', 'this', 'happening'])
outputs = np.array([0,1,0,1])

When I try to fit the calibrated_pipeline object, I get this error:

ValueError: Found input variables with inconsistent numbers of samples: [1, 4]

If you want I can copy the whole exception trace, but this should be easily reproducible. Thanks a lot in advance!

EDIT: I made a mistake when creating the arrays; it is fixed now (thanks @ogrisel!). Also, calling:

pipeline.fit(text_array, outputs)

works properly, but doing so with the calibrated classifier fails!

  • You should always report the full traceback when reporting an error. It's very often the case that the answer to your question is there. – ogrisel Feb 02 '17 at 14:51

1 Answer

np.array(['Why', 'is', 'this', 'happening']).reshape(-1,1) is a 2D array of strings, whereas the docstring of the fit_transform method of the TfidfVectorizer class states that it expects:

    Parameters
    ----------
    raw_documents : iterable
        an iterable which yields either str, unicode or file objects

If you iterate over your 2D numpy array (text_array after the reshape), you get a sequence of 1D arrays of strings instead of the strings themselves:

>>> list(text_array)
[array(['Why'], 
      dtype='<U9'), array(['is'], 
      dtype='<U9'), array(['this'], 
      dtype='<U9'), array(['happening'], 
      dtype='<U9')]

So the fix is easy: just pass

text_documents = ['Why', 'is', 'this', 'happening']

as the raw input to the vectorizer.
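The docstring's requirement can be checked before fitting. Below is a minimal sketch, assuming a hypothetical helper named yields_only_strings (it is not part of scikit-learn); nested lists stand in for the reshaped 2D array so that the snippet runs without NumPy:

```python
def yields_only_strings(raw_documents):
    """Return True if iterating over raw_documents yields only str items.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    return all(isinstance(doc, str) for doc in raw_documents)


# Stand-in for the reshaped (4, 1) array: each item is a row, not a string.
column_of_docs = [['Why'], ['is'], ['this'], ['happening']]

# Flat sequence of strings, as recommended above.
flat_docs = [row[0] for row in column_of_docs]

print(yields_only_strings(column_of_docs))  # False
print(yields_only_strings(flat_docs))       # True
```

With an actual NumPy array, calling ravel() on the 2D array should give the equivalent flat 1D sequence.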

Edit (remark): LogisticRegression is almost always a well-calibrated classifier by default, so CalibratedClassifierCV will likely bring no improvement in this case.

ogrisel
  • Thanks a lot @ogrisel! It's true that logistic regression is generally well calibrated, but this was just an example; in my real application I need to use other classifiers and more pre-processing steps inside the pipeline (including custom functions). That aside, you are right, I made a mistake when reshaping the vector. However, after running `# Create text and labels arrays` `text_array = np.array(['Why', 'is', 'this', 'happening'])` `outputs = np.array([0,1,0,1])`, calling `fit` on the `pipeline` alone works, but calling it on the calibrated pipeline fails. – Iñigo Cortajarena Sauca Feb 02 '17 at 17:09
  • Also @ogrisel, calling fit with lists instead of arrays gives me an error too; it still works with the `pipeline` but fails with the `calibrated_pipeline`. The error says: `ValueError: Found input variables with inconsistent numbers of samples: [1, 4]`. Could this be the input shape expected by the estimator inside the calibrated object clashing with the iterable expected by the TF-IDF vectorizer? Thanks for the effort! Inigo. – Iñigo Cortajarena Sauca Feb 02 '17 at 17:23
  • Hum, I think this can be considered a bug in CalibratedClassifierCV: it should be less strict in its input validation (basically do no checks itself and delegate input checks to the underlying estimator). Feel free to open an issue on GitHub and submit a pull request. – ogrisel Feb 03 '17 at 10:18
  • Thanks a lot man! I will open an issue and a p.r. and try to solve it myself. Best regards, Inigo – Iñigo Cortajarena Sauca Feb 03 '17 at 11:48