
I'm experimenting with several sklearn classifiers in a Voting Classifier for ensembling.

To test, I have a dataframe with a set of columns that represent tool skills (a numerical value from 0 to 10 representing how well the person knows the skill) and a "Fit to Job" column that is the class variable. Example:

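import numpy as np
import pandas as pd

columns = ["Python", "Scikit-learn", "Pandas", "Fit to Job"]
total_mock_samples = 100
# mockResults (see EDIT 01 below) returns one row as a dict.
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first.
rows = [mockResults(columns, 'Fit to Job', good_values=i > total_mock_samples/2)
        for i in range(total_mock_samples)]
df = pd.DataFrame(rows, columns=columns)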

#Fills dataframe with mock data
#Output like:
print(np.array(df))
#[[1. 3. 6. 1.]
# [3. 2. 3. 0.]
# [1. 4. 0. 0.]
# ...
# [7. 8. 8. 1.]
# [8. 7. 9. 1.]]

Then I assemble my ensemble classifiers:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

X = np.array(df[df.columns[:-1]])
y = np.array(df[df.columns[-1]])
rfc = RandomForestClassifier(n_estimators=10)
svc = SVC(kernel='linear')
knn = KNeighborsClassifier(n_neighbors=5)
nb = GaussianNB()
lr = LinearRegression()

ensemble = VotingClassifier(estimators=[("Random forest", rfc), ("KNN",knn), ("Naive Bayes", nb), ("SVC",svc), ("Linear Reg.",lr)])

Finally, I try to evaluate it with cross-validation, like so:

cval_score = cross_val_score(ensemble, X, y, cv=10)

But I'm getting the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-13-f7c01fa872d2> in <module>
    182 ensemble = VotingClassifier(estimators=[("Random forest", rfc), ("KNN",knn), ("Naive Bayes", nb), ("SVC",svc), ("Linear Reg.",lr)])
    183 
--> 184 cval_score = cross_val_score(ensemble, X, y, cv=10)
[...]

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

I've checked other answers, but they all refer to numpy data conversions. The error is happening inside the cross validation phase. I tried to apply their solutions with no luck.

I've also attempted to change the data type before calculating the score, without success.

Maybe someone with a keener eye can see where the problem is.

EDIT 01: Mock results generator function

import random

def mockResults(columns, result_column_name='Fit', min_value=0, max_value=10, good_values=False):
    mock_res = {}
    for column in columns:
        if column == result_column_name:
            # Class label: 1.0 for a good fit, 0.0 otherwise
            mock_res[column] = 1.0 if good_values else 0.0
        elif good_values:
            # Good candidates score in the upper part of the range
            mock_res[column] = float(random.randrange(int(max_value*0.7), max_value))
        else:
            # Poor candidates score in the lower half of the range
            mock_res[column] = float(random.randrange(min_value, int(max_value*0.5)))
    return mock_res
Itamar Mushkin
Tiago Duque
  • Does your data have any `nan` values? If not, you can first try `df = df.astype(int)`. – Quang Hoang Sep 19 '19 at 13:49
  • @QuangHoang I tried that on example data without `nan` values, and it didn't work, even when casting X and y to type int with `.astype` – Itamar Mushkin Sep 19 '19 at 13:52
  • Your code works fine with `df = pd.DataFrame(np.random.randint(0,10,(100,10)))`. – Quang Hoang Sep 19 '19 at 13:53
  • This `randint` data does not produce a label (0/1) in the last column... – Itamar Mushkin Sep 19 '19 at 13:58
  • I'm building my example as a Dataframe. I printed it as a NP array just for the sake of simplicity. I have no NAN values, they are generated by a mock function that makes an equal 50% distribution to classes by giving good values to fit and bad values to not fit. The example is not a real classification attempt, but rather a test. – Tiago Duque Sep 19 '19 at 14:00
  • Please share your example fully and explicitly (or accept my edit, if it is an acceptable example). – Itamar Mushkin Sep 19 '19 at 14:02
  • I've just added the Mock results generator. Maybe you can understand it better. Also, why did you suggest a cv of 5? – Tiago Duque Sep 19 '19 at 14:04
  • Because I wanted a minimal reproducible example (https://stackoverflow.com/help/minimal-reproducible-example), so I made one manually - and I had to reduce cv because my data was smaller. – Itamar Mushkin Sep 19 '19 at 14:05
  • In the spirit of a minimal reproducible example - which is also part of the debugging process - know that the error reproduces on a simple `.predict`, not just on `cross_val_score` - so that's not the problem – Itamar Mushkin Sep 19 '19 at 14:08
  • I've changed the title to fit accordingly. I've also checked that: `cross_val_score` calls `.predict` in the background. – Tiago Duque Sep 19 '19 at 14:09
  • Yes, what I'm saying is that when you try to isolate the problem, you should narrow down the search, and edit the question accordingly. If the problem is with the more basic, underlying `.predict`, then show that in the question, not the `cross_val_score`. – Itamar Mushkin Sep 19 '19 at 14:13
  • @TiagoDuque if I remove LinearRegression from VotingClassifier() or I change it to LogisticRegression(), it works fine. – vb_rises Sep 19 '19 at 14:23
  • Also, when you get the error with LinearRegression and look at the stack trace, it points to the `np.bincount()` function. Maybe [this](https://stackoverflow.com/questions/49761807/an-typeerror-with-votingclassifier) solution could help. – vb_rises Sep 19 '19 at 14:30

1 Answer

df = pd.DataFrame(columns=["Python", "Scikit-learn", "Pandas", "Fit to Job"], data=np.random.randint(1, 10,size=(400,4)))    

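class LinearRegressionInt(LinearRegression):
    # Return integer predictions so that VotingClassifier can count them as class votes.
    def predict(self, X):
        predictions = self._decision_function(X)
        # Note: the int64 cast truncates toward zero; it does not round.
        return np.asarray(predictions, dtype=np.int64).ravel()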
... 
lr = LinearRegressionInt()
...

ensemble = VotingClassifier(estimators=[("lr",lr),("Random forest", rfc), ("KNN",knn), ("Naive Bayes", nb), ("SVC",svc)] )

cval_score = cross_val_score(ensemble, X, y, cv=10)
cval_score

array([ 0.09090909,  0.11904762,  0.17073171,  0.14634146,  0.17073171,
    0.15384615,  0.07692308,  0.15384615,  0.10810811,  0.08108108])
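To illustrate why the integer cast matters: as the comments under the question note, the stack trace points at `np.bincount`, which the voting step uses to tally class votes and which only accepts integer input. The snippet below is a minimal sketch of that failure using NumPy alone. The vote arrays are made-up values for a single sample, not output from the models above.

```python
import numpy as np

# Hypothetical votes for one sample (illustrative values only).
class_votes = np.array([1, 0, 1])                # integer labels from classifiers
mixed_votes = np.array([1, 0, 1, 0.73, 1.2])     # float outputs make the whole array float64

# Integer votes can be tallied directly: one vote for class 0, two for class 1.
counts = np.bincount(class_votes)

# Float votes are rejected because bincount will not cast float64 to int64 "safely".
raised_type_error = False
try:
    np.bincount(mixed_votes)
except TypeError as exc:
    raised_type_error = True
    print("bincount rejected float votes:", exc)

# Casting to int64 (what the subclass above does) makes the tally possible,
# but values are truncated toward zero: 0.73 becomes 0 and 1.2 becomes 1.
truncated_votes = mixed_votes.astype(np.int64)
winner = int(np.argmax(np.bincount(truncated_votes)))
print("winning class after truncation:", winner)
```

The cast removes the error, but the regressor's truncated outputs are only a rough stand-in for class labels. Note also that the scores near 0.1 shown above are consistent with chance level for this answer's data, since `np.random.randint(1, 10, ...)` fills the "Fit to Job" column with nine random label values.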

Reference: An Typeerror with VotingClassifier (https://stackoverflow.com/questions/49761807/an-typeerror-with-votingclassifier)
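An alternative suggested in the comments is to swap `LinearRegression` for `LogisticRegression`, which is a classifier and returns class labels directly. The following is a rough sketch of that variant, not the asker's exact setup: the synthetic `X` and `y` are stand-ins generated here to resemble the mock data (low skill values for class 0, high values for class 1), and the short estimator names and fixed random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the mock data (assumption): three skill columns,
# values 0-4 for "not fit" rows and 7-9 for "fit" rows.
rng = np.random.default_rng(0)
X_bad = rng.integers(0, 5, size=(50, 3))
X_good = rng.integers(7, 10, size=(50, 3))
X = np.vstack([X_bad, X_good]).astype(float)
y = np.array([0] * 50 + [1] * 50)   # integer class labels

# Every member is a classifier, so each one votes with a class label.
ensemble = VotingClassifier(estimators=[
    ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("nb", GaussianNB()),
    ("svc", SVC(kernel="linear")),
    ("logreg", LogisticRegression()),
])

scores = cross_val_score(ensemble, X, y, cv=10)
print("mean accuracy:", scores.mean())
```

With this variant no subclassing or casting is needed, because all predictions are already class labels.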

vb_rises