
I am trying to run the fit for my random forest, but I am getting the following error:

forest.fit(train[features], y)

returns

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-603415b5d9e6> in <module>()
----> 1 forest.fit(train[rubio_top_corr], y)

/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.pyc in fit(self, X, y, sample_weight)
    210         """
    211         # Validate or convert input data
--> 212         X = check_array(X, dtype=DTYPE, accept_sparse="csc")
    213         if issparse(X):
    214             # Pre-sort indices to avoid that each individual tree of the

/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    396                              % (array.ndim, estimator_name))
    397         if force_all_finite:
--> 398             _assert_all_finite(array)
    399 
    400     shape_repr = _shape_repr(array.shape)

/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
     52             and not np.isfinite(X).all()):
     53         raise ValueError("Input contains NaN, infinity"
---> 54                          " or a value too large for %r." % X.dtype)
     55 
     56 

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I have coerced my dataframe from float64 to float32 for my features and made sure that there are no nulls, so I'm not sure what's throwing this error. Let me know if it would be helpful to include more of my code.

UPDATE

It's originally a pandas dataframe, from which I dropped all NaNs. The original dataframe contains survey results with respondent information, and I dropped all questions except for my DV. I double-checked this by running rforest_df.isnull().sum(), which returned 0. This is the full code I used for the modeling.

rforest_df = qfav3_only
rforest_df[features] = rforest_df[features].astype(np.float32)
# random 75/25 train/test split
rforest_df['is_train'] = np.random.uniform(0, 1, len(rforest_df)) <= .75
train, test = rforest_df[rforest_df['is_train']], rforest_df[~rforest_df['is_train']]
forest = RFC(n_jobs=2, n_estimators=50)
y, _ = pd.factorize(train['K6_QFAV3'])
forest.fit(train[features], y)
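Worth noting: isnull() only flags NaN, not inf, so a NaN-only check can pass while check_array still rejects the input. A minimal sketch on a toy array (illustrative only, not my real features):

```python
import numpy as np

# Toy float32 array -- illustrative only, not the real feature matrix.
X = np.array([[1.0, 2.0], [np.inf, 4.0], [5.0, np.nan]], dtype=np.float32)

nan_count = int(np.isnan(X).sum())        # flags NaN only
bad_count = int((~np.isfinite(X)).sum())  # flags NaN AND +/-inf

print(nan_count)  # 1 -- only the NaN is caught
print(bad_count)  # 2 -- the inf slips past a NaN-only check
```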

UPDATE 2

This is what the y data looks like:

array([ 0,  1,  2,  3,  4,  3,  3,  5,  6,  7,  8,  7,  9,  6, 10,  6, 11,
        7, 11,  3,  7,  9,  6,  5,  9, 11, 12, 13,  6, 11,  3,  3,  6, 14,
       15,  0,  9,  9,  2,  0, 11,  3,  9,  4,  9,  7,  3,  4,  9, 12,  9,
        7,  6, 13,  6,  0,  0, 16,  6, 11,  4, 10, 11, 11, 17,  3,  6, 16,
        3,  4, 18, 19,  7, 11,  5, 11,  5,  4,  0,  6, 17,  7,  2,  3,  5,
       11,  8,  9, 18,  6,  9,  8,  5, 16, 20,  0,  4,  8, 13, 16,  3, 20,
        0,  5,  4,  2, 11,  0,  3,  0,  6,  6,  6,  9,  4,  6,  5, 11,  0,
       13,  6,  2, 11,  7,  5,  6, 18, 12, 21, 17,  3,  6,  0, 13, 21,  7,
        3,  2, 18, 22,  7,  3,  2,  6,  7,  8,  4,  0,  7, 12,  3,  7,  3,
        2, 11, 19, 11,  6,  2,  9,  3,  7,  9,  9,  5,  6,  8,  0, 18, 11,
        3, 12,  2,  6,  4,  7,  7, 11,  3,  6,  6,  0,  6, 12, 15,  3,  9,
        3,  3,  0,  5,  9,  7,  9, 11,  7,  3, 20,  0,  7,  6,  6, 23, 15,
       19,  0,  3,  6, 16, 13,  5,  6,  6,  3,  6, 11,  9,  0,  6, 23, 16,
        4,  0,  6, 17, 11, 17, 11,  4,  3, 13,  3, 17, 16, 11,  7,  4, 24,
        5,  2,  7,  7,  8,  3,  3, 11,  8,  7, 23,  7,  7, 11,  7, 11,  6,
       15,  3, 25,  7,  4,  5,  3, 17, 20,  3, 26,  7,  9,  6,  6, 17, 20,
        1,  0, 11,  9, 16, 20,  7,  7, 26,  3,  6, 20,  7,  2, 11,  7, 27,
        9,  4, 26, 28,  8,  6,  9, 19,  7, 29,  3,  2, 26, 30,  6, 31,  6,
       18,  3,  0, 18,  4,  7, 32,  0,  2,  8,  0,  5,  9,  4, 16,  6, 23,
        0,  7,  0,  7,  9,  6,  8,  3,  7,  9,  3,  3, 12, 11,  8, 19, 20,
        7,  3,  5, 11,  3, 11,  8,  4,  4,  6,  9,  4,  1,  3,  0,  9,  9,
        6,  7,  8, 33,  8,  7,  9, 34, 11, 11,  6,  9,  9, 17,  8, 19,  0,
        7,  4, 17,  6,  7,  0,  4, 12,  7,  6,  4, 16, 12,  9,  6,  6,  6,
        6, 26, 13,  9,  7,  2,  7,  3, 11,  3,  6,  7, 19,  4,  8,  9, 13,
       11, 15, 11,  4, 18,  7,  7,  7,  0,  5,  4,  6,  0,  3,  7,  4, 25,
       18,  6, 19,  7,  9,  4, 20,  6,  3,  7,  4, 35, 15, 11,  2, 12,  0,
        7, 32,  6, 18,  9,  9,  6,  2,  3, 19, 36, 32,  0,  7,  0,  9, 37,
        3,  5,  6,  5, 34,  2,  6,  0,  7,  0,  7,  3,  7,  4, 18, 18,  7,
        3,  7, 16,  9, 19, 13,  4, 16, 19,  3, 19, 38,  9,  4,  9,  8,  0,
       17,  0,  2,  3,  5,  6,  5, 11, 11,  2,  9,  5, 33,  9,  5,  6, 20,
       13,  3, 39, 13,  7,  0,  9,  0,  4,  6,  7, 16,  7,  0, 21,  5,  3,
       18,  5, 20,  2,  2, 14,  6, 17, 11, 11, 16, 16,  9,  8, 11,  3, 23,
        0, 11,  0,  6,  0,  0,  3, 16,  6,  7,  5,  9,  7, 13,  0, 20,  0,
       25,  6, 16,  8,  4,  4,  2,  8,  7,  5, 40,  3,  8,  5, 12,  8,  9,
        6,  6,  6,  6,  3,  7, 26,  4,  0, 13,  4,  3, 13, 12,  7,  7,  6,
        7, 19, 15,  0, 33,  4,  5,  5, 20,  3, 11,  5,  4,  7,  9,  7, 11,
       36,  9,  0,  6,  6, 11,  6,  4,  2,  5, 18,  8,  5,  5,  2, 25,  4,
       41,  7,  7,  5,  7,  3, 36, 11,  6,  9,  0,  9,  0, 16, 42, 11, 11,
       18,  9,  5, 36,  2,  9,  6,  3, 43,  9, 17, 13,  5,  9,  3,  4,  6,
       44, 37,  0, 45,  2, 18,  8, 46,  2, 12,  9,  9,  3, 16,  6, 12,  9,
        0, 11, 11,  0, 25,  8, 17,  4,  4,  3, 11,  3, 11,  6,  6,  9,  7,
       23,  0,  2,  0,  3,  3,  4,  4,  9,  5, 11, 16,  7,  3, 18, 11,  7,
        6,  6,  6,  5,  9,  6,  3,  9,  7, 17, 11,  4,  9,  2,  3,  0, 26,
        9,  0, 20,  8,  9,  6, 11,  6,  6,  7, 26,  6,  6,  4, 19,  5, 41,
       19, 18, 29,  6,  5, 13,  6, 11,  7,  7,  6,  8,  5,  0,  3, 13, 17,
        6, 20, 11,  6,  9,  6,  2,  7, 11,  9, 20, 12,  7,  6,  8,  7,  4,
        6,  2,  0,  7,  9, 26,  9, 16,  7,  4, 45,  7,  0, 23,  8,  4, 19,
        4, 26, 11,  4,  4,  5,  7,  3,  0, 29, 12,  3,  4, 11,  4, 12,  8,
        7,  5,  0, 47, 12,  0, 25,  6, 16, 20,  5,  8,  4,  4, 11, 12,  0,
        6,  3, 11,  4,  3, 48,  3,  6,  7,  4,  7,  0,  3,  7,  3, 18,  6,
        2,  9,  9, 11,  3,  9,  6, 18, 16,  6, 34,  2,  7,  4,  3, 45,  5,
        0,  7,  2, 17, 17,  9, 18,  5,  6,  5, 15,  5,  7,  6,  9,  0,  7,
       12, 17])
rontho1992
  • Please provide a [MCVE]. Thank you! – jkalden Jan 13 '16 at 16:28
  • Can you add information about the structure of train[features]. I'm assuming it's an n_samples by n_features 2D numpy array which is required by RF. – benj Jan 13 '16 at 16:55
    Also the last line of the error gives you a clue to the problem. Does your input feature vector contain NaNs? – benj Jan 13 '16 at 16:56
  • It doesn't; I double-checked using `isnull().sum()`, and I coerced the dataframe from float64 to float32 due to the error raised. I know it doesn't contain infinity either. Unsure why this error is being thrown. – rontho1992 Jan 13 '16 at 19:55
  • @benj yes it is, it's a subset of a dataframe. – rontho1992 Jan 13 '16 at 20:08
  • Have you tried using [DataFrame.fillna(0)](http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.fillna.html) to test if that fixes the error? – Charlie Haley Jan 13 '16 at 21:39
  • Just tried that, no luck – rontho1992 Jan 13 '16 at 22:25
  • A few complete shots in the dark for troubleshooting purposes: (1) check your column dtypes, (2) check `.isnull()` and `np.isinf` _after_ you force type conversion (you may already be doing this), (3) try passing your features as a numpy array instead of a DataFrame. – Patrick O'Connor Jan 13 '16 at 23:54
  • Can you show us what the X looks like? I am also confused by your notation: at the top of the page you have train[features], and later on you have train[rubio_top_corr]. – Hemanth Kondapalli Feb 18 '16 at 02:42

2 Answers


I would first recommend checking the data type of each column in your train[features] DataFrame:

print train[features].dtypes 

If you see columns that are non-numeric, inspect them to make sure they don't contain unexpected values (e.g. strings, NaNs) that would cause problems. If you don't mind dropping the non-numeric columns, you can select all of the numeric columns with the following:

numeric_cols = train[features].select_dtypes(include=['float64', 'float32']).columns

You can also add in columns with int dtypes if you would like.
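For instance, a minimal sketch on a toy frame (column names and values are illustrative) showing how select_dtypes picks up float and int columns while skipping an object column:

```python
import numpy as np
import pandas as pd

# Toy frame mixing dtypes -- column names are illustrative.
df = pd.DataFrame({
    'a': np.array([1.0, 2.0], dtype=np.float32),
    'b': [3.0, 4.0],        # float64
    'c': [5, 6],            # int64
    'd': ['x', 'y'],        # object -- this is what makes fit() fail
})

numeric_cols = df.select_dtypes(include=['float32', 'float64', 'int64']).columns
print(list(numeric_cols))   # ['a', 'b', 'c'] -- 'd' is excluded
```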

If you are encountering values that are too large or too small for the model to handle, it is a sign that scaling the data is a good idea. In sklearn, this can be accomplished as follows:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1), copy=True).fit(train[features])
train[features] = scaler.transform(train[features])

Finally, you should consider imputing missing values with sklearn's Imputer, or filling NaNs with something like the following:

train[features].fillna(0, inplace=True)
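As a rough sketch of the two options on a toy frame (column names and values are illustrative): a constant fill versus a per-column mean fill, the latter being what sklearn's Imputer does by default:

```python
import numpy as np
import pandas as pd

# Toy feature frame with a missing value -- names are illustrative.
train = pd.DataFrame({'f1': [1.0, np.nan, 3.0], 'f2': [4.0, 5.0, 6.0]})

filled_zero = train.fillna(0)             # constant fill, as above
filled_mean = train.fillna(train.mean())  # per-column mean imputation

print(filled_zero['f1'].tolist())   # [1.0, 0.0, 3.0]
print(filled_mean['f1'].tolist())   # [1.0, 2.0, 3.0]
```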
Dirigo

This can happen when you have null strings (like '') in the dataset. To spot them, try printing something like

train[col].value_counts()

or even

sorted(list(set(...)))

or get the min and max values for each column of the dataset in a loop.
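As a minimal sketch (toy Series, illustrative values) of why an empty string slips past a null check, and how to surface and convert it:

```python
import pandas as pd

# An empty string is NOT null, so isnull() sails right past it.
col = pd.Series(['1.2', '', '3.4'])
print(col.isnull().sum())            # 0 -- the '' passes the null check

# value_counts() exposes the stray empty string.
print('' in col.value_counts().index)   # True

# errors='coerce' turns non-numeric entries into real NaNs you can drop/fill.
print(pd.to_numeric(col, errors='coerce').isnull().sum())   # 1
```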

The MinMaxScaler example above may work, but note that scaling features is generally unnecessary for a random forest, since tree-based models are insensitive to monotonic rescaling.

avchauzov