I am trying to run the fit for my random forest, but I am getting the following error:
forest.fit(train[features], y)
returns
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-41-603415b5d9e6> in <module>()
----> 1 forest.fit(train[rubio_top_corr], y)
/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.pyc in fit(self, X, y, sample_weight)
210 """
211 # Validate or convert input data
--> 212 X = check_array(X, dtype=DTYPE, accept_sparse="csc")
213 if issparse(X):
214 # Pre-sort indices to avoid that each individual tree of the
/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
396 % (array.ndim, estimator_name))
397 if force_all_finite:
--> 398 _assert_all_finite(array)
399
400 shape_repr = _shape_repr(array.shape)
/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
52 and not np.isfinite(X).all()):
53 raise ValueError("Input contains NaN, infinity"
---> 54 " or a value too large for %r." % X.dtype)
55
56
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
I have coerced the feature columns of my dataframe from float64 to float32 and made sure that there are no nulls, so I'm not sure what is throwing this error. Let me know if it would be helpful to post more of my code.
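One thing that may be worth ruling out, sketched here with made-up values rather than the survey data: casting float64 to float32 turns any magnitude above the float32 maximum (about 3.4e38) into inf, and inf is not a null. The array `values64` below is purely hypothetical.

```python
import numpy as np

# Hypothetical values for illustration; not taken from the survey data.
values64 = np.array([1.0, 3.0e38, 1.0e39], dtype=np.float64)

# Largest finite value a float32 can hold (about 3.4e38).
float32_max = float(np.finfo(np.float32).max)

# Flag entries that would not fit in float32 *before* casting.
too_large = np.abs(values64) > float32_max

# The cast itself maps out-of-range values to inf; newer NumPy versions
# may emit an overflow RuntimeWarning, which is silenced here.
with np.errstate(over="ignore"):
    values32 = values64.astype(np.float32)

print(too_large.tolist())              # -> [False, False, True]
print(np.isfinite(values32).tolist())  # -> [True, True, False]
```

If a check like this flags nothing on the real features, overflow from the cast is probably not the cause.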
UPDATE
It was originally a pandas dataframe, from which I dropped all NaNs. The original dataframe holds survey results with respondent information, and I dropped all questions except for my dependent variable (DV). I double-checked this by running rforest_df.isnull().sum()
which returned 0. This is the full code I used for the modeling:
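import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as RFC  # assumed meaning of RFC
rforest_df = qfav3_only
rforest_df[features] = rforest_df[features].astype(np.float32)
# Randomly assign roughly 75% of the rows to the training set
rforest_df['is_train'] = np.random.uniform(0, 1, len(rforest_df)) <= .75
train, test = rforest_df[rforest_df['is_train']==True], rforest_df[rforest_df['is_train']==False]
forest = RFC(n_jobs=2, n_estimators=50)
# Encode the target column as integer class labels
y, _ = pd.factorize(train['K6_QFAV3'])
forest.fit(train[features], y)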
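As a further check on the features, here is a minimal sketch using a hypothetical toy dataframe (`toy` and the column names q1 to q3 are invented for illustration). By default isnull() counts only NaN/None, so a column containing inf still reports 0 nulls, whereas np.isfinite flags NaN and infinities alike.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are made up for illustration.
toy = pd.DataFrame({
    "q1": [1.0, 2.0, 3.0],
    "q2": [0.5, np.inf, 1.5],
    "q3": [4.0, 5.0, -np.inf],
})

# isnull() treats inf as a valid value by default, so nothing is counted.
null_count = int(toy.isnull().sum().sum())

# np.isfinite is False for NaN, inf and -inf.
finite_mask = np.isfinite(toy.to_numpy())

# Names of the columns holding at least one non-finite value.
bad_columns = toy.columns[~finite_mask.all(axis=0)].tolist()

print(null_count)   # -> 0
print(bad_columns)  # -> ['q2', 'q3']
```

The same pattern could be applied to train[features] to see which columns, if any, contain non-finite values.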
UPDATE
This is what the y data looks like:
array([ 0, 1, 2, 3, 4, 3, 3, 5, 6, 7, 8, 7, 9, 6, 10, 6, 11,
7, 11, 3, 7, 9, 6, 5, 9, 11, 12, 13, 6, 11, 3, 3, 6, 14,
15, 0, 9, 9, 2, 0, 11, 3, 9, 4, 9, 7, 3, 4, 9, 12, 9,
7, 6, 13, 6, 0, 0, 16, 6, 11, 4, 10, 11, 11, 17, 3, 6, 16,
3, 4, 18, 19, 7, 11, 5, 11, 5, 4, 0, 6, 17, 7, 2, 3, 5,
11, 8, 9, 18, 6, 9, 8, 5, 16, 20, 0, 4, 8, 13, 16, 3, 20,
0, 5, 4, 2, 11, 0, 3, 0, 6, 6, 6, 9, 4, 6, 5, 11, 0,
13, 6, 2, 11, 7, 5, 6, 18, 12, 21, 17, 3, 6, 0, 13, 21, 7,
3, 2, 18, 22, 7, 3, 2, 6, 7, 8, 4, 0, 7, 12, 3, 7, 3,
2, 11, 19, 11, 6, 2, 9, 3, 7, 9, 9, 5, 6, 8, 0, 18, 11,
3, 12, 2, 6, 4, 7, 7, 11, 3, 6, 6, 0, 6, 12, 15, 3, 9,
3, 3, 0, 5, 9, 7, 9, 11, 7, 3, 20, 0, 7, 6, 6, 23, 15,
19, 0, 3, 6, 16, 13, 5, 6, 6, 3, 6, 11, 9, 0, 6, 23, 16,
4, 0, 6, 17, 11, 17, 11, 4, 3, 13, 3, 17, 16, 11, 7, 4, 24,
5, 2, 7, 7, 8, 3, 3, 11, 8, 7, 23, 7, 7, 11, 7, 11, 6,
15, 3, 25, 7, 4, 5, 3, 17, 20, 3, 26, 7, 9, 6, 6, 17, 20,
1, 0, 11, 9, 16, 20, 7, 7, 26, 3, 6, 20, 7, 2, 11, 7, 27,
9, 4, 26, 28, 8, 6, 9, 19, 7, 29, 3, 2, 26, 30, 6, 31, 6,
18, 3, 0, 18, 4, 7, 32, 0, 2, 8, 0, 5, 9, 4, 16, 6, 23,
0, 7, 0, 7, 9, 6, 8, 3, 7, 9, 3, 3, 12, 11, 8, 19, 20,
7, 3, 5, 11, 3, 11, 8, 4, 4, 6, 9, 4, 1, 3, 0, 9, 9,
6, 7, 8, 33, 8, 7, 9, 34, 11, 11, 6, 9, 9, 17, 8, 19, 0,
7, 4, 17, 6, 7, 0, 4, 12, 7, 6, 4, 16, 12, 9, 6, 6, 6,
6, 26, 13, 9, 7, 2, 7, 3, 11, 3, 6, 7, 19, 4, 8, 9, 13,
11, 15, 11, 4, 18, 7, 7, 7, 0, 5, 4, 6, 0, 3, 7, 4, 25,
18, 6, 19, 7, 9, 4, 20, 6, 3, 7, 4, 35, 15, 11, 2, 12, 0,
7, 32, 6, 18, 9, 9, 6, 2, 3, 19, 36, 32, 0, 7, 0, 9, 37,
3, 5, 6, 5, 34, 2, 6, 0, 7, 0, 7, 3, 7, 4, 18, 18, 7,
3, 7, 16, 9, 19, 13, 4, 16, 19, 3, 19, 38, 9, 4, 9, 8, 0,
17, 0, 2, 3, 5, 6, 5, 11, 11, 2, 9, 5, 33, 9, 5, 6, 20,
13, 3, 39, 13, 7, 0, 9, 0, 4, 6, 7, 16, 7, 0, 21, 5, 3,
18, 5, 20, 2, 2, 14, 6, 17, 11, 11, 16, 16, 9, 8, 11, 3, 23,
0, 11, 0, 6, 0, 0, 3, 16, 6, 7, 5, 9, 7, 13, 0, 20, 0,
25, 6, 16, 8, 4, 4, 2, 8, 7, 5, 40, 3, 8, 5, 12, 8, 9,
6, 6, 6, 6, 3, 7, 26, 4, 0, 13, 4, 3, 13, 12, 7, 7, 6,
7, 19, 15, 0, 33, 4, 5, 5, 20, 3, 11, 5, 4, 7, 9, 7, 11,
36, 9, 0, 6, 6, 11, 6, 4, 2, 5, 18, 8, 5, 5, 2, 25, 4,
41, 7, 7, 5, 7, 3, 36, 11, 6, 9, 0, 9, 0, 16, 42, 11, 11,
18, 9, 5, 36, 2, 9, 6, 3, 43, 9, 17, 13, 5, 9, 3, 4, 6,
44, 37, 0, 45, 2, 18, 8, 46, 2, 12, 9, 9, 3, 16, 6, 12, 9,
0, 11, 11, 0, 25, 8, 17, 4, 4, 3, 11, 3, 11, 6, 6, 9, 7,
23, 0, 2, 0, 3, 3, 4, 4, 9, 5, 11, 16, 7, 3, 18, 11, 7,
6, 6, 6, 5, 9, 6, 3, 9, 7, 17, 11, 4, 9, 2, 3, 0, 26,
9, 0, 20, 8, 9, 6, 11, 6, 6, 7, 26, 6, 6, 4, 19, 5, 41,
19, 18, 29, 6, 5, 13, 6, 11, 7, 7, 6, 8, 5, 0, 3, 13, 17,
6, 20, 11, 6, 9, 6, 2, 7, 11, 9, 20, 12, 7, 6, 8, 7, 4,
6, 2, 0, 7, 9, 26, 9, 16, 7, 4, 45, 7, 0, 23, 8, 4, 19,
4, 26, 11, 4, 4, 5, 7, 3, 0, 29, 12, 3, 4, 11, 4, 12, 8,
7, 5, 0, 47, 12, 0, 25, 6, 16, 20, 5, 8, 4, 4, 11, 12, 0,
6, 3, 11, 4, 3, 48, 3, 6, 7, 4, 7, 0, 3, 7, 3, 18, 6,
2, 9, 9, 11, 3, 9, 6, 18, 16, 6, 34, 2, 7, 4, 3, 45, 5,
0, 7, 2, 17, 17, 9, 18, 5, 6, 5, 15, 5, 7, 6, 9, 0, 7,
12, 17])