I'm pretty new to machine learning, and I'm trying something experimental on a public wine dataset. I'm getting an error and I can't find a solution.
Here is what I'm trying to do with my model:
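import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
# data_all is the wine DataFrame loaded earlier
X = data_all[['country', 'description', 'price', 'province', 'variety']]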
y = data_all['points']
# Vectorizing Description column (text analysis)
vectorizerDesc = CountVectorizer()
descriptions = X['description']
vectorizerDesc.fit(descriptions)
vectorizedDesc = vectorizerDesc.transform(X['description'])
X['description'] = vectorizedDesc
# Categorizing other string columns
X = pd.get_dummies(X, columns=['country', 'province', 'variety'])
# Generating train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
# Multinomial Naive Bayes
nb = MultinomialNB()
nb.fit(X_train, y_train)
Here's what X looks like just before calling train_test_split:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 83945 entries, 25 to 150929
Columns: 837 entries, description to variety_Zweigelt
dtypes: float64(1), object(1), uint8(835)
The last line (nb.fit) gives me an error:
ValueError Traceback (most recent call last)
<ipython-input-197-9d40e4624ff6> in <module>()
3 # Multinomial Naive Bayes is a specialised version of Naive Bayes designed more for text documents
4 nb = MultinomialNB()
----> 5 nb.fit(X_train, y_train)
/opt/conda/lib/python3.6/site-packages/sklearn/naive_bayes.py in fit(self, X, y, sample_weight)
577 Returns self.
578 """
--> 579 X, y = check_X_y(X, y, 'csr')
580 _, n_features = X.shape
581
/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
571 X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
572 ensure_2d, allow_nd, ensure_min_samples,
--> 573 ensure_min_features, warn_on_dtype, estimator)
574 if multi_output:
575 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
446 # make sure we actually converted to numeric:
447 if dtype_numeric and array.dtype.kind == "O":
--> 448 array = array.astype(np.float64)
449 if not allow_nd and array.ndim >= 3:
450 raise ValueError("Found array with dim %d. %s expected <= 2."
ValueError: setting an array element with a sequence.
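The traceback stops at array.astype(np.float64) on an object-dtype array, which matches the single object column reported for X. The sketch below is only an illustration of that failure mode, assuming the object column holds non-scalar values in its cells; the `column` array is a hypothetical stand-in, not the real description column.

```python
import numpy as np

# Hypothetical stand-in for an object-dtype column whose cells hold
# sequences instead of single numbers (an assumption about the real data).
column = np.empty(2, dtype=object)
column[0] = [1, 0, 2]
column[1] = [0, 3]

try:
    # Same kind of conversion that check_array attempts in the traceback
    column.astype(np.float64)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If this is what is happening, the cells of the description column cannot each be turned into one float, so the conversion fails before the model sees any data.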
How can I combine the vectorized text features with the other columns (country, province, variety and price) in a single Multinomial NB model?
Thank you in advance :)
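For reference, here is a minimal sketch of one possible way to combine the two kinds of features: keep the CountVectorizer output as its own sparse matrix instead of assigning it to a DataFrame column, then stack it next to the dummy columns with scipy.sparse.hstack. The small data_all frame is made up for illustration and is not the real wine dataset; the names desc_counts, other and X_combined are also just placeholders.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-in for the wine data (assumption, not the real dataset)
data_all = pd.DataFrame({
    "country": ["US", "France", "US", "Italy", "France", "Italy"],
    "description": [
        "ripe cherry and oak",
        "crisp apple with mineral notes",
        "oak and vanilla",
        "bright cherry and spice",
        "mineral and citrus",
        "spice and dark fruit",
    ],
    "price": [20.0, 35.0, 15.0, 28.0, 40.0, 22.0],
    "province": ["California", "Burgundy", "Oregon",
                 "Tuscany", "Loire", "Piedmont"],
    "variety": ["Pinot Noir", "Chardonnay", "Pinot Noir",
                "Sangiovese", "Chenin Blanc", "Nebbiolo"],
    "points": [88, 91, 85, 90, 92, 89],
})

X = data_all[["country", "description", "price", "province", "variety"]]
y = data_all["points"]

# Text features: a sparse (n_rows x vocabulary_size) count matrix
vectorizerDesc = CountVectorizer()
desc_counts = vectorizerDesc.fit_transform(X["description"])

# Remaining columns: one-hot encode the strings, keep price as a number
other = pd.get_dummies(X.drop(columns=["description"]),
                       columns=["country", "province", "variety"])
other_sparse = csr_matrix(other.to_numpy(dtype="float64"))

# Stack both blocks side by side into one sparse feature matrix
X_combined = hstack([desc_counts, other_sparse]).tocsr()

X_train, X_test, y_train, y_test = train_test_split(
    X_combined, y, test_size=0.3, random_state=101)

nb = MultinomialNB()
nb.fit(X_train, y_train)
predictions = nb.predict(X_test)
```

Two caveats on this sketch: MultinomialNB expects non-negative features and is designed for counts, so a continuous column such as price may need binning or a different model, and rows with missing values would have to be handled before stacking.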