I'm pretty new to Machine Learning, and I'm trying something experimental on a public wine dataset. I'm ending up with an error and I can't find a solution.

Here is what I'm trying to do with my model:

X = data_all[['country', 'description', 'price', 'province', 'variety']]
y = data_all['points']

# Vectorizing Description column (text analysis)
vectorizerDesc = CountVectorizer()
descriptions = X['description']
vectorizerDesc.fit(descriptions)
vectorizedDesc = vectorizerDesc.transform(X['description'])
X['description'] = vectorizedDesc

# Categorizing other string columns
X = pd.get_dummies(X, columns=['country', 'province', 'variety'])

# Generating train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# Multinomial Naive Bayes
nb = MultinomialNB()
nb.fit(X_train, y_train)

Here's what X looks like just before calling train_test_split:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 83945 entries, 25 to 150929
Columns: 837 entries, description to variety_Zweigelt
dtypes: float64(1), object(1), uint8(835)

The last line (nb.fit) gives me an error:

ValueError                                Traceback (most recent call last)
<ipython-input-197-9d40e4624ff6> in <module>()
      3 # Multinomial Naive Bayes is a specialised version of Naive Bayes designed more for text documents
      4 nb = MultinomialNB()
----> 5 nb.fit(X_train, y_train)

/opt/conda/lib/python3.6/site-packages/sklearn/naive_bayes.py in fit(self, X, y, sample_weight)
    577             Returns self.
    578         """
--> 579         X, y = check_X_y(X, y, 'csr')
    580         _, n_features = X.shape
    581 

/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    571     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    572                     ensure_2d, allow_nd, ensure_min_samples,
--> 573                     ensure_min_features, warn_on_dtype, estimator)
    574     if multi_output:
    575         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    446         # make sure we actually converted to numeric:
    447         if dtype_numeric and array.dtype.kind == "O":
--> 448             array = array.astype(np.float64)
    449         if not allow_nd and array.ndim >= 3:
    450             raise ValueError("Found array with dim %d. %s expected <= 2."

ValueError: setting an array element with a sequence.

Would you know how I could combine my vectorized text features with the other columns (country, province, variety, price) in a Multinomial NB model?

Thank you in advance :)

Olivier.G
  • `vectorizer.transform()` returns a sparse matrix, which is not handled the way you expect when you assign it to a DataFrame column. Inspect the resulting values to see this. `vectorizedDesc` is not a single column that can be assigned to one pandas column; it is a 2-D array and needs multiple columns. – Vivek Kumar Apr 30 '18 at 07:08
  • Ohhh I see. Thank you! Let me try toarray() or todense() on vectorizedDesc. I've also heard that scipy.sparse.hstack is better, memory-wise, for combining everything into one matrix. – Olivier.G Apr 30 '18 at 16:07
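
The scipy.sparse.hstack idea from the comments could look roughly like the sketch below. It is only an illustration under assumptions: the tiny `data_all` frame is made up to stand in for the wine dataset, and `desc_counts`, `other` and `X_all` are hypothetical names. It keeps the text counts sparse, one-hot encodes the categorical columns separately, and joins the two blocks column-wise before fitting. Note that MultinomialNB expects non-negative features, so `price` must not contain negative or missing values.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-in for data_all (assumption: not the real wine data).
data_all = pd.DataFrame({
    "country": ["US", "France", "US", "Italy", "France", "Italy"],
    "description": [
        "ripe cherry and oak",
        "crisp apple with mineral notes",
        "jammy blackberry and vanilla",
        "tart cherry and leather",
        "bright citrus and mineral finish",
        "earthy plum with firm tannins",
    ],
    "price": [20.0, 35.0, 15.0, 40.0, 25.0, 30.0],
    "province": ["California", "Burgundy", "Oregon", "Tuscany", "Loire", "Piedmont"],
    "variety": ["Pinot Noir", "Chardonnay", "Zinfandel", "Sangiovese",
                "Sauvignon Blanc", "Nebbiolo"],
    "points": [88, 91, 85, 90, 89, 92],
})

X = data_all[["country", "description", "price", "province", "variety"]]
y = data_all["points"]

# Bag-of-words counts: a sparse matrix with one column per vocabulary term.
vectorizerDesc = CountVectorizer()
desc_counts = vectorizerDesc.fit_transform(X["description"])

# One-hot encode the categorical columns; price stays as a numeric column.
other = pd.get_dummies(
    X[["country", "province", "variety", "price"]],
    columns=["country", "province", "variety"],
)

# Join both blocks side by side without densifying the text counts.
other_sparse = csr_matrix(other.astype("float64").to_numpy())
X_all = hstack([desc_counts, other_sparse], format="csr")

# Sparse matrices can be split and passed to MultinomialNB directly.
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y, test_size=0.3, random_state=101
)
nb = MultinomialNB()
nb.fit(X_train, y_train)
predictions = nb.predict(X_test)
```

Calling toarray() or todense() on the counts would also work on small data, but with tens of thousands of descriptions the dense matrix can become very large, which is why the sketch stays sparse throughout.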

0 Answers