I'm trying to classify records with multiple text features into a state. The data consists of messages (errors and warnings) from different servers and their components, and each message maps to a state. For example:
ServerName    Name    Description                              Severity  State
------------  ------  ---------------------------------------  --------  -----------
QWERT-XY-123  MySQL   Service not available on target machine  error     important
QWERT-XY-146  Oracle  Service caused an error                  warning   unimportant
...
This is the vectorizing part:
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer()
X_Servername = df["ServerName"].values
X_Name = df["Name"].values
X_Description = df["Description"].values
X_Severity = df["Severity"].values
y = df["State"].values
X_Servername = vectorizer.transform(X_Servername)
X_Name = vectorizer.transform(X_Name)
X_Description = vectorizer.transform(X_Description)
features=list(zip(X_Servername,X_Name,X_Description,X_Severity))
Now I want to fit the model:
from sklearn.svm import SVC
model = SVC(kernel = "linear", probability=True)
model.fit(features, y)
And the result is the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-183-71455dd49f0b> in <module>()
2
3 model = SVC(kernel = "linear", probability=True)
----> 4 model.fit(features, y)
5
6 #print(model.score(X_test, y))
D:\Enviroment\Anaconda3\lib\site-packages\sklearn\svm\base.py in fit(self, X, y, sample_weight)
147 self._sparse = sparse and not callable(self.kernel)
148
--> 149 X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
150 y = self._validate_targets(y)
151
D:\Enviroment\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
571 X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
572 ensure_2d, allow_nd, ensure_min_samples,
--> 573 ensure_min_features, warn_on_dtype, estimator)
574 if multi_output:
575 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
D:\Enviroment\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
431 force_all_finite)
432 else:
--> 433 array = np.array(array, dtype=dtype, order=order, copy=copy)
434
435 if ensure_2d:
ValueError: setting an array element with a sequence.
So my question is: how can I use multiple features with the HashingVectorizer, or is the only way to put all features into one line?
Thanks for your help.
Update
The mistake was in how I built the vectorized feature list. Instead of:
features=list(zip(X_Servername,X_Name,X_Description,X_Severity))
I now use this function, where extracted is a list containing all the vectorized values (X_Servername, X_Name, ...):
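import numpy as np
from scipy import sparse

def combine(extracted):
    # Stack the per-column feature blocks side by side, one row per sample.
    if any(sparse.issparse(fea) for fea in extracted):
        # At least one block is sparse: stack sparsely, then convert to dense.
        stacked = sparse.hstack(extracted).tocsr()
        stacked = stacked.toarray()
    else:
        stacked = np.hstack(extracted)
    return stacked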
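
For anyone who wants to see the whole flow in one place, here is a minimal end-to-end sketch of the stacking idea. It rests on a few assumptions that are not in the question: the DataFrame is a toy built by repeating the two example rows, n_features is reduced to 2**10 only to keep the matrices small, Severity is hashed like the other columns, and probability=True is left out because its internal cross-validation needs more samples per class than this toy data has. SVC accepts a sparse CSR matrix, so this sketch skips the toarray() step.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import SVC

# Toy data: the two example rows from the question, repeated (assumption).
df = pd.DataFrame({
    "ServerName": ["QWERT-XY-123", "QWERT-XY-146"] * 2,
    "Name": ["MySQL", "Oracle"] * 2,
    "Description": ["Service not available on target machine",
                    "Service caused an error"] * 2,
    "Severity": ["error", "warning"] * 2,
    "State": ["important", "unimportant"] * 2,
})

# HashingVectorizer is stateless, so transform() works without fit().
# n_features=2**10 is an arbitrary small value chosen for this sketch.
vectorizer = HashingVectorizer(n_features=2**10)

# Vectorize every text column, including Severity, into its own sparse block.
columns = ["ServerName", "Name", "Description", "Severity"]
blocks = [vectorizer.transform(df[col].values) for col in columns]

# Place the blocks side by side: one row per sample, one column range per feature.
features = sparse.hstack(blocks).tocsr()

y = df["State"].values
model = SVC(kernel="linear")
model.fit(features, y)

predictions = model.predict(features)
```

Each column keeps its own range of hashed features here, so the same token in ServerName and in Description does not collide the way it would if all columns were joined into one string first.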