
I'm trying to classify multiple text features into a status. The data consists of messages (errors and warnings) from different servers, together with the affected components, and each record maps to a state. For example:

ServerName     Name     Description                               Severity   State
-------------- -------- ----------------------------------------- ---------- -------------
QWERT-XY-123   MySQL    Service not available on target machine   error      important
QWERT-XY-146   Oracle   Service caused an error                   warning    unimportant
...    

This is a part of the vectorizing:

from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer()

X_Servername = df["ServerName"].values
X_Name = df["Name"].values
X_Description = df["Description"].values
X_Severity = df["Severity"].values
y = df["State"].values

X_Servername = vectorizer.transform(X_Servername)
X_Name = vectorizer.transform(X_Name)
X_Description = vectorizer.transform(X_Description)

features=list(zip(X_Servername,X_Name,X_Description,X_Severity))

Now I want to fit the Model:

from sklearn.svm import SVC

model = SVC(kernel = "linear", probability=True)
model.fit(features, y)

And the result is the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-183-71455dd49f0b> in <module>()
  2 
  3 model = SVC(kernel = "linear", probability=True)
----> 4 model.fit(features, y)
  5 
  6 #print(model.score(X_test, y))

D:\Enviroment\Anaconda3\lib\site-packages\sklearn\svm\base.py in fit(self, X, y, sample_weight)
147         self._sparse = sparse and not callable(self.kernel)
148 
--> 149         X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
150         y = self._validate_targets(y)
151 

D:\Enviroment\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
571     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
572                     ensure_2d, allow_nd, ensure_min_samples,
--> 573                     ensure_min_features, warn_on_dtype, estimator)
574     if multi_output:
575         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

D:\Enviroment\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
431                                       force_all_finite)
432     else:
--> 433         array = np.array(array, dtype=dtype, order=order, copy=copy)
434 
435         if ensure_2d:

ValueError: setting an array element with a sequence.

So my question is: how can I use multiple features with the HashingVectorizer, or is the only way to concatenate all features into one string?

Thanks for your help.

Update

The failure was in how the vectorized feature list was built. Instead of:

features=list(zip(X_Servername,X_Name,X_Description,X_Severity))

I now use the following function, where `extracted` is a list of all the vectorized values (`X_Servername`, `X_Name`, ...):

import numpy as np
from scipy import sparse

def combine(extracted):
    # If any feature block is sparse, stack the blocks horizontally as sparse.
    if any(sparse.issparse(fea) for fea in extracted):
        stacked = sparse.hstack(extracted).tocsr()
        # Note: .toarray() densifies the result; with HashingVectorizer's
        # default 2**20 features this can exhaust memory. SVC accepts sparse
        # CSR input directly, so this call can be dropped.
        stacked = stacked.toarray()
    else:
        stacked = np.hstack(extracted)

    return stacked
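
For context, a minimal end-to-end sketch of this approach (toy data modeled on the example above; `sparse.hstack` is what `combine` does for sparse inputs, and SVC accepts the sparse result directly):

```python
# Minimal end-to-end sketch (toy data invented to match the question's example).
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import SVC

df = pd.DataFrame({
    "ServerName":  ["QWERT-XY-123", "QWERT-XY-146", "QWERT-XY-123", "QWERT-XY-146"],
    "Name":        ["MySQL", "Oracle", "MySQL", "Oracle"],
    "Description": ["Service not available on target machine",
                    "Service caused an error",
                    "Service not available",
                    "Service warning raised"],
    "Severity":    ["error", "warning", "error", "warning"],
    "State":       ["important", "unimportant", "important", "unimportant"],
})

# A small n_features keeps the toy example light; the default is 2**20.
vectorizer = HashingVectorizer(n_features=32)

extracted = [vectorizer.transform(df[col])
             for col in ["ServerName", "Name", "Description", "Severity"]]

X = sparse.hstack(extracted).tocsr()   # one row per record, blocks side by side
y = df["State"].values

model = SVC(kernel="linear")           # sparse CSR input works without .toarray()
model.fit(X, y)
```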
Ax3l
  • You never `fit` your vectorizer before you attempt to transform your data. I'm guessing your output isn't what you think it is before you try to fit the SVC – G. Anderson Feb 19 '19 at 17:03
  • Hi @G.Anderson thanks for your reply. I `fit` the vectorizer with `fit_transform` but there is still the same error – Ax3l Feb 19 '19 at 17:15
  • Possible duplicate of [ValueError: setting an array element with a sequence. while using SVM in scikit-learn](https://stackoverflow.com/questions/25485503/valueerror-setting-an-array-element-with-a-sequence-while-using-svm-in-scikit) – G. Anderson Feb 19 '19 at 17:49

1 Answer


Please try the code below:

from sklearn_pandas import DataFrameMapper, gen_features
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import LabelEncoder

cat_features = ["ServerName", "Name", "Description", "Severity"]
gf = gen_features(cat_features, [HashingVectorizer])
mapper = DataFrameMapper(gf)
cat_features_transformed = mapper.fit_transform(df)

target_name_encoded = LabelEncoder().fit_transform(df["State"])

from sklearn.svm import SVC

model = SVC(kernel = "linear", probability=True)
model.fit(cat_features_transformed, target_name_encoded)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=True, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

### For test/prediction part ###

test_features_transformed = mapper.transform(df_test)
predictions = model.predict(test_features_transformed)
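
Since the target was label-encoded, `model.predict` returns integer codes; keeping a reference to the fitted `LabelEncoder` lets you map them back to state names (a small sketch with a hypothetical mini-frame):

```python
# Sketch: LabelEncoder maps state names to integer codes and back.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"State": ["important", "unimportant", "important"]})

le = LabelEncoder()
target_name_encoded = le.fit_transform(df["State"])

# model.predict(...) returns these integer codes; map them back to names:
decoded = le.inverse_transform(target_name_encoded)
```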

Note, you may need to run

pip install sklearn-pandas

if you do not have sklearn-pandas installed on your machine.

The solution above will allow you to (1) transform your data into a suitable format and later (2) apply the same fitted transformations to your test data via the `transform` method.
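
For comparison, the same per-column hashing can be wired up with scikit-learn's built-in `ColumnTransformer` instead of sklearn-pandas (a sketch with invented toy data; requires scikit-learn >= 0.20):

```python
# Equivalent wiring with scikit-learn's built-in ColumnTransformer.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer

df = pd.DataFrame({
    "ServerName":  ["QWERT-XY-123", "QWERT-XY-146"],
    "Name":        ["MySQL", "Oracle"],
    "Description": ["Service not available on target machine",
                    "Service caused an error"],
    "Severity":    ["error", "warning"],
})

# Passing a column *name* (not a list) hands the vectorizer a 1-D column
# of strings, which is what HashingVectorizer expects.
ct = ColumnTransformer(
    [(col, HashingVectorizer(n_features=32), col)
     for col in ["ServerName", "Name", "Description", "Severity"]]
)
X = ct.fit_transform(df)   # sparse matrix, blocks concatenated horizontally
```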

Please let us know if this helps.

Sergey Bushmanov
  • Is there an advantage of using sklearn-pandas to building a solution based on column transformer or feature union and incorporating these into a pipeline? – KRKirov Feb 19 '19 at 19:02
  • Seems to solve my problem. The model can be `fit`. I will test it tomorrow :-) – Ax3l Feb 19 '19 at 19:15
  • @KRKirov `DataFrameMapper` and `ColumnTransformer` are basically the same; the code using `gen_features` is neater. But you can always achieve the same by writing the sequence of transformations explicitly. – Sergey Bushmanov Feb 20 '19 at 05:45
  • @SergeyBushmanov, thanks for the response. Pardon me for saying this, but I find the solution based on sklearn-pandas somewhat untidy. It would have probably been easier to read a solution based on a pipeline using the standard sklearn transformers. – KRKirov Feb 20 '19 at 11:41