The dataset consists of two 2-dimensional matrices X and Y, both with n rows (the number of measurements) and m columns (the corresponding features of each measurement). From the first matrix I would like to obtain the kernel PCA components. Additionally, using cross-decomposition, I want to obtain the linear relations between both matrices using PLS and CCA.
The goal is to use a Pipeline to create, for each of the n rows of the first matrix, a feature vector consisting of its kernel PCA components together with its projections onto the latent spaces found by PLS and CCA, respectively. The feature vector of each row of X shall then be classified by an SVM in a binary classification task, with the labels available as train_labels and test_labels. The Y matrix is thus only used for the computation of the joint latent space onto which X is projected.
What is the best way of achieving this, considering that Kernel PCA fits only on the X_train data (first matrix), while PLS and CCA fit on both X_train and Y_train (both matrices)?
My code so far (not working):

import dask_searchcv as dcv
from sklearn.cross_decomposition import CCA, PLSCanonical
from sklearn.decomposition import KernelPCA
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

n_comp = 3
plsca = PLSCanonical(n_components=n_comp)
cca = CCA(n_components=n_comp)
kpca = KernelPCA(kernel="rbf", fit_inverse_transform=False, gamma=10, n_components=n_comp)
x_transf_kpca = kpca.fit_transform(X_train)
svm = SVC(probability=True, class_weight='balanced', tol=0.0001)
comb_feat_bna_sg = FeatureUnion([('pls_canonical', plsca), ('cca', cca)])
x_feats_bna_sg = comb_feat_bna_sg.fit(X_train, Y_train).transform(X_train)
pipe_bna = Pipeline([('kpca', kpca)])
pipe_bna_sg = Pipeline([("x_feats_bna_sg", comb_feat_bna_sg)])
combined_features = FeatureUnion([('bna', pipe_bna), ('bna_sg', pipe_bna_sg)])
pipe = Pipeline([("features", combined_features),
                 ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
                 ("svm", svm)])
# Parameters of pipelines can be set using '__' separated parameter names:
param_pipe = dict(features__bna_sg__x_feats_bna_sg__pls_canonical__n_components=[1, 2],
features__bna_sg__x_feats_bna_sg__cca__n_components=[1, 2],
features__bna__kpca__n_components=[1, 2],
svm__kernel=["rbf"],
svm__C=[10],
svm__gamma=[1e-2]
)
clf = dcv.GridSearchCV(pipe, param_pipe, cv=10)
clf.fit(X_train, train_labels)
y_predict = clf.predict(X_test)
Edit 1
I think the error is very closely related to the one described here, where the answer states:
The answer to your question about using PLSSVD within a Pipeline in cross_val_score, is no, it will not work out of the box, because the Pipeline object calls fit and transform using both variables X and Y as arguments if possible, which, as you can see in the code I wrote, returns a tuple containing the projected X and Y values. The next step in the pipeline will not be able to process this, because it will think that this tuple is the new X.
My exception stack trace:
Traceback (most recent call last):
File "D:/Network/SK_classifier_orders_Pipeline.py", line 236, in <module>
train_svm_classifier()
File "D:/Network/SK_classifier_orders_Pipeline.py", line 127, in train_svm_classifier
clf.fit(X_train, train_labels)
File "C:\ProgramData\Anaconda3\lib\site-packages\dask_searchcv-0+unknown-py3.6.egg\dask_searchcv\model_selection.py", line 867, in fit
File "C:\ProgramData\Anaconda3\lib\site-packages\dask\threaded.py", line 75, in get
pack_exception=pack_exception, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\dask\local.py", line 521, in get_async
raise_exception(exc, tb)
File "C:\ProgramData\Anaconda3\lib\site-packages\dask\compatibility.py", line 60, in reraise
raise exc
File "C:\ProgramData\Anaconda3\lib\site-packages\dask\local.py", line 290, in execute_task
result = _execute_task(task, data)
File "C:\ProgramData\Anaconda3\lib\site-packages\dask\local.py", line 271, in _execute_task
return func(*args2)
File "C:\ProgramData\Anaconda3\lib\site-packages\dask_searchcv-0+unknown-py3.6.egg\dask_searchcv\methods.py", line 187, in feature_union_concat
File "C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\shape_base.py", line 288, in hstack
arrs = [atleast_1d(_m) for _m in tup]
File "C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\shape_base.py", line 288, in <listcomp>
arrs = [atleast_1d(_m) for _m in tup]
File "C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\shape_base.py", line 52, in atleast_1d
ary = asanyarray(ary)
File "C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\numeric.py", line 583, in asanyarray
return array(a, dtype, copy=False, order=order, subok=True)
ValueError: could not broadcast input array from shape (5307,1) into shape (5307)
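This final ValueError matches what happens when a pair of differently-shaped arrays (such as a 2-D score matrix next to a 1-D vector) reaches NumPy's hstack machinery as a single element. A toy reproduction with the shapes from the traceback (the exact message varies with the NumPy version, but it is the same failure mode):

```python
import numpy as np

a = np.zeros((5307, 1))   # 2-D block, as in the traceback
b = np.zeros(5307)        # 1-D block of the same length

try:
    # np.hstack -> atleast_1d -> asanyarray tries to build one array
    # from the mismatched pair and fails
    np.asanyarray((a, b))
except ValueError as e:
    print("ValueError:", e)
```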
Edit 2
Upon generating the feature vectors for the first matrix (X), the last step of the Pipeline should use an SVM to classify them into two classes. The labels for the training data are available as a binary vector train_labels.
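To make this last step concrete in isolation (the data and shapes below are invented for illustration), once the feature vectors are assembled the classification itself is an ordinary scaler-plus-SVC pipeline:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
feats = rng.rand(100, 9)                         # e.g. 3 kPCA + 3 PLS + 3 CCA scores
train_labels = (feats[:, 0] > 0.5).astype(int)   # binary labels, as in the question

clf = Pipeline([("standardscaler", StandardScaler()),
                ("svm", SVC(probability=True, class_weight="balanced",
                            kernel="rbf", C=10, gamma=1e-2))])
clf.fit(feats, train_labels)
print(clf.predict(feats[:5]))                    # five 0/1 predictions
```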