
The dataset consists of two 2-dimensional matrices X and Y, both with n rows (the number of measurements) and m columns (the features of each measurement). From the first matrix I would like to obtain the kernel PCA components. Additionally, using cross-decomposition I want to obtain the linear relations between both matrices via PLS and CCA.

The goal is to use a Pipeline to build, for each of the n rows of the first matrix, a feature vector consisting of its kernel PCA components together with its projections onto the latent spaces found by PLS and CCA, respectively. Each such feature vector of X shall then be classified by an SVM in a binary classification task, with the labels available as train_labels and test_labels. The Y matrix is thus only used to compute the joint latent space onto which X is projected.

What is the best way of achieving this, considering that Kernel PCA fits only on the X_train data (first matrix), while PLS and CCA fit on both X_train and Y_train (both matrices)?

My code until now (not working):

from sklearn.cross_decomposition import PLSCanonical, CCA
from sklearn.decomposition import KernelPCA
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import dask_searchcv as dcv

n_comp = 3

plsca = PLSCanonical(n_components=n_comp)
cca = CCA(n_components=n_comp)
kpca = KernelPCA(kernel="rbf", fit_inverse_transform=False, gamma=10, n_components=n_comp)
x_transf_kpca = kpca.fit_transform(X_train)  # standalone check; kpca is refit inside the pipeline

svm = SVC(probability=True, class_weight='balanced', tol=0.0001)

comb_feat_bna_sg = FeatureUnion([('pls_canonical', plsca), ('cca', cca)])
x_feats_bna_sg = comb_feat_bna_sg.fit(X_train, Y_train).transform(X_train)

pipe_bna = Pipeline([('kpca', kpca)])
pipe_bna_sg = Pipeline([("x_feats_bna_sg", comb_feat_bna_sg)])

combined_features = FeatureUnion([('bna', pipe_bna), ('bna_sg', pipe_bna_sg)])

pipe = Pipeline([("features", combined_features), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ("svm", svm)])

# Parameters of pipelines can be set using ‘__’ separated parameter names:
param_pipe = dict(features__bna_sg__x_feats_bna_sg__pls_canonical__n_components=[1, 2],
                  features__bna_sg__x_feats_bna_sg__cca__n_components=[1, 2],
                  features__bna__kpca__n_components=[1, 2],
                  svm__kernel=["rbf"],
                  svm__C=[10],
                  svm__gamma=[1e-2]
                  )

clf = dcv.GridSearchCV(pipe, param_pipe, cv=10)
clf.fit(X_train, train_labels)
y_predict = clf.predict(X_test)

Edit 1

I think the error is closely related to the one described here, where the answer states:

The answer to your question about using PLSSVD within a Pipeline in cross_val_score, is no, it will not work out of the box, because the Pipeline object calls fit and transform using both variables X and Y as arguments if possible, which, as you can see in the code I wrote, returns a tuple containing the projected X and Y values. The next step in the pipeline will not be able to process this, because it will think that this tuple is the new X.
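
To make this concrete, here is a minimal sketch (with made-up data and shapes) showing that fit_transform on PLSCanonical returns a pair of score arrays rather than a single matrix, which is exactly what the next pipeline step cannot digest:

import numpy as np
from sklearn.cross_decomposition import PLSCanonical

X = np.random.rand(20, 5)   # hypothetical data: 20 measurements, 5 features
Y = np.random.rand(20, 4)

pls = PLSCanonical(n_components=2)
out = pls.fit_transform(X, Y)
print(type(out))                    # <class 'tuple'>: (x_scores, y_scores)
print(out[0].shape, out[1].shape)   # (20, 2) (20, 2)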

My exception stack trace:

Traceback (most recent call last):
  File "D:/Network/SK_classifier_orders_Pipeline.py", line 236, in <module>
    train_svm_classifier()
  File "D:/Network/SK_classifier_orders_Pipeline.py", line 127, in train_svm_classifier
    clf.fit(X_train, train_labels)
  File "C:\ProgramData\Anaconda3\lib\site-packages\dask_searchcv-0+unknown-py3.6.egg\dask_searchcv\model_selection.py", line 867, in fit
  File "C:\ProgramData\Anaconda3\lib\site-packages\dask\threaded.py", line 75, in get
    pack_exception=pack_exception, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\dask\local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "C:\ProgramData\Anaconda3\lib\site-packages\dask\compatibility.py", line 60, in reraise
    raise exc
  File "C:\ProgramData\Anaconda3\lib\site-packages\dask\local.py", line 290, in execute_task
    result = _execute_task(task, data)
  File "C:\ProgramData\Anaconda3\lib\site-packages\dask\local.py", line 271, in _execute_task
    return func(*args2)
  File "C:\ProgramData\Anaconda3\lib\site-packages\dask_searchcv-0+unknown-py3.6.egg\dask_searchcv\methods.py", line 187, in feature_union_concat
  File "C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\shape_base.py", line 288, in hstack
    arrs = [atleast_1d(_m) for _m in tup]
  File "C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\shape_base.py", line 288, in <listcomp>
    arrs = [atleast_1d(_m) for _m in tup]
  File "C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\shape_base.py", line 52, in atleast_1d
    ary = asanyarray(ary)
  File "C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\numeric.py", line 583, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
ValueError: could not broadcast input array from shape (5307,1) into shape (5307)

Edit 2

After the feature vectors for the first matrix (X) have been generated, the last step of the Pipeline should use an SVM to classify them into two classes. The labels for the training data are available as a binary vector train_labels.

  • So what's the problem? What's not working? Are you getting any error? Please post that with the full stack trace. Also, you are unnecessarily complicating things by wrapping a single transformer into a FeatureUnion and a Pipeline. Keep it simple. – Vivek Kumar Dec 13 '17 at 02:13
  • I edited the question to include the stack trace and further info. If you know how to achieve my goal with a simpler solution, please offer it as an answer. – AlexGuevara Dec 13 '17 at 06:02
  • This error comes when the FeatureUnion tries to combine the output from `plsca` and `cca`. Both these outputs contain a tuple of the form (X_array, y_array), where X_array has shape [n_samples, n_comps] and y_array has shape [n_samples, n_comps]. So please tell me how you want to combine these arrays. Do you only want to combine the X_arrays from both `plsca` and `cca`, OR do you want to first concatenate X_array and y_array into a single array and then concatenate such single arrays from both `plsca` and `cca`? – Vivek Kumar Dec 13 '17 at 08:28
  • I am only interested in the combination of the X_arrays (the projection of the data from the original X matrix onto the joint latent space for both CCA and PLSCA), then their combination with the kernel PCA projection of the X matrix. I do not need the Y projections onto the joint space. – AlexGuevara Dec 13 '17 at 09:48

1 Answer


As per the discussion in the comments, since you only want to combine the X parts of each output, this can be done with a custom transformer that returns the first element of the tuple returned by PLSCanonical or CCA.

from sklearn.base import BaseEstimator, TransformerMixin

class CustomXySeparator(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # At predict time the preceding PLS/CCA step returns only the X scores,
        # so there is nothing to separate.
        if y is None:
            return X
        # During fit_transform the preceding step returns (x_scores, y_scores);
        # keep only the X scores.
        return X[0]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X, y)
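
As a quick sanity check (the data and shapes here are made up for illustration), chaining the separator defined above after PLSCanonical yields only the X scores, both during fit_transform and on the predict path:

import numpy as np
from sklearn.cross_decomposition import PLSCanonical
from sklearn.pipeline import Pipeline

X = np.random.rand(20, 5)   # hypothetical data: 20 measurements, 5 features
Y = np.random.rand(20, 4)

pls_onlyX = Pipeline([("pls", PLSCanonical(n_components=2)),
                      ("getX", CustomXySeparator())])
print(pls_onlyX.fit_transform(X, Y).shape)  # (20, 2): only the X scores
print(pls_onlyX.transform(X).shape)         # (20, 2) on the predict path too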


from sklearn.cross_decomposition import PLSCanonical, CCA
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

n_comp = 3

plsca = PLSCanonical(n_components=n_comp)
x_plsca = plsca.fit_transform(X_train, Y_train)  # standalone check only; the pipeline refits plsca itself

cca = CCA(n_components=n_comp)
x_cca = cca.fit_transform(X_train, Y_train)      # standalone check only; the pipeline refits cca itself

kpca = KernelPCA(kernel="rbf", fit_inverse_transform=False, gamma=10, n_components=n_comp)
comb_feat_bna_sg = FeatureUnion([('pls_onlyX', Pipeline([("pls", plsca), ('getX', CustomXySeparator())])), 
                                 ('cca_onlyX', Pipeline([("cca", cca), ('getX', CustomXySeparator())]))])

x_feats_bna_sg = comb_feat_bna_sg.fit_transform(X_train, Y_train)  # standalone check only; not required

combined_features = FeatureUnion([('kpca', kpca), 
                                  ("x_feats_bna_sg", comb_feat_bna_sg)])

svm = SVC(probability=True, class_weight='balanced', tol=0.0001)

pipe = Pipeline([("features", combined_features), 
                 ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), 
                 ("svm", svm)])

# Parameters of pipelines can be set using ‘__’ separated parameter names:
param_pipe = dict(features__x_feats_bna_sg__pls_onlyX__pls__n_components=[1, 2],
                  features__x_feats_bna_sg__cca_onlyX__cca__n_components=[1, 2],
                  features__kpca__n_components=[1, 2],
                  svm__kernel=["rbf"],
                  svm__C=[10],
                  svm__gamma=[1e-2]
                  )

clf = GridSearchCV(pipe, param_pipe, cv=10)
clf.fit(X_train, Y_train)  # see the comment thread below about passing Y_train vs. train_labels here
y_predict = clf.predict(X_test)

Please note that I have removed the unnecessary wrapping of a Pipeline around single transformers, like pipe_bna = Pipeline([('kpca', kpca)]), and changed the parameter names accordingly. Please go through it once, and ask if anything is unclear.

  • Thank you! What I do not fully understand: in my code above, I had `clf.fit(X_train, train_labels)`. The SVM is supposed to use the X feature vector from CCA, PLSCA and KPCA and perform (binary) classification using `train_labels`. In the new code I do not see any usage of `train_labels` anymore? – AlexGuevara Dec 13 '17 at 12:03
  • @AlexGuevara What actually is train_labels? How did you create X_train and Y_train in your code? – Vivek Kumar Dec 13 '17 at 12:13
  • We start with the two matrices, X and Y. The rows (measurements) are divided into train and test, so you have `X_train` and `X_test` from the first matrix, and `Y_train` and `Y_test` from the second. We want to classify `X_train` and `X_test` into 0, 1 (binary classification). The feature vector which you computed (the projections of the X matrix) is used as input for the SVM classifier. `train_labels` and `test_labels` are two vectors consisting of 0s and 1s which are to be used by the SVM for the binary classification. – AlexGuevara Dec 13 '17 at 12:20
  • @AlexGuevara So what are Y_train and Y_test then? Aren't they the same as train_labels and test_labels? – Vivek Kumar Dec 13 '17 at 12:22
  • No: the labels are binary vectors. The Y matrix consists of another set of measurements, just like X; they are however physically linked together. Only the Y_train part of the Y matrix should be used by PLSCA and CCA to define the joint latent space on which X is projected. Basically we want to perform binary classification on X, and Y_train is used to jointly define another space on which we project X. – AlexGuevara Dec 13 '17 at 12:29
  • @AlexGuevara Oh, OK then. I was considering y to be the actual labels, as happens most of the time. In this case, just replace the Y_train in the last line with train_labels. – Vivek Kumar Dec 13 '17 at 12:36
  • If I change the line to fit on `X_train, train_labels`, wouldn't `PLSCA` and `CCA` fit on the labels instead of finding the common latent space between the two matrices? – AlexGuevara Dec 13 '17 at 14:46
  • @AlexGuevara Ahh yes, they will. In that case, you need to separate the SVM from the rest of the pipeline. But that will restrict the usage of GridSearchCV. You may need another custom transformer for that, I guess, which will take both Y_train and train_labels into account and pass them to the appropriate steps. – Vivek Kumar Dec 13 '17 at 15:00
  • Could you please adapt your answer to include the extra custom transformer? In that case I will gladly accept it. – AlexGuevara Dec 14 '17 at 07:02
  • @AlexGuevara Why not combine Y_train and train_labels into a single array and then use the custom transformer to decide what to send to the SVC and what to send to the other pipeline? In this way, when GridSearchCV uses cross-validation and splits the data into train and test, Y_train and train_labels are split together. – Vivek Kumar Dec 14 '17 at 07:16
  • Also possible, keeping in mind that Y_train is a matrix and train_labels a vector. One would basically expand each row of Y_train by a 0 or 1 depending on the label, then separate them. – AlexGuevara Dec 14 '17 at 07:22
  • Yes. As the rows would be the same in both, we can just concatenate them easily, possibly appending train_labels as the last column of Y_train and separating as required (see the sketch after this thread). What do you say? If you agree, we can have a custom wrapper for this. – Vivek Kumar Dec 14 '17 at 07:27
  • Totally possible. I just do not know how to write these kinds of wrappers/transformers. – AlexGuevara Dec 14 '17 at 07:38
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/161241/discussion-between-alexguevara-and-vivek-kumar). – AlexGuevara Dec 14 '17 at 21:21
  • Why is it necessary to create the x_cca variable? – anitasp Apr 05 '19 at 18:07
  • @VivekKumar, thank you so much. This fits my use case perfectly - a transformer that takes X and Y matrices, but only needs the transformed X scores to feed into a KNearestNeighbors step. Awesome! – grovduck Feb 25 '21 at 00:18
  • @VivekKumar, sheepishly I just realized that I have the same issue as AlexGuevara - namely that my Y that feeds into KNearestNeighbors is a set of labels and the Y that feeds into my CCA-like transformer is a species matrix. It sounds like you two were coming up with a custom transformer to split Y and feed the correct subset of Y to the separate estimators. Unfortunately, your discussion in chat no longer exists. Any solution that you came up with? – grovduck Feb 25 '21 at 01:09
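
Since the chat room is gone, here is a minimal, hedged sketch of the idea from the comments above: append train_labels as the last column of Y_train so that cross-validation splits them together, then route each part of y to the right step via thin wrapper estimators. The wrapper names (DropLabelColumn, LabelColumnClassifier) are hypothetical, not from the original discussion:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin

class DropLabelColumn(BaseEstimator, TransformerMixin):
    """Fits a wrapped transformer (e.g. PLSCanonical or CCA) on y without
    the appended label column."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y[:, :-1])  # strip the label column before fitting
        return self

    def transform(self, X):
        # transform(X) without Y returns only the X scores, so no tuple
        # separator is needed on this path.
        return self.estimator.transform(X)

class LabelColumnClassifier(BaseEstimator, ClassifierMixin):
    """Fits a wrapped classifier (e.g. SVC) on the appended label column only."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y[:, -1].astype(int))  # labels live in the last column
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def score(self, X, y):
        # Score against the label column so that GridSearchCV keeps working.
        return self.estimator.score(X, y[:, -1].astype(int))

# Combine once, so cross-validation splits Y_train and train_labels together
# (with a 2-D y, GridSearchCV falls back to plain KFold instead of StratifiedKFold):
# Y_combined = np.hstack([Y_train, train_labels.reshape(-1, 1)])
# Wrap the PLS/CCA steps in DropLabelColumn and the final SVC in
# LabelColumnClassifier, then fit as usual: pipe.fit(X_train, Y_combined)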