
Just a machine learning/data science problem.

a) Let's say I have a dataset of 20 features, and I decide to use 3 of them to perform unsupervised clustering, which ideally produces 3 clusters (A, B, and C).

b) Then I feed that output (cluster A, B, or C) back into my dataset as a new feature (i.e. now 21 features in total).

c) I run a regression model to predict a label value with the 21 features.

I wonder whether step b) is redundant (since the cluster is derived from features that already exist in the dataset), especially if I use a more powerful model (Random Forest, XGBoost), and how to explain this mathematically.
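
To make this concrete, here is roughly what I have in mind (just a sketch with made-up data; KMeans for the clustering and a random forest for the regression are only example choices):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

# made-up data: 1000 samples, 20 features, continuous label
rng = np.random.RandomState(0)
X = rng.rand(1000, 20)
y = rng.rand(1000)

# a) cluster on 3 of the 20 features, hoping for 3 clusters (A, B, C)
cluster_cols = [0, 1, 2]
kmeans = KMeans(n_clusters=3, random_state=0).fit(X[:, cluster_cols])

# b) append the cluster assignment as a 21st feature
X_21 = np.hstack([X, kmeans.labels_.reshape(-1, 1)])

# c) regression on all 21 features
reg = RandomForestRegressor(random_state=0).fit(X_21, y)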

Any opinions and suggestions will be great!

Gabriel
  • What about giving that idea a quick try, e.g. comparing performance with/without the additional features taken from clustering? I saw it done here with Iris, although I'm not sure how reliable it is: https://towardsdatascience.com/how-to-create-new-features-using-clustering-4ae772387290 – arnaud Feb 27 '20 at 09:12
  • You can certainly do this. As @Frederik Bode said, you would use two separate models for this, so the unsupervised model can be considered a further feature-engineering step. Optionally, this method could also be used to remove noise from the input data, but in that case you would rather not send the original features into the second model. I would try out which variant fits best. – SerAlejo Feb 27 '20 at 09:13
  • I'm curious to know if there's a mathematical way of thinking about this, but perhaps it's only by experimentation... haha – Gabriel Feb 27 '20 at 09:28

2 Answers


Aha, nice one! You might think you are using two models, but you are actually combining two models into one, with skip connections (the original features bypass the clustering step and go straight into the second model). As it is effectively one model, there is no way of knowing the best architecture beforehand, per the No Free Lunch theorem. So practically you have to try it out, and mathematically there is no way to know in advance.
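
To spell this out with some informal notation: let h be the clustering model fit on the chosen feature subset x_S, and g the downstream regressor; the combined predictor is then roughly ŷ = g(x, h(x_S)). The skip connection is that the full feature vector x also enters g directly, so a sufficiently flexible g could in principle reconstruct h(x_S) on its own. Whether making it an explicit extra input helps is exactly the kind of architecture choice that, per No Free Lunch, cannot be settled without trying it on your data.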

Frederik Bode
  • Not sure how this helps the author - do you have any further justification? Or suggestion? – arnaud Feb 27 '20 at 09:08
  • As it is one model, there is no way of knowing for sure what the best architecture is beforehand, per the No Free Lunch Theorem. So, practically, you have to try it out, and mathematically, there's no knowing it beforehand, because of the No Free Lunch Theorem. – Frederik Bode Feb 27 '20 at 09:13
  • Do you think that's better? If not I will delete my answer. – Frederik Bode Feb 27 '20 at 09:14
  • Hmm, guess the only way is experimentation? No mathematical way to analyze it? – Gabriel Feb 27 '20 at 10:04

Great idea: just give it a try and see how it goes. As you guessed, this is highly dependent on your dataset and model choice. It's hard to predict how adding this type of feature will behave, just like any other feature engineering. But be careful: in some cases it doesn't even improve your performance. See the test below on the Iris dataset, where performance actually decreases:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn import metrics

# load data
iris = load_iris()
X = iris.data[:, :3]  # only keep three out of the four available features to make it more challenging
y = iris.target

# split train / test
indices = np.random.permutation(len(X))
N_test = 30
X_train, y_train = X[indices[:-N_test]], y[indices[:-N_test]]
X_test, y_test = X[indices[-N_test:]], y[indices[-N_test:]]

# compute a clustering method (here KMeans) based on available features in X_train
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train)
new_clustering_feature_train = kmeans.predict(X_train)
new_clustering_feature_test = kmeans.predict(X_test)

# create a new input train/test X with this feature added
X_train_with_clustering_feature = np.column_stack([X_train, new_clustering_feature_train])
X_test_with_clustering_feature = np.column_stack([X_test, new_clustering_feature_test])

Now let's compare two models, one trained only on X_train and the other on X_train_with_clustering_feature:

model1 = SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train, y_train)
print(metrics.classification_report(model1.predict(X_test), y_test))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        45
           1       0.95      0.97      0.96        38
           2       0.97      0.95      0.96        37

    accuracy                           0.97       120
   macro avg       0.97      0.97      0.97       120
weighted avg       0.98      0.97      0.97       120

And the other model:

model2 = SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train_with_clustering_feature, y_train)
print(metrics.classification_report(model2.predict(X_test_with_clustering_feature), y_test))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        45
           1       0.87      0.97      0.92        35
           2       0.97      0.88      0.92        40

    accuracy                           0.95       120
   macro avg       0.95      0.95      0.95       120
weighted avg       0.95      0.95      0.95       120
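
A variant worth trying (just a sketch reusing the arrays defined above, not something benchmarked here): one-hot encode the cluster ID instead of passing it as a raw number, since the values 0/1/2 imply an ordering between clusters that doesn't actually exist.

from sklearn.preprocessing import OneHotEncoder

# encode the cluster ID as three binary columns instead of one ordinal column
encoder = OneHotEncoder()
onehot_train = encoder.fit_transform(new_clustering_feature_train.reshape(-1, 1)).toarray()
onehot_test = encoder.transform(new_clustering_feature_test.reshape(-1, 1)).toarray()

X_train_onehot = np.column_stack([X_train, onehot_train])
X_test_onehot = np.column_stack([X_test, onehot_test])

model3 = SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train_onehot, y_train)
print(metrics.classification_report(y_test, model3.predict(X_test_onehot)))
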
arnaud