
I have a dataframe which looks like this:

label    predicted    F1   F2   F3  ...  F40
major    minor         2    1    4
major    major         1    0   10
minor    patch         4    3   23
major    patch         2    1   11
minor    minor         0    4    8
patch    major         7    3   30
patch    minor         8    0    1
patch    patch         1    7   11

I have label, which is the true label for each id (not shown, as it is not relevant), predicted, which is the predicted label, and then a set of around 40 features in my df.

The idea is to reduce these 40 features to 2 dimensions and visualize them, true vs. predicted. There are 9 cases in total: each of the three labels (major, minor, patch) against each of their predictions.

With PCA, 2 components do not capture much of the variance, and I am not sure how to map the PCA values back to the labels and predictions in the original df as a whole. One way to achieve this would be to split all the cases into 9 separate dataframes, but that isn't what I am looking for.
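For reference, this is roughly what the PCA attempt looks like (a sketch; df and the F1..F40 column names stand in for my actual dataframe):

from sklearn.decomposition import PCA

# Reduce the 40 features to 2 components and join them back
# to the labels in the same dataframe.
X = df[[f'F{i}' for i in range(1, 41)]].values
pca = PCA(n_components=2)
df[['pca_x', 'pca_y']] = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # low in my case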

Is there any other way I can reduce and visualize the given data? Any suggestions would be highly appreciated.

  • What do you mean by "visualize true vs predicted"? What is the goal? I thought dimensionality reduction was used to reduce the number of features for a regression/classification model. Are you sure you don't want to show the correlation between features for each of the 9 cases? – Corralien Apr 10 '23 at 18:54
  • No, the correlation part has already been done. For my analysis, we ran a GaussianNB model, trained on all 40 features of my dataset, to see how the labels predicted by the model compare with the original labels assigned manually (roughly the pipeline sketched after these comments). Now we want to see how the clusters form for each of the 9 cases, i.e. whether they correlate more with one principal component for one case and with the second for another. That is roughly the goal. – Brie MerryWeather Apr 10 '23 at 19:20
  • I have also already done something similar to this: https://www.reneshbedre.com/blog/principal-component-analysis.html, analyzing the loadings for all my feature names, but it hasn't been that useful: there are a lot of features and they overlap, which means a separate plot for the important features, through which I lose the patterns of the other (less important) features. – Brie MerryWeather Apr 10 '23 at 19:21
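(A rough sketch of the pipeline described in the comments above, with the column names from the question; the actual training code is not shown in the thread:)

from sklearn.naive_bayes import GaussianNB

# Train on all 40 features and store the model's predictions
# next to the manually assigned labels.
X = df[[f'F{i}' for i in range(1, 41)]]
model = GaussianNB().fit(X, df['label'])
df['predicted'] = model.predict(X)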

1 Answer


You may want to consider a small multiple plot with one scatterplot for each cell of the confusion matrix.

If PCA does not work well, t-distributed stochastic neighbor embedding (TSNE) is often a good alternative in my experience.

For example, with the iris dataset, which also has three prediction classes, it could look like this:

import pandas as pd
import seaborn as sns
from sklearn.manifold import TSNE

iris = sns.load_dataset('iris')

# Mock up some predictions.
iris['species_pred'] = (40 * ['setosa'] + 5 * ['versicolor'] + 5 * ['virginica']
                        + 40 * ['versicolor'] + 5 * ['setosa'] + 5 * ['virginica']
                        + 40 * ['virginica'] + 5 * ['versicolor'] + 5 * ['setosa'])

# Show confusion matrix.
pd.crosstab(iris.species, iris.species_pred)

species_pred  setosa  versicolor  virginica
species
setosa            40           5          5
versicolor         5          40          5
virginica          5           5         40

# Reduce features to two dimensions.
X = iris.iloc[:, :4].values
X_embedded = TSNE(n_components=2, init='random',
                  learning_rate='auto').fit_transform(X)
iris[['tsne_x', 'tsne_y']] = X_embedded

# Plot small multiples, corresponding to confusion matrix.
sns.set()
g = sns.FacetGrid(iris, row='species', col='species_pred', margin_titles=True)
g.map(sns.scatterplot, 'tsne_x', 'tsne_y');

[small multiples plot]
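Applied to your dataframe, the same pattern should work along these lines (a sketch; label, predicted, and F1..F40 are the column names from your question):

X = df[[f'F{i}' for i in range(1, 41)]].values
df[['tsne_x', 'tsne_y']] = TSNE(n_components=2, init='random',
                                learning_rate='auto').fit_transform(X)
g = sns.FacetGrid(df, row='label', col='predicted', margin_titles=True)
g.map(sns.scatterplot, 'tsne_x', 'tsne_y')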

Arne
  • Thanks for explaining it so well! It really helped me a lot; it turns out t-SNE gave the best clusters. I tried PCA and UMAP as well (see the sketch below), but the cluster formation was not as well defined. – Brie MerryWeather Apr 15 '23 at 22:25
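(For comparison, the UMAP variant mentioned in the comment above; a sketch assuming the umap-learn package, whose fit_transform API mirrors TSNE's:)

import umap  # pip install umap-learn

# Embed the same feature matrix with UMAP instead of t-SNE.
X_embedded = umap.UMAP(n_components=2, random_state=0).fit_transform(X)
iris[['umap_x', 'umap_y']] = X_embedded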