SKLearn Multi Classification without Knowing the Classifications in Advance Python

Question

I have recently got into using SKLearn, especially classification models, and have a question that is more about use cases than about being stuck on a particular bit of code, so apologies in advance if this isn't the right place to ask.

So far I have been using sample data where the model is trained on data that has already been classified. In the 'Iris' data set, for example, every row is labelled as one of three species. But what if one wants to group/classify the data without knowing the classifications in the first place?

Let's take this imaginary data:

  Name  Feat_1  Feat_2  Feat_3  Feat_4
0    A      12    0.10       0    9734
1    B      76    0.03       1   10024
2    C      97    0.07       1    8188
3    D      32    0.21       1    6420
4    E      45    0.15       0    7723
5    F      61    0.02       1   14987
6    G      25    0.22       0    5290
7    H      49    0.30       0    7107

If one wanted to split the names into 4 separate classifications using the different features, is this possible, and which SKLearn model(s) are needed? I'm not asking for any code; I'm quite able to research on my own if someone could point me in the right direction. So far I can only find examples where the classifications are already known.

In the example above, if I wanted to break the data down into 4 classifications, I would want my outcome to be something like this (note the new column, denoting the class):

  Name  Feat_1  Feat_2  Feat_3  Feat_4  Class
0    A      12    0.10       0    9734      4
1    B      76    0.03       1   10024      1
2    C      97    0.07       1    8188      3
3    D      32    0.21       1    6420      3
4    E      45    0.15       0    7723      2
5    F      61    0.02       1   14987      1
6    G      25    0.22       0    5290      4
7    H      49    0.30       0    7107      4

Many thanks for any help

score 1 · Accepted Answer · answered Sep 17 '19 at 16:54

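You can use k-means clustering. Note that k-means does not merge groups step by step: you choose the number of clusters up front (for example n_clusters=4) and it partitions the data into exactly that many groups. The procedure that groups the data into fewer and fewer clusters in each iteration until everything is in one group is agglomerative hierarchical clustering. With that approach you can either stop merging early once the number of clusters is what you want, or build the full hierarchy and then cut it at the level that gives the number you need, for example at the level where the data are grouped into 4 clusters.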

sklearn.cluster.KMeans doc

score 1 · Answer 2 · answered Sep 17 '19 at 16:56

Classification is a supervised approach, meaning that the training data comes with features and labels. If you want to group the data according to the features, then you can go for some clustering algorithms (unsupervised), such as sklearn.cluster.KMeans (with k = 4).
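
As a rough sketch of what that could look like on the data from the question: the snippet below fits `KMeans` with `n_clusters=4` and writes the result into a new `Class` column. Standardising the features first is my own assumption (otherwise `Feat_4`, which is in the thousands, would dominate the distances), and `random_state=0` is only there to make the run repeatable. The cluster numbers are arbitrary identifiers, so they will not necessarily match the numbering in the question's desired output.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# The example data from the question.
df = pd.DataFrame({
    "Name": list("ABCDEFGH"),
    "Feat_1": [12, 76, 97, 32, 45, 61, 25, 49],
    "Feat_2": [0.10, 0.03, 0.07, 0.21, 0.15, 0.02, 0.22, 0.30],
    "Feat_3": [0, 1, 1, 1, 0, 1, 0, 0],
    "Feat_4": [9734, 10024, 8188, 6420, 7723, 14987, 5290, 7107],
})

feature_cols = ["Feat_1", "Feat_2", "Feat_3", "Feat_4"]

# Assumption: put all features on a comparable scale before clustering.
X = StandardScaler().fit_transform(df[feature_cols])

# k is chosen up front; n_init and random_state are illustrative settings.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)

# fit_predict returns labels 0..3; shift to 1..4 to mirror the question's layout.
df["Class"] = kmeans.fit_predict(X) + 1

print(df)
```

With only eight rows the grouping is quite sensitive to scaling and initialisation, so it is worth checking whether the clusters make sense for your real data.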

phoenix

Thanks very much for your help! – top bantz Sep 17 '19 at 17:03

score 1 · Answer 3 · answered Sep 17 '19 at 17:09

Start with an unsupervised method to determine the clusters, then use those clusters as your labels.

I recommend using sklearn's GMM (Gaussian mixture model) instead of k-means.

https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html

K-means assumes roughly spherical clusters of similar size, whereas a Gaussian mixture can also fit elongated (elliptical) clusters.
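
To illustrate, here is a minimal sketch using `sklearn.mixture.GaussianMixture`. The data is synthetic and made up purely for the demonstration (four blobs stretched by an arbitrary linear map so they are no longer round); `covariance_type="full"` and `random_state=0` are my own choices, not something from the question.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data (an assumption for illustration): four blobs,
# stretched by a linear map so the clusters become elongated ellipses.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

# "full" lets each component learn its own elliptical covariance.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)

# Hard assignment: index of the most likely component for each row.
labels = gmm.fit_predict(X)

# Soft assignment: per-row membership probabilities for each component.
probs = gmm.predict_proba(X)

print(np.bincount(labels, minlength=4))  # number of rows per component
print(probs[:3].round(3))
```

The hard labels can then be used as the "classes", while the probabilities show how confident the model is about each assignment.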

Kermit

Thank you mate, really appreciate the help – top bantz Sep 17 '19 at 17:46

score 1 · Answer 4 · answered Sep 18 '19 at 06:50

This topic is called unsupervised learning.

One definition is:

Unsupervised learning is a type of self-organized Hebbian learning that helps find previously unknown patterns in data set without pre-existing labels. It is also known as self-organization and allows modeling probability densities of given inputs.[1] It is one of the main three categories of machine learning, along with supervised and reinforcement learning. Semi-supervised learning has also been described, and is a hybridization of supervised and unsupervised techniques.

There are tons of algorithms out there, and you need to try which one fits your data best. Some examples are:

Hierarchical clustering (implemented in SciPy: https://en.wikipedia.org/wiki/Single-linkage_clustering)
k-means (implemented in sklearn: https://en.wikipedia.org/wiki/K-means_clustering)
DBSCAN (implemented in sklearn: https://en.wikipedia.org/wiki/DBSCAN)
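
For a quick feel of the first and last options, here is a hedged sketch on the feature rows from the question. Scaling the features, using `method="single"` (to match the linked article; other linkage methods such as `"ward"` exist), and the DBSCAN values `eps=1.5` and `min_samples=2` are all assumptions on my part and would need tuning on real data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Feature rows A..H from the question (Feat_1 .. Feat_4).
X_raw = np.array([
    [12, 0.10, 0, 9734],
    [76, 0.03, 1, 10024],
    [97, 0.07, 1, 8188],
    [32, 0.21, 1, 6420],
    [45, 0.15, 0, 7723],
    [61, 0.02, 1, 14987],
    [25, 0.22, 0, 5290],
    [49, 0.30, 0, 7107],
], dtype=float)

# Assumption: standardise so no single feature dominates the distances.
X = StandardScaler().fit_transform(X_raw)

# Hierarchical clustering: build the full merge tree once...
Z = linkage(X, method="single")
# ...then cut it so that at most 4 flat clusters remain (labels start at 1).
hier_labels = fcluster(Z, t=4, criterion="maxclust")

# DBSCAN: no cluster count is given; it is implied by eps and min_samples.
# A label of -1 marks a point treated as noise.
db_labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(X)

print("hierarchical:", hier_labels)
print("DBSCAN:      ", db_labels)
```

Note the difference in control: the hierarchical tree can be cut at any level after it is built, whereas DBSCAN decides the number of clusters itself, so it may not return the 4 groups asked for.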
