I am doing research and reading some papers that use the SOM algorithm. I do not understand the logic of splitting the dataset into training/test sets for a SOM. For example, when a C4.5 decision tree is trained, the resulting structure includes rules that are applied when a new (test) dataset arrives, to classify the data there. However, what kind of rules, or something similar, are generated after a system is trained via SOM? What would be the difference if I applied 100% of my data to a SOM system instead of using 30% for training first and then 70% for testing? Thanks for your answers in advance.

BalkDal123

2 Answers


For every system that is data-dependent and is expected to be exposed to new data in the future, holding out part of the existing data for testing lets you robustly estimate how the system will perform once it is deployed. A SOM learns a specific data embedding. If you use all your data for training and later want to apply this trained SOM to never-before-seen data, you have no guarantees about how it will behave (that is, how good the learned representation is for the task at hand).

Having a hold-out set lets you test this in a controlled environment: you train the SOM representation on part of your data and then use it to embed the hold-out (test) portion, which simulates "what would happen if I got new data and wanted to use my SOM on it". The same applies to every algorithm that uses data, supervised or not: if you are going to deploy something based on the model, you need a test set to build confidence in your own solution.

If, on the other hand, you are just doing exploratory analysis of a "closed" set of data, then unsupervised methods can simply be applied to all of it (you are only asking "what is the structure in this particular dataset?").
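To make the hold-out idea concrete, here is a minimal sketch using the third-party MiniSom library. The dataset, map size, and hyperparameters are illustrative assumptions; quantization error is just one simple way to compare how well the trained map fits the training data versus unseen data.

```python
# pip install minisom
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 4))   # placeholder dataset: 300 samples, 4 features

# Hold out 30% of the data before training.
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

# Train a 10x10 SOM on the training portion only.
som = MiniSom(10, 10, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(train)
som.train_random(train, num_iteration=1000)

# Quantization error = mean distance from each sample to its
# best-matching unit. A much larger error on the test split suggests
# the learned embedding does not generalize to unseen data.
print("train quantization error:", som.quantization_error(train))
print("test quantization error: ", som.quantization_error(test))
```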

lejlot

It seems you do not see why a SOM (an unsupervised machine learning technique) should be treated like other machine learning techniques, hence your statement: "... a trained structure includes some rules to be applied when a new dataset (test) comes to classify the data there ..."

In general, during training (including that of a SOM), you aim to end up with a set of final weights (to use your words, "the rules to be applied") to be used on new, previously unseen data. The training set should cover a wide range of feature values and be a good representative of the kind of data you expect to apply the model to.

This will enable the final weights to be as accurate and reliable as possible. As to "what kind of rules or something similar are generated after a system is trained via SOM?": the final weights constitute the "rules" applied to any new data presented to the SOM. Hence the SOM will give you results based on the values in its final weights.
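As an illustration of "the final weights are the rules", here is a minimal plain-NumPy sketch (the weights below are random placeholders, not a trained map): applying a trained SOM to a new sample just means finding the neuron whose weight vector is closest to it, the best-matching unit (BMU).

```python
import numpy as np

# Placeholder for a trained 10x10 map of 4-dimensional weight vectors.
weights = np.random.rand(10, 10, 4)

def best_matching_unit(sample, weights):
    # Euclidean distance from the sample to every neuron's weight vector.
    distances = np.linalg.norm(weights - sample, axis=-1)
    # Grid coordinates of the closest neuron.
    return np.unravel_index(np.argmin(distances), distances.shape)

new_sample = np.random.rand(4)     # a previously unseen data point
print(best_matching_unit(new_sample, weights))
```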

Splitting the data into training and testing sets helps you gain confidence in the performance of the trained SOM before putting it into production.

The testing set, on the other hand, allows you to see how well the trained SOM performs: you compare the results from the training set with those from the testing set. This is important before you commence using the trained SOM. If you find big discrepancies between the results on the training set and the testing set, you should review the training set, probably including more diverse examples in it.

In short, having training and testing sets can assure you of the performance of the SOM when it is implemented. As stated here:

" ... we create test partitions to provide us honest assessments of the performance of our predictive models. No amount of mathematical reasoning and manipulation of results based on the training data will be convincing to an experienced observer."

Gathide