
I have a general question about data pre-processing for machine learning. I know that it is almost mandatory to center the data around 0 (mean subtraction) and to normalize it (scale it to unit variance). There are other possible techniques as well. This has to be applied to both the training and validation data sets.
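As a minimal sketch of what these two steps look like (using NumPy and made-up stand-in image data, since the actual data set is not shown):

```python
import numpy as np

# Made-up stand-in for a batch of 100 grayscale 32x32 training images (values 0-255)
rng = np.random.default_rng(42)
X_train = rng.integers(0, 256, size=(100, 32, 32)).astype("float64")

# Mean subtraction: center the data around 0
mean = X_train.mean()
X_centered = X_train - mean

# Normalization: scale to unit variance
std = X_train.std()
X_normalized = X_centered / std
```

After this, `X_normalized` has (approximately) zero mean and unit standard deviation.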

I have encountered the following problem. My neural network, trained to classify specific shapes in images, fails to do so if I do not apply these pre-processing techniques to the images that are to be classified. These 'to classify' images are of course not contained in the training set or the validation set. Thus my question:

Is it normal to apply normalization to the data that is to be classified, or does the bad performance of my network without these techniques mean that my model is bad, in the sense that it has failed to generalize and has overfitted?

P.S. With normalization applied to the 'to classify' images, my model performs quite well (about 90% accuracy); without it, below 30%.

Additional info: the model is a convolutional neural network built with Keras and TensorFlow.

Apolonius
  • This question is not suited for Stack Overflow. https://stats.stackexchange.com/ might be the better choice. Also, this question is very broad, and without knowing your data set/architecture it's hard to give a meaningful answer. In my personal opinion, I would say it's possible that preprocessing can make such a big difference. – dennis-w Jul 12 '18 at 13:40

1 Answer


It goes without saying (although admittedly it is seldom mentioned explicitly in introductory tutorials, hence the frequent frustration of beginners) that new data fed to the model for classification have to undergo the very same pre-processing steps followed for the training (and test) data.

Some common sense is certainly expected here: in all kinds of ML modeling, new input data are expected to have the same "general form" as the original data used for training & testing. If you stop for a moment to think about the opposite case (i.e. what you have been trying to do), you should be able to convince yourself that it does not make much sense...
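To make this concrete, here is a hedged sketch (NumPy only, with made-up data) of the key point: the normalization statistics are computed on the training data once, and then reused unchanged for any new images before they are passed to `model.predict()`:

```python
import numpy as np

# Made-up stand-in for the training images (values 0-255)
rng = np.random.default_rng(0)
X_train = rng.integers(0, 256, size=(100, 32, 32)).astype("float64")

# Fit the pre-processing statistics on the TRAINING data only
train_mean = X_train.mean()
train_std = X_train.std()

def preprocess(images):
    """Apply the exact same transform the model saw during training."""
    return (images - train_mean) / train_std

# New, unseen images must pass through the identical transform
# before being fed to the model for classification
X_new = rng.integers(0, 256, size=(5, 32, 32)).astype("float64")
X_new_ready = preprocess(X_new)
```

Note that `train_mean` and `train_std` are fixed after training; they are not recomputed from the new images.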

The following answers may help you clarify the idea, illustrating also the case of inverse transforming the predictions whenever necessary:

How to predict a function/table using Keras?

Getting very bad prediction with KerasRegressor
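Inverse transforming, mentioned above, is simply the same step run in reverse; a minimal sketch with made-up regression targets (the exact variable names here are illustrative, not from the linked answers):

```python
import numpy as np

# Made-up regression targets, scaled before training
y_train = np.array([100.0, 200.0, 300.0])
y_mean, y_std = y_train.mean(), y_train.std()
y_scaled = (y_train - y_mean) / y_std  # what the model trains on

# A prediction comes back in the scaled space...
pred_scaled = y_scaled[0]
# ...and must be inverse transformed back to the original units
pred = pred_scaled * y_std + y_mean
```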

desertnaut
    I had the feeling that it had to be that way, and that applying the same normalization to the 'to be classified' data is the logical step. I was just wondering, as I failed to find anything on this topic; as you said, somehow they do not include this in the tutorials, or they use already-normalized data. Thank you for the fast answer. – Apolonius Jul 12 '18 at 13:48