I have come across a peculiar situation when preprocessing data. Let's say I have a dataset `A`. I split it into `A_train` and `A_test`, fit one of the scikit-learn scalers on `A_train`, and transform `A_test` with that scaler. Training the neural network on `A_train` and validating on `A_test` then works well: there is no overfitting and performance is good.
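
For reference, a minimal sketch of this workflow, using a `StandardScaler` and random placeholder data standing in for `A` (the model-training code itself is omitted):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder features standing in for dataset A
A = np.random.default_rng(0).normal(size=(1000, 5))
A_train, A_test = train_test_split(A, test_size=0.2, random_state=0)

scaler = StandardScaler()                        # any scikit-learn scaler works the same way
A_train_scaled = scaler.fit_transform(A_train)   # fit only on the training split
A_test_scaled = scaler.transform(A_test)         # reuse the training statistics on the test split

# model.fit(A_train_scaled, y_train) and model.evaluate(A_test_scaled, y_test) would follow here
```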
Now let's say I have a dataset `B` with the same features as `A`, but with different ranges of values for those features. A simple example of `A` and `B` could be the Boston and Paris housing datasets respectively (this is just an analogy to say that feature ranges such as cost, crime rate, etc. vary significantly). To test the performance of the trained model on `B`, we transform `B` using the scaling attributes of `A_train` and then validate. This usually degrades performance, since the model has never seen data from `B`.
The peculiar thing is that if I fit and transform on `B` directly, instead of using the scaling attributes of `A_train`, the performance is a lot better. Usually this trick reduces performance when I apply it to `A_test`, yet in this scenario it seems to work, even though it isn't the right approach.
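
To make the two options concrete, here is a sketch with random placeholder arrays whose ranges deliberately differ (substitute the real `A_train` and `B`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
A_train = rng.normal(loc=0.0, scale=1.0, size=(800, 5))   # stand-in for A_train
B = rng.normal(loc=5.0, scale=3.0, size=(500, 5))         # same features, shifted/stretched ranges

scaler_A = StandardScaler().fit(A_train)

# Option 1: transform B with the statistics learned from A_train (the "correct" setup)
B_scaled_with_A_stats = scaler_A.transform(B)

# Option 2: fit and transform on B directly (performs better here, but uses
# statistics of the evaluation data, which would not be available in deployment)
B_scaled_with_own_stats = StandardScaler().fit_transform(B)

print(B_scaled_with_A_stats.mean(axis=0))    # far from 0, because B's ranges differ from A's
print(B_scaled_with_own_stats.mean(axis=0))  # ~0, because B was standardized with its own stats
```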
Since I work mostly with climate datasets, training on every dataset is not feasible. I would therefore like to know the best way to scale such different datasets that share the same features so as to get better performance. Any ideas would be appreciated.
PS: I know that training my model on more data can improve performance, but I am more interested in the right way of scaling. I tried removing outliers from the datasets and applying `QuantileTransformer`; it improved performance, but it could be better.
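
For illustration, a sketch of that `QuantileTransformer` step with placeholder data (the skewed features and the specific parameters are only illustrative):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(1)
A_train = rng.lognormal(mean=0.0, sigma=1.0, size=(800, 5))   # skewed, outlier-prone features

# Rank-based mapping to an approximately normal distribution; the ranks make it
# robust to outliers, which is why it can help more than plain standard/min-max scaling.
qt = QuantileTransformer(n_quantiles=200, output_distribution="normal", random_state=0)
A_train_scaled = qt.fit_transform(A_train)

# A new dataset would again be transformed with the quantiles learned on A_train:
# B_scaled = qt.transform(B)
```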