
I have come across a peculiar situation when preprocessing data.

Let's say I have a dataset A. I split it into A_train and A_test, fit a scaler (any of the scikit-learn scalers) on A_train, and transform A_test with that scaler. Training the neural network on A_train and validating on A_test then works well: no overfitting, and the performance is good.
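In code, the within-dataset workflow looks roughly like this (using synthetic random data as a stand-in for A, and `StandardScaler` as one example; any of the scikit-learn scalers works the same way):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for dataset A (1000 samples, 5 features)
rng = np.random.RandomState(0)
A = rng.normal(loc=10.0, scale=3.0, size=(1000, 5))

A_train, A_test = train_test_split(A, test_size=0.2, random_state=0)

# Fit the scaler on A_train only, then reuse it for A_test
scaler = StandardScaler().fit(A_train)
A_train_scaled = scaler.transform(A_train)
A_test_scaled = scaler.transform(A_test)  # no leakage: A_test never influences the fit
```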

Let's say I have a dataset B with the same features as A, but with different value ranges for those features. A simple example of A and B could be the Boston and Paris housing datasets respectively (this is just an analogy to say that feature ranges, such as cost, crime rate, etc., vary significantly). To test the performance of the model trained above on B, we transform B according to the scaling attributes of A_train and then validate. This usually degrades performance, as the model has never seen data from B.

The peculiar thing is that if I fit and transform on B directly, instead of using the scaling attributes of A_train, the performance is a lot better. Fitting directly on the test data usually reduces performance when I do it on A_test, yet in this scenario it seems to work, although it isn't right.
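To make the two options concrete, here is a sketch with synthetic stand-ins for A_train and B whose features sit in very different ranges, contrasting reusing A_train's scaler against fitting a fresh one on B:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
A_train = rng.normal(loc=10.0, scale=3.0, size=(1000, 5))
B = rng.normal(loc=50.0, scale=8.0, size=(1000, 5))  # same features, shifted range

scaler_A = StandardScaler().fit(A_train)

# Option 1: transform B with A_train's statistics.  B lands far from the
# region of feature space the network saw during training.
B_via_A = scaler_A.transform(B)  # mean far from 0

# Option 2: fit and transform on B directly.  B is now centred the way
# A_train was, but the scaler "peeks" at the test data.
B_via_B = StandardScaler().fit_transform(B)  # mean approximately 0
```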

Since I work mostly on climate datasets, training on every dataset is not feasible. I would therefore like to know the best way to scale such different datasets with the same features to get better performance.

Any ideas, please.

PS: I know that training my model with more data can improve performance, but I am more interested in the right way of scaling. I tried removing outliers from the datasets and applying QuantileTransformer; it improved performance, but it could be better.
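For reference, the outlier-robust setup I tried looks roughly like this (with synthetic data and artificially injected outliers, just to illustrate):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
X[:10] *= 100.0  # inject a few extreme outliers

# Maps each feature through its empirical quantiles; outliers end up at the
# tails of the output distribution instead of stretching the whole scale.
qt = QuantileTransformer(output_distribution="normal", n_quantiles=1000,
                         random_state=0)
X_q = qt.fit_transform(X)
```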

raghu
  • If the housing datasets parallelism holds, I do not see why a model trained for a specific context should be good for another context. Some features and dynamics could match, others not. Based on erroneous assumptions, your model could be severely flawed. – sentence May 03 '19 at 13:26
  • I didn't say the housing datasets parallelism holds; I meant the other dataset has the same features, but in a different range altogether (if you know how costly Paris or California is). The reference is just an analogy. It is like training a climate model on the US and predicting for the European climate. I could always improve performance by showing the model more data, but there is a limit. – raghu May 03 '19 at 13:39

1 Answer


One possible solution could be the following.

  1. Normalize (pre-process) dataset A such that the range of each feature lies within a fixed interval, e.g., [-1, 1].
  2. Train your model on the normalized set A.
  3. Whenever you are given a new dataset like B:

    • (3.1) Normalize the new dataset such that its features have the same range as they do in A ([-1, 1]).
    • (3.2) Apply your trained model (step 2) on the normalized new set (3.1).
  4. Since you have a one-to-one mapping between set B and its normalized version, you can recover the predictions on set B from the predictions on the normalized set B.
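The steps above can be sketched with scikit-learn's `MinMaxScaler`, fitting a separate scaler per dataset so that each one is mapped into [-1, 1] on its own (the random data here are just stand-ins for A and B):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
A = rng.normal(loc=10.0, scale=3.0, size=(500, 4))
B = rng.normal(loc=50.0, scale=8.0, size=(500, 4))  # same features, different ranges

# Step 1: map A into [-1, 1]; the model is then trained on A_scaled.
A_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(A)

# Step 3.1: when a new set B arrives, fit a *fresh* scaler on B itself,
# so that B occupies the same [-1, 1] interval the model was trained on.
B_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(B)
```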

Note that you do not need access to set B in advance (or to such sets, even if there are hundreds of them). You normalize each one as soon as it is given to you and you want to test your trained model on it.

Meysam Sadeghi
  • Of course it works. But it is a dirty trick, as you don't always know dataset B beforehand. – raghu May 03 '19 at 12:53
  • Please help me to understand the problem correctly. You are given dataset A and trained a model on it, and then you want to use this model on another dataset (but with different feature ranges)? Or is the question something else? – Meysam Sadeghi May 03 '19 at 12:58
  • Yes, you understood it right. The thing you missed is that there are over 1000 different datasets like B, which you can't include in preprocessing. Sometimes you won't know they exist until someone tests your model on them. The more general the scaling is, the better the model works on different datasets, as neural networks are good at generalization anyway. – raghu May 03 '19 at 13:06
  • Maybe my write-up was not clear; I edited it. Basically, you do not need to know or include set B (or such sets) in preprocessing. You just normalize set A and train on that, and once you are given a new set, you normalize it on the go. – Meysam Sadeghi May 03 '19 at 13:19
  • I am sorry, but you seem to overlook some things in my question. The solution you posted is already in my question. To be clear, you are still talking about `A_test` rather than `B`. – raghu May 03 '19 at 13:29
  • Then what if you do a grid search to find the best scaling and debiasing? – Meysam Sadeghi May 07 '19 at 06:08
  • Yeah, I did on a few scalers and finally settled on `QuantileTransformer`. – raghu May 08 '19 at 07:03