
The dataset I have is a set of quotations that were presented to various customers in order to sell a commodity. Commodity prices are sensitive and standardized on a daily basis, so negotiations around them are tricky. I'm trying to build a classification model that predicts whether a given quotation will be accepted or rejected by a customer.

I tried most of the classifiers I knew about, and XGBClassifier performed the best with ~95% accuracy; it also held up well when I fed it an unseen dataset. I then wanted to test how sensitive the model is to variation in price. To do that, I synthetically recreated quotations at various prices: for example, if a quote was originally presented at $30, I presented the same quote at $5, $10, $15, $20, $25, $35, $40, $45, and so on.
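
Roughly, the price sweep looks like the sketch below; the column name "price", the fitted model variable, and the helper name are placeholders rather than my actual schema:

    import numpy as np
    import pandas as pd

    def price_sweep(model, quote_row, prices):
        """Replicate one quotation at several price points and record the
        model's predicted probability of acceptance at each price."""
        variants = pd.DataFrame([quote_row] * len(prices))
        variants["price"] = list(prices)
        # predict_proba returns [P(reject), P(accept)] per row for a binary label
        variants["p_accept"] = model.predict_proba(variants)[:, 1]
        return variants[["price", "p_accept"]]

    # e.g. a quote originally priced at $30, re-scored from $5 to $45:
    # sweep = price_sweep(model, X_test.iloc[0], np.arange(5, 50, 5))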

I expected the classifier to give high probabilities of success at lower prices and low probabilities at higher prices, but that did not happen. On further investigation, I found that some features were overshadowing the importance of price in the model and had to be dealt with. Although I handled most of them by removing them or engineering them into better representations, I was still stuck with a few features that I simply cannot remove (client-side requirements).
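
For context, the overshadowing showed up when I looked at gain-based feature importances, along these lines (assuming model is the fitted XGBClassifier):

    # gain-based importance per feature, largest first
    importances = model.get_booster().get_score(importance_type="gain")
    for feature, gain in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{feature}: {gain:.1f}")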

When I checked the results, the model was price sensitive on about 30% of the test data and showed promising results there, but on the remaining 70% it wasn't sensitive at all.

That's when the idea struck me to feed in only the segment of the training data where price sensitivity can be clearly captured, i.e. where the success of the quote is inversely related to the quoted price. This meant discarding about 85% of the data; however, the relationship I wanted the model to learn was captured very well.
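
The filtering I have in mind is roughly the following; the "segment", "price", and "accepted" columns are placeholders for whatever grouping and label columns the real data uses:

    import pandas as pd

    def price_sensitive_segments(df, threshold=-0.2):
        """Keep only the segments where acceptance is negatively correlated
        with the quoted price (Spearman correlation below `threshold`)."""
        corr = df.groupby("segment").apply(
            lambda g: g["price"].corr(g["accepted"], method="spearman")
        )
        keep = corr[corr < threshold].index
        return df[df["segment"].isin(keep)]

    # train_subset = price_sensitive_segments(train_df)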

This is going to be an incremental learning process for the model, so each time a new dataset arrives, I'm thinking of first evaluating it for price sensitivity and then feeding in only the price-sensitive segment for training.
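
As a rough sketch of that incremental step (reusing the filter above; XGBClassifier.fit accepts an xgb_model argument so new trees are added on top of the existing booster rather than retraining from scratch, and the column names are again placeholders):

    from xgboost import XGBClassifier

    def incremental_update(model, new_df):
        """Filter the incoming batch for price sensitivity, then continue
        training from the existing booster."""
        sensitive = price_sensitive_segments(new_df)
        X = sensitive.drop(columns=["accepted"])
        y = sensitive["accepted"]
        updated = XGBClassifier(**model.get_params())
        updated.fit(X, y, xgb_model=model.get_booster())
        return updated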

Having given some context to the problem, here are my questions:

  • Does it make sense to filter the dataset down to the segments that exhibit the kind of relationship I'm looking for?

  • After training the model on the smaller segment and reducing the number of features from 21 to 8, accuracy dropped to ~87%, but the model seems to capture price sensitivity very well. I evaluated price sensitivity by taking the test dataset and artificially adding 10 rows for each quotation with varying prices to see how the predicted success probability changes (a sketch of this check follows the list). Is this a viable approach to such a problem?
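
Here is a sketch of that evaluation, reusing the price_sweep helper from earlier; it counts the share of test quotations whose predicted acceptance probability never increases as the quoted price rises:

    import numpy as np

    def share_price_sensitive(model, X_test, prices=np.arange(5, 55, 5)):
        """Fraction of test quotations with a monotonically non-increasing
        acceptance probability across the price sweep."""
        sensitive = 0
        for _, row in X_test.iterrows():
            sweep = price_sweep(model, row, prices)
            if np.all(np.diff(sweep["p_accept"].to_numpy()) <= 0):
                sensitive += 1
        return sensitive / len(X_test)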

1 Answer


To answer your first question: deleting the part of the dataset that doesn't behave the way you want is not a good idea, because the model will then overfit to the data that gives better numbers. Accuracy on that subset will look higher, but when the model is presented with something slightly different from that data, it is less likely to generalize.

To answer the second question: that seems like a reasonable approach, but again I'd recommend keeping the full dataset.

ab123