I am using `rpart()` to build a decision tree for a categorical variable, and I am wondering whether I should use the full data set or just the set of unique rows.
2 Answers
I am answering this as a general question on decision trees, rather than on the R implementation.
The parameters for decision trees are often based on record counts -- minimum leaf size and the minimum number of records needed to attempt a split come to mind. In addition, purity measures are affected by the size of nodes as the tree is being built. When you have duplicated records, you are implicitly putting a weight on the values in those rows.
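To make that weighting point concrete, here is a minimal sketch (with made-up data, not taken from the question) suggesting that duplicating rows behaves much like passing per-row counts through rpart's `weights` argument: the impurity calculations see the same effective counts, although stopping rules such as `minsplit` and `minbucket` are applied to raw observation counts, so the two fits are not guaranteed to be identical.

```r
library(rpart)

set.seed(42)
unique_rows <- data.frame(
  x = runif(50),
  y = factor(sample(c("A", "B"), 50, replace = TRUE))
)
counts <- sample(1:5, 50, replace = TRUE)   # how many times each unique row occurs

# Full data set: each unique row repeated `counts` times
expanded <- unique_rows[rep(seq_len(nrow(unique_rows)), counts), ]

fit_dup <- rpart(y ~ x, data = expanded)                       # duplicates carry the weight
fit_wt  <- rpart(y ~ x, data = unique_rows, weights = counts)  # explicit case weights
```

In both fits, a row that occurs five times contributes five times as much to the split criteria as a row that occurs once.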
This implicit weighting is neither good nor bad. You simply need to understand the data and the model that you want to build. If the duplicated values arise from different runs of an experiment, then they should be fine.
In some cases, duplicates (or, equivalently, weights) can be quite bad. For instance, if you are oversampling the data to get a balanced sample on the target, then the additional rows would be problematic: a single leaf might end up consisting of copies of a single instance from the original data, and overfitting would be a problem.
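As an illustration of that pitfall (again with made-up data), naively balancing a rare class by duplicating its rows lets a leaf clear `minbucket` even though it may be backed by a single original observation:

```r
library(rpart)

set.seed(1)
original <- data.frame(
  x = rnorm(100),
  y = factor(c(rep("common", 95), rep("rare", 5)))
)

# Naive oversampling: repeat each rare row 19 more times to balance the classes
rare_idx    <- which(original$y == "rare")
oversampled <- rbind(original, original[rep(rare_idx, 19), ])

fit <- rpart(y ~ x, data = oversampled,
             control = rpart.control(minbucket = 10))

# A "rare" leaf with 20 observations here may be 20 copies of one original row,
# so the apparent support for that split is illusory.
```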

Thank you Gordon, your answer is insightful. I've found that duplicate data can be a serious problem for time-series classification using a shifted-window approach. A lot of duplicates create *gateways* to a class, and if some of these duplicates leak into the test set, they return a flawed performance measure. A DT trained on duplicates could fail to generalise. – Rakshit Kothari Aug 27 '18 at 23:03
In some ways this would depend on the data itself. Are the duplicated rows valid data? Or are they only partly duplicated but still important?
If the data were temperature measurements in a town at a given hour, the duplicated temperatures may be important: they weight that reading as more likely correct than a lone measurement that differed.
If the data were temperature measurements that three people recorded off the same thermometer at the same time, then you would want to remove the noise from the data by reducing to just unique values.
The answer could very well be a combination of the above. If you had multiple readings that conflicted in the same time period, you would choose the most heavily weighted one and then decide how to break ties; if all the measurements were the same, you would simply remove the duplicates. In this way you are cleaning the data before you put it through the algorithm.
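As a hedged sketch of that cleaning step, assuming hypothetical `hour` and `temp` columns, one could weight each distinct reading by how often it occurs, keep the most heavily weighted reading per hour, and break ties by taking the first:

```r
library(dplyr)

readings <- data.frame(
  hour = c(1, 1, 1, 2, 2, 3),
  temp = c(20, 20, 21, 18, 18, 19)
)

cleaned <- readings %>%
  count(hour, temp, name = "n_obs") %>%            # weight each distinct reading
  group_by(hour) %>%
  slice_max(n_obs, n = 1, with_ties = FALSE) %>%   # heaviest reading wins; first breaks ties
  ungroup() %>%
  select(hour, temp)
```

The deduplicated `cleaned` frame is then what you would feed to `rpart()`.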
It all comes down to what is relevant in the data model and whether the duplicated rows carry information that matters to the result.
