My dataset has more than 300 million rows, so I think spark-ml is a better fit than sklearn. Since many rows share the same set of feature values (they are distinct data points), I further aggregate the dataset by features and target and produce a weight column holding the aggregated count (a PySpark sketch of this setup is at the end of the question). I know that in sklearn the sample weight is incorporated directly into the impurity formula. Take binary classification and Gini impurity as an example:

gini = 1 - p_0^2 - p_1^2, where p_0 = n_0 / (n_0 + n_1) and p_1 = 1 - p_0

With sample weights, p_0 will be calculated as:

p_0 = sum(w_0i) / (sum(w_0i) + sum(w_1j)), where w_0i is the weight of the i-th class-0 (negative) sample and w_1j is the weight of the j-th class-1 (positive) sample.
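
To make that concrete, here is a minimal plain-Python check of the weighted formula (my own sketch, independent of either library): a single sample with weight 3 behaves exactly like three unweighted copies of it.

```python
def weighted_gini(w0, w1):
    """Gini impurity with class probabilities taken as weight fractions:
    p_0 = sum(w0) / (sum(w0) + sum(w1)), p_1 = 1 - p_0."""
    p0 = sum(w0) / (sum(w0) + sum(w1))
    p1 = 1.0 - p0
    return 1.0 - p0 ** 2 - p1 ** 2

# A class-1 sample with weight 3 behaves like three unweighted copies:
print(weighted_gini([1.0, 1.0], [3.0]))            # 0.48
print(weighted_gini([1.0, 1.0], [1.0, 1.0, 1.0]))  # 0.48
```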

However, looking into the Spark ML source code, it seems the sample weight is not used when calculating the class probabilities within a node. It appears to be used only after a split, once the impurities of the left and right children have been computed, to reweight their contributions to the total impurity. Under that reading, a highly weighted positive example would not increase the node's positive probability; it would only add to the node's total weight. This is counter-intuitive and might not be useful for my purpose. So I'm looking for an expert on this topic to tell me whether I made a wrong observation, or whether this kind of reweighting is actually meaningful.
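
To pin down the distinction, here is a plain-Python sketch of the two readings. All numbers are hypothetical, and reading 2 is only my interpretation of the Spark source, not a confirmed description of its internals:

```python
# Hypothetical node statistics: raw counts and summed weights per class.
n0, n1 = 10, 10        # raw sample counts for class 0 / class 1
w0, w1 = 10.0, 30.0    # summed sample weights for class 0 / class 1

# Reading 1 (sklearn-style, as in the formulas above): weights enter the
# class probabilities, so the heavily weighted class dominates the node.
p0 = w0 / (w0 + w1)                           # 0.25
gini_weighted = 1 - p0 ** 2 - (1 - p0) ** 2   # 0.375

# Reading 2 (what I believe I see in the Spark ML source): probabilities
# come from raw counts; the summed weight only scales this node's share
# when the parent combines the left/right child impurities.
q0 = n0 / (n0 + n1)                           # 0.5
gini_counts = 1 - q0 ** 2 - (1 - q0) ** 2     # 0.5
node_weight = w0 + w1                         # 40.0, used only for reweighting
```

For completeness, the aggregation setup described at the top of the question might look like this in PySpark. The column names f1, f2, and label and the input path are placeholders; weightCol exists on DecisionTreeClassifier as of Spark 3.0:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("training_data.parquet")  # placeholder path

# Collapse rows sharing the same (features, target) into one weighted row.
agg = (df.groupBy("f1", "f2", "label")
         .agg(F.count(F.lit(1)).cast("double").alias("weight")))

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(agg)

dt = DecisionTreeClassifier(featuresCol="features", labelCol="label",
                            weightCol="weight")
model = dt.fit(train)
```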
