I have an interesting JavaScript task (performed in Node.js, FWIW): I need to take the "weighted median" of a dataset for which I have values (income, in this case) and a weight for each one. For example:
income  #people
0       5
16000   3
20000   8
32000   4
40000   3
41000   1
50000   2
90000   1
In other words, 8 people make $20K, 2 make $50K, etc. I need the "weighted median" -- the median of all 27 people.
The naive way to do this would be to make an array and seed it with every value, like so:
var incomes = [0, 0, 0, 0, 0, 16000, 16000, 16000, 20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000, 32000, 32000, 32000, 32000, 40000, 40000, 40000, 41000, 50000, 50000, 90000];
One can then easily take the median of this array (which is $20,000). In reality, I have data for between 7,000 and 14,000 people per sample. While I'm sure Node could handle an array this large, it feels incredibly sloppy.
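To make the naive approach concrete, here is a minimal sketch of it. The helper names `expand` and `median` are made up for illustration, and the sketch assumes the weights are non-negative whole numbers:

```javascript
// [income, weight] pairs from the table above.
var vals = [[0,5], [16000,3], [20000,8], [32000,4], [40000,3], [41000,1], [50000,2], [90000,1]];

// Hypothetical helper: build the verbose array by repeating each income `weight` times.
function expand(pairs) {
    var out = [];
    pairs.forEach(function (pair) {
        for (var i = 0; i < pair[1]; i += 1) {
            out.push(pair[0]);
        }
    });
    return out;
}

// Hypothetical helper: plain median of a numeric array.
// Averages the two middle values when the length is even.
function median(arr) {
    var sorted = arr.slice().sort(function (a, b) { return a - b; });
    var mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

var incomes = expand(vals);
console.log(incomes.length, median(incomes)); // 27 20000
```

This costs memory proportional to the total weight rather than to the number of distinct incomes, which is the part that feels sloppy.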
My current solution is to calculate the index of the median value in the hypothetical verbose array -- 13, counting from zero, in this case -- and then loop through the array of incomes and weights, adding up the cumulative weight until it exceeds that index. Here's a simplified example. (Obviously, medians require slightly different rules for even-numbered lists. This is just a POC.)
var halfway = 13,  // zero-based index of the median among 27 values
    progress = 0,  // cumulative weight seen so far
    median;
// [income, weight] pairs, sorted by income
var vals = [[0,5], [16000,3], [20000,8], [32000,4], [40000,3], [41000,1], [50000,2], [90000,1]];
for (var v = 0; v < vals.length; v += 1) {
    progress += vals[v][1];
    // This group covers indexes (progress - weight) through (progress - 1),
    // so it holds the median as soon as progress exceeds halfway.
    if (progress > halfway) {
        median = vals[v][0];
        break;
    }
}
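For the even-numbered case mentioned above, one possible extension of the same loop is sketched here. The names `valueAt` and `weightedMedian` are hypothetical, and the sketch assumes the pairs are sorted by income, the weights are non-negative whole numbers, and the usual convention of averaging the two middle values applies:

```javascript
// Hypothetical helper: value at a zero-based index of the imaginary verbose array.
function valueAt(pairs, index) {
    var progress = 0;
    for (var v = 0; v < pairs.length; v += 1) {
        progress += pairs[v][1];
        // This group covers every index below the running total.
        if (progress > index) {
            return pairs[v][0];
        }
    }
    return undefined; // index is past the end
}

// Hypothetical helper: weighted median without building the verbose array.
function weightedMedian(pairs) {
    var total = pairs.reduce(function (sum, pair) { return sum + pair[1]; }, 0);
    if (total === 0) {
        return undefined;
    }
    var mid = Math.floor(total / 2);
    if (total % 2 === 1) {
        return valueAt(pairs, mid);
    }
    // Even total: average the two middle values.
    return (valueAt(pairs, mid - 1) + valueAt(pairs, mid)) / 2;
}

var vals = [[0,5], [16000,3], [20000,8], [32000,4], [40000,3], [41000,1], [50000,2], [90000,1]];
console.log(weightedMedian(vals)); // 20000
```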
This works OK, but it gets messy when you want to start calculating quartiles and so forth. What would be easier is for me to be able to create a sparse array of the values at their appropriate place in the verbose array without filling in all the intermediate values, then perform lookups on this array for any index up to the maximum. But I need some efficient mechanism for finding the previous known index in the sparse array if (as is likely) the index I'm looking for in the sparse array isn't populated.
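To illustrate what I mean, here is a rough sketch of that kind of lookup. Instead of a literal sparse array, it keeps a sorted list of the index where each income first appears and uses a binary search to find the last known index at or before the one requested. The names `buildIndex` and `lookup` are invented for this example, and it assumes the pairs are sorted by income with non-negative whole-number weights:

```javascript
// Hypothetical helper: record where each income starts in the imaginary verbose array.
function buildIndex(pairs) {
    var starts = [], values = [], total = 0;
    pairs.forEach(function (pair) {
        if (pair[1] > 0) {        // skip empty groups so every start is unique
            starts.push(total);
            values.push(pair[0]);
            total += pair[1];
        }
    });
    return { starts: starts, values: values, total: total };
}

// Hypothetical helper: value at zero-based index i, found by binary search
// for the last populated start index that is <= i.
function lookup(index, i) {
    if (i < 0 || i >= index.total) {
        return undefined;
    }
    var lo = 0, hi = index.starts.length - 1;
    while (lo < hi) {
        var mid = Math.ceil((lo + hi) / 2); // round up so the loop always shrinks
        if (index.starts[mid] <= i) {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    return index.values[lo];
}

var vals = [[0,5], [16000,3], [20000,8], [32000,4], [40000,3], [41000,1], [50000,2], [90000,1]];
var index = buildIndex(vals);
console.log(index.starts);      // [ 0, 5, 8, 16, 20, 23, 24, 26 ]
console.log(lookup(index, 13)); // 20000
console.log(lookup(index, 6));  // 16000
```

Each lookup takes time logarithmic in the number of distinct incomes, so asking for several positions (median, quartiles, and so on) only needs the index built once.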
This seems like it must be a fairly common problem.