3

I have an interesting Javascript task (performed in Node.js, FWIW): I need to take the "weighted median" of a dataset for which I have values (income, in this case) and a weight for each one. For example:

income  #people
0   5
16000   3
20000   8
32000   4
40000   3
41000   1
50000   2
90000   1

In other words, 8 people make $20K, 2 make $50K, etc. I need the "weighted median" -- the median of all 27 people.

The naive way to do this would be to make an array and seed it with every value, like so:

var incomes = [0, 0, 0, 0, 0, 16000, 16000, 16000, 20000, 20000, 20000, 20000, 20000, 20000, 20000, 20000, 32000, 32000, 32000, 32000, 40000, 40000, 40000, 41000, 50000, 50000, 90000];

One can then easily take the median of this array (which is $20,000). In reality, I have data for between 7,000 and 14,000 people per sample. While I'm sure Node could handle an array this large, it feels incredibly sloppy.

My current solution is to calculate the index of the median value in the hypothetical verbose array -- 13, in this case -- and the loop through the array of incomes and weights, adding up the cumulative weight until it reaches or surpasses the halfway point. Here's a simplified example. (Obviously, medians require slightly different rules for even-numbered lists. This is just a POC.)

var halfway = 13,
    progress = 0;

var vals = [[0,5], [16000,3], [20000,8], [32000,4], [40000,3], [41000,1], [50000,2], [90000,1]];

for (var v = 0; v < vals.length; v += 1) {
    progress += vals[v][1];

    if (progress >= halfway) {
        var median = vals[v][0];
        break;
    }
}

This works ok, but it gets messy when you want to start calculating quartiles and so forth. What would be easier is for me to be able to create a sparse array of the values at their appropriate place in the verbose array without filling in all the intermediate values, then perform lookups on this array for any index up to the maximum. But I need some efficient mechanism for finding the previous known index in the sparse array if (as is likely) the index I'm looking for in the spare array isn't populated.

This seems like it must be a fairly common problem.

Chris Wilson
  • 6,599
  • 8
  • 35
  • 71
  • Dumb question but: Did you tried with `vals.forEach(function (element, i, array){});` – DrakaSAN Sep 24 '14 at 15:30
  • Not sure what that would accomplish different from looping the old-fashioned way, like I do above. (I did it that way since it's easier to break out of.) – Chris Wilson Sep 24 '14 at 15:35
  • How many different incomes would you need to handle? Would many of the thousands of people have different incomes? – parchment Sep 24 '14 at 15:41
  • Potentially thousands, yes -- it's Census data. Not the most natural task for Node, I realize, but it's what I know. – Chris Wilson Sep 24 '14 at 16:03
  • Why wouldn't you use an object where the key was the income amount and the value was the number of people that had that exact income. You can iterate over all the incomes by just iterating the properties of the object and you have instant lookup access to any income value and the count of people with that income. It's also more compact to store that duplicating income values for each person. – jfriend00 Sep 24 '14 at 17:21
  • 1
    How would I get from that solution to the median value? – Chris Wilson Sep 24 '14 at 17:55

1 Answers1

1

In terms of computational efficiency I think what you have is about as good as your gonna get, though I'm not sure what difficulties your facing with quartiles (sorry too low rep to ask for clarification on that).

Lets start by looking at the efficiency of what you have. You have an array of length n and you step through it adding to a counter and breaking halfway (I assume that halfway info was given, again sorry too low to ask). So not bad were looking at a simple O(n).

Now what you propose is some form of lookup that given an index knows the nearest populated index, O(1). Well that would be better, so lets look into what we would need for that. Well we would need to move the given data into some new data structure by looping through it.....Oh oops that means were back to O(n).

Moral of the story what you have is good, good work.