
Given a dictionary (or Counter) of tally data like the following:

d={'dan':7, 'mike':2, 'john':3}

and a new dictionary "d_cump" that I want to contain cumulative percentages:

d_cump={'mike':16, 'john':41, 'dan':100}

EDIT: I should clarify that order doesn't matter for my input set, which is why I'm using a dictionary or Counter. Order does matter when calculating cumulative percentages, so I need to sort the data for that operation; once I have the cumulative percentage for each name, I put the results back in a dictionary since, again, order shouldn't matter when I'm looking up single values.

What is the most elegant/pythonic way to get from d to d_cump?

Here is what I have; it seems a bit clumsy:

from numpy import cumsum
d={'dan':7, 'mike':2, 'john':3}
sorted_keys = sorted(d,key=lambda x: d[x])
perc = [x*100/sum(d.values()) for x in cumsum([ d[x] for x in sorted_keys ])]
d_cump=dict(zip(sorted_keys,perc))

>>> d_cump
{'mike': 16, 'john': 41, 'dan': 100}
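For comparison, the same result can be had without numpy in Python 3, where `itertools.accumulate` plays the role of `cumsum` (a sketch; floor division reproduces the integer percentages above):

```python
from itertools import accumulate

d = {'dan': 7, 'mike': 2, 'john': 3}
total = sum(d.values())
sorted_keys = sorted(d, key=d.get)

# running totals over the counts, in ascending order
cumulative = accumulate(d[k] for k in sorted_keys)

# floor division mirrors the integer division in the original snippet
d_cump = {k: 100 * c // total for k, c in zip(sorted_keys, cumulative)}
# d_cump == {'mike': 16, 'john': 41, 'dan': 100}
```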
    Not an answer to your question, but `sorted = sorted(d,key=lambda x: d[x])` clobbers the builtin `sorted` and confuses readers who don't expect to see that. – Steven Rumbalski Feb 01 '12 at 17:51
    Also, if you are looking for data like a cumulative percentage which depends on ordering, then using a dictionary as your underlying data structure is a poor choice. – Steven Rumbalski Feb 01 '12 at 17:55
  • thanks, changed to sorted_keys. I think the dictionary here will be useful (in the end) for fast and convenient lookups, I guess I could stick it into an array of tuples though. So I guess I can start with d.items() if that's better.. – user1183315 Feb 01 '12 at 17:59
  • Also, assuming Counter() is frequently used for tallies, wouldn't generating cumulative percentages for those tallies be a common pattern? Sorta hoping that there is something in the vast collection of Python libs that would be helpful. – user1183315 Feb 01 '12 at 18:02
  • I would say no, it's not a common pattern. Tallies count things in unordered buckets. I don't think there is much value in imposing an order on those buckets that was not found in the original data. Consider the cumulative percentage of `Counter('gallahad')` is `{'a': 100, 'd': 37, 'g': 25, 'h': 12, 'l': 62}`. I don't think it tells us anything about the original data. – Steven Rumbalski Feb 01 '12 at 18:48
  • Continuing your example, if you are willing to work with me on this: say I have two strings of letters, and I tally the letter counts in c1 and c2. For c1 I have c1 = {'a':3, 'b':3, 'c':3}, which gives me c1_cump = {'a': 33, 'c': 66, 'b': 100}. For the other one I have c2 = {'a':3, 'b':3, 'c':10}, so c2_cump = {'a': 18, 'c': 100, 'b': 37}. Now I want to see whether 'b' is above the 60th percentile: I can see that it is for the c1 data, but it isn't for c2. – user1183315 Feb 01 '12 at 19:02
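The percentile check described in the comments can be sketched in Python 3 as follows. Note the `(count, key)` tie-break is an assumption of this sketch, since equal counts have no inherent order, so the exact percentages for tied keys differ from the numbers quoted in the comment, but the 60th-percentile conclusion is the same:

```python
from itertools import accumulate

def cumulative_pct(counts):
    # sort by (count, key) so ties break deterministically -- an assumption,
    # since the original data gives no order for equal counts
    items = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    total = sum(counts.values())
    keys = [k for k, _ in items]
    cums = accumulate(v for _, v in items)
    return {k: 100 * c // total for k, c in zip(keys, cums)}

c1 = cumulative_pct({'a': 3, 'b': 3, 'c': 3})
c2 = cumulative_pct({'a': 3, 'b': 3, 'c': 10})
print(c1['b'] >= 60, c2['b'] >= 60)  # prints: True False
```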

3 Answers


It's hard to tell how a cumulative percentage would be valuable considering the order of the original dictionary is arbitrary.

That said, here's how I would do it:

from numpy import cumsum
from operator import itemgetter

d={'dan':7, 'mike':2, 'john':3}

# unzip keys from values in sorted order
keys, values = zip(*sorted(d.items(), key=itemgetter(1)))
total = sum(values)

# calculate cumsum and zip with keys into new dict
d_cump = dict(zip(keys, (100*subtotal/total for subtotal in cumsum(values))))

Note that there is no special order to the results because dictionaries are not ordered:

{'dan': 100, 'john': 41, 'mike': 16}
Steven Rumbalski
  • Example of how it might be useful: say I want to see whether "john" is above or below the 60th percentile; I would check d_cump['john'], that's convenient right? Say I'm starting with a Counter() to tally data, would it be better to start with d.items()? – user1183315 Feb 01 '12 at 18:44
  • Same response as to the question above for how I'm trying to make this useful: say I have two strings of letters, and I tally the letter counts in c1 and c2. For c1 I have c1 = {'a':3, 'b':3, 'c':3}, which gives me c1_cump = {'a': 33, 'c': 66, 'b': 100}. For the other one I have c2 = {'a':3, 'b':3, 'c':10}, so c2_cump = {'a': 18, 'c': 100, 'b': 37}. Now I want to see whether 'b' is above the 60th percentile: I can see that it is for the c1 data, but it isn't for c2. That's basically what I'm trying to get at, but I'm an idiot when it comes to basic statistics, so any suggestions are welcome. – user1183315 Feb 01 '12 at 19:04
  • Hey down-voter, tell me why and if I agree I'll be happy to edit my answer or even delete it. – Steven Rumbalski Feb 01 '12 at 19:17
  • You know what, now that i've thought about this more I concede your point that this is weird and that i shouldn't be using a dictionary for this. – user1183315 Feb 01 '12 at 19:25

Since you're using numpy anyway, you can bypass/simplify the list comprehensions:

>>> from numpy import cumsum
>>> d={'dan':7, 'mike':2, 'john':3}
>>> sorted_keys = sorted(d,key=d.get)
>>> z = cumsum(sorted(d.values())) # or z = cumsum([d[k] for k in sorted_keys])
>>> d2 = dict(zip(sorted_keys, 100*z//z[-1]))  # floor division matches the integer output below
>>> d2
{'mike': 16, 'john': 41, 'dan': 100}

but as noted elsewhere, it feels weird to be using a dictionary this way.

DSM

Calculating a cumulative value? Sounds like a fold to me!

d = {'dan':7, 'mike':2, 'john':3}
denominator = float(sum(d.viewvalues()))
data = ((k,(v/denominator)) for (k, v) in sorted(d.viewitems(), key = lambda (k,v):v))


import functional
f = lambda (k,v), l : [(k, v+l[0][1])]+l
functional.foldr(f, [(None,0)], [('a', 1), ('b', 2), ('c', 3)])
#=>[('a', 6), ('b', 5), ('c', 3), (None, 0)]

d_cump = { k:v for (k,v) in functional.foldr(f, [(None,0)], data) if k is not None }
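For readers without the functional package, foldr can be emulated with the standard `functools.reduce` over the reversed sequence (a sketch in Python 3 syntax, since the answer's lambdas use Python 2 tuple-unpacking parameters):

```python
from functools import reduce

# same step function as the answer's f, written without tuple parameters
f = lambda kv, l: [(kv[0], kv[1] + l[0][1])] + l

# foldr(f, init, xs) is reduce with the arguments flipped, over reversed(xs)
result = reduce(lambda l, kv: f(kv, l),
                reversed([('a', 1), ('b', 2), ('c', 3)]),
                [(None, 0)])
# result == [('a', 6), ('b', 5), ('c', 3), (None, 0)]
```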

Functional isn't a built-in package. You could also re-jig f to work with a left fold, and hence the standard reduce, if you wanted.

As you can see, this isn't much shorter, but it takes advantage of sequence destructuring to avoid splitting/zipping, and it uses a generator as the intermediate data, which avoids building a list.

If you want to further minimise object creation, you can use this alternative function, which modifies the initial list passed in (but has to use a stupid trick to return the appropriate value, because list.insert returns None).

uni = lambda x:x
ff = lambda (k,v), l : uni(l) if l.insert(0, (k, v+l[0][1])) is None else uni(l)

Incidentally, the left fold is very easy using ireduce (from http://www.ibm.com/developerworks/linux/library/l-cpyiter/index.html), because it eliminates the list construction:

ff = lambda (l, ll), (k, v): (k, v + ll)
g = ireduce(ff, data, (None, 0))

tuple(g)
#=>(('mike', 0.16666666666666666), ('john', 0.41666666666666663), ('dan', 1.0))

def ireduce(func, iterable, init=None):
    if init is None:
        iterable = iter(iterable)
        curr = iterable.next()
    else:
        curr = init
    for x in iterable:
        curr = func(curr, x)
        yield curr

This is attractive because the initial value is not included, and because generators are lazy, and so particularly suitable for chaining.
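For reference, Python 3.8+ can express this same lazy left fold with `itertools.accumulate`, whose `initial` argument plays the role of ireduce's init (a sketch; the seed value is dropped afterwards, just as ireduce never yields it):

```python
from itertools import accumulate, islice

d = {'dan': 7, 'mike': 2, 'john': 3}
total = sum(d.values())
data = sorted(d.items(), key=lambda kv: kv[1])

# accumulate is a lazy left fold; each step carries (key, running fraction)
folded = accumulate(data,
                    lambda acc, kv: (kv[0], kv[1] / total + acc[1]),
                    initial=(None, 0.0))
pairs = list(islice(folded, 1, None))  # drop the (None, 0.0) seed
# pairs is [('mike', ~0.167), ('john', ~0.417), ('dan', ~1.0)]
```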

Note that ireduce above is equivalent to:

def ireduce(func, iterable, init=None):
    from functional import scanl
    if init is None: init = next(iterable)
    sg = scanl(func, init, iterable)
    next(sg)
    return sg
Marcin
    Hm, well you've touched on an area of programming that I know nothing about, I'm going to read a little bit about folds now. – user1183315 Feb 01 '12 at 19:18