
I have a sorteddict and I am interested in the cumulative sum of the values:

>>> from blist import sorteddict
>>> import numpy as np
>>> x = sorteddict({1:1, 2:2, 5:5})
>>> zip(x.keys(), np.cumsum(x.values())) 
[(1, 1), (2, 3), (5, 8)]

However, I frequently need to update the dictionary and so need to recalculate the cumulative sum:

>>> x[4] = 4
>>> zip(x.keys(), np.cumsum(x.values()))
[(1, 1), (2, 3), (4, 7), (5, 12)]

>>> x[3] = 3
>>> zip(x.keys(), np.cumsum(x.values()))
[(1, 1), (2, 3), (3, 6), (4, 10), (5, 15)]

I'm wondering whether instead of constantly recalculating the cumulative sum, there is some clever way of maintaining the cumulative sum efficiently?

Note

>>> import sys
>>> sys.version
'2.7.11 (default, Jun 15 2016, 17:53:20) [MSC v.1800 32 bit (Intel)]'

Also, in general my keys and values will not be the same; I was just lazy in my example.

mkrieger1
mchen
  • Are the keys numbers? Is `sorteddict` a must? What complexity requirement do you have on insertion? – kabanus May 01 '17 at 10:34
  • @kabanus The keys are numbers. A `sorteddict` is not a must but my data comes in (key, value) pairs and the cumulative sum over values should be performed according to the order of the keys. My only requirement is that time complexity should be less than constantly recalculating the cumulative sum. – mchen May 01 '17 at 10:35
  • What do you need the sums for? Is it imperative to update immediately, or do you need them on demand? You will eventually need to propagate the change upstream, so the use case is important here. I'm guessing you want O(1) when requesting a sum. – kabanus May 01 '17 at 11:18
  • @kabanus I need it immediately after every update. Usually I only need to update a single (key, value) pair, but sometimes I need to update a small batch of (key, value) pairs. When I update in batch I would only need the cumulative sum after the whole batch has been written. – mchen May 01 '17 at 11:22
  • This is a difficult problem then: you want both immediate insertion and immediate calculation. I think you may have to give something up, either slowing down insertion or slowing down access to the sums. In any case, a small optimization is to maintain the dictionary yourself and, during an insertion, update only the cumulative sums upstream of the inserted key. Assuming insertions and changes land in the middle on average, this cuts the time roughly in half. – kabanus May 01 '17 at 11:23
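
The suggestion in the last comment (keep the sorted structure yourself and, on each write, adjust only the sums at and after the changed key) could be sketched roughly as follows. The class name `RunningCumSum` and its methods are hypothetical, and the sketch assumes numeric values and comparable keys. A write is still O(n) in the worst case, but entries before the changed key are never touched.

```python
import bisect


class RunningCumSum(object):
    """Hypothetical helper: sorted keys plus always-current cumulative sums."""

    def __init__(self):
        self.keys = []     # keys in ascending order
        self.values = []   # values aligned with self.keys
        self.cumsums = []  # cumsums[i] == sum(values[:i + 1])

    def set(self, key, value):
        """Insert or overwrite one pair and patch only the affected sums."""
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            # Existing key: every sum from position i onward shifts by the change.
            delta = value - self.values[i]
            self.values[i] = value
        else:
            # New key: seed its slot with the preceding sum; the loop adds value.
            delta = value
            before = self.cumsums[i - 1] if i > 0 else 0
            self.keys.insert(i, key)
            self.values.insert(i, value)
            self.cumsums.insert(i, before)
        for j in range(i, len(self.cumsums)):
            self.cumsums[j] += delta

    def items(self):
        """Return (key, cumulative sum) pairs in key order."""
        return list(zip(self.keys, self.cumsums))


c = RunningCumSum()
for k, v in [(1, 1), (2, 2), (5, 5)]:
    c.set(k, v)
c.set(4, 4)
print(c.items())  # [(1, 1), (2, 3), (4, 7), (5, 12)]
```

For a batch of updates, calling `set` once per pair gives the right result, though a full recomputation may be cheaper when the batch touches many early keys.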

1 Answer


How about this:

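import collections

def add_and_regenerate_sums(term, master):
    """Insert one (key, value) pair, then rebuild every cumulative sum."""
    key, value = term
    master[key] = value
    # Re-sort by key so the running total follows key order.
    master = collections.OrderedDict(sorted(master.items(), key=lambda kv: kv[0]))
    sums = collections.OrderedDict()
    running_total = 0
    for k, v in master.items():
        running_total += v
        sums[k] = running_total
    # Return an OrderedDict: a plain dict does not keep key order on Python 2.7.
    return sums, master

x = collections.OrderedDict({1: 1, 2: 2, 5: 5})

sums, master = add_and_regenerate_sums((3, 10), x)
print(sums)    # cumulative sums by key: 1 -> 1, 2 -> 3, 3 -> 13, 5 -> 18
print(master)  # the updated dictionary, sorted by key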

The function returns the cumulative sums after the insertion together with the updated, key-sorted dictionary, which you can pass back in for the next update.
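
For comparison, here is a different technique from the function above, which rebuilds every sum on each call. If the set of possible keys is known up front (an assumption the question does not state) and only individual cumulative sums are needed at a time, a Fenwick tree (binary indexed tree) gives O(log n) updates and O(log n) queries. This is a minimal sketch; `FenwickCumSum` and its method names are hypothetical.

```python
import bisect


class FenwickCumSum(object):
    """Hypothetical sketch: Fenwick tree over a fixed, known set of keys."""

    def __init__(self, possible_keys):
        self.keys = sorted(set(possible_keys))   # fixed key universe (assumption)
        self.values = [0] * len(self.keys)       # current value per key
        self.tree = [0] * (len(self.keys) + 1)   # 1-based Fenwick array

    def set(self, key, value):
        """Set the value for key in O(log n)."""
        i = bisect.bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            raise KeyError(key)
        delta = value - self.values[i]
        self.values[i] = value
        i += 1  # switch to 1-based indexing
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i  # move to the next node covering this position

    def cumsum(self, key):
        """Return the sum of values for all keys <= key in O(log n)."""
        i = bisect.bisect_right(self.keys, key)
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # drop the lowest set bit
        return total


f = FenwickCumSum([1, 2, 3, 4, 5])
for k, v in [(1, 1), (2, 2), (5, 5)]:
    f.set(k, v)
f.set(4, 4)
print(f.cumsum(4))  # 7
print(f.cumsum(5))  # 12
```

Listing all cumulative sums this way costs O(n log n), so it only pays off when a few sums are read per update.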

Utkonos