Quickly convert numpy arrays with index to dict of numpy arrays keyed on that index

Question

I have a set of numpy arrays. One of these is a list of "keys", and I'd like to rearrange the arrays into a dict of arrays keyed on that key. My current code is:

for key, val1, val2 in itertools.izip(keys, vals1, vals2):
    dict1[key].append(val1)
    dict2[key].append(val2)

This is pretty slow, since the arrays involved are millions of entries long, and this happens many times. Is it possible to rewrite this in vectorized form? The set of possible keys is known ahead of time, and there are ~10 distinct keys.

Edit: If there are k distinct keys and the list is n long, the current answers are O(nk) (iterate once for each key) and O(n log n) (sort first). I'm still looking for an O(n) vectorized solution, though. This is hopefully possible; after all, the easiest possible nonvectorized thing (i.e. what I already have) is O(n).

I think Pandas has tools for this kind of thing, but you're not going to have much luck with pure NumPy. — user2357112, Sep 02 '16 at 23:05
@knzhou: I have an implementation that is O(n log n), but even with 10 keys and 20 million entries, it is almost four times faster than your O(n) solution. Are you really not interested? — Warren Weckesser, Sep 03 '16 at 00:01
You say there are ~10 distinct keys. What is the data type of the keys? — Warren Weckesser, Sep 03 '16 at 00:25
@WarrenWeckesser They're strings, though that could be changed. — knzhou, Sep 03 '16 at 00:25
It isn't really necessary, but it could eliminate a bit of "bookkeeping". — Warren Weckesser, Sep 03 '16 at 00:28
By the way, "big O" notation is asymptotic behavior. For such a small number of keys, there is no point in worrying about the number of keys when considering the asymptotic time complexity. For fixed k, O(kn) and O(n) are the same thing. — Warren Weckesser, Sep 03 '16 at 00:34
Thinking about it a bit more, an O(n) vectorized solution actually might be _slower_, because it probably wouldn't have good locality. All the other solutions here do all reads/writes sequentially in memory. — knzhou, Sep 03 '16 at 00:37
So I'll just take whichever one of the solutions is fastest. — knzhou, Sep 03 '16 at 00:37
You might find http://stackoverflow.com/questions/34179968/are-there-any-cases-where-you-would-prefer-a-higher-big-o-time-complexity-algori interesting. — Warren Weckesser, Sep 03 '16 at 00:46

John1024 · Answer 1 · 2016-09-02T23:37:28.853

3

Let's import numpy and create some sample data:

>>> import numpy as np
>>> keys = np.array(('key1', 'key2', 'key3', 'key1', 'key2', 'key1'))
>>> vals1 = np.arange(6)
>>> vals2 = np.arange(10, 16)

Now, let's create the dictionary:

>>> d1 = {}; d2 = {}
>>> for k in set(keys):
...   d1[k] = vals1[k==keys]
...   d2[k] = vals2[k==keys]
... 
>>> d1
{'key3': array([2]), 'key2': array([1, 4]), 'key1': array([0, 3, 5])}
>>> d2
{'key3': array([12]), 'key2': array([11, 14]), 'key1': array([10, 13, 15])}

The idea behind numpy is that C code is much faster than python and numpy provides many common operations coded at the C level. As you mentioned that there were only "~10 distinct keys," that means that the python loop is done only 10 or so times. The rest is C.

edited Sep 02 '16 at 23:37

answered Sep 02 '16 at 23:07

John1024

109,961
14
137
171

This appears to iterate through the vals arrays many times, though. Is there a way to do this in one pass? – knzhou Sep 02 '16 at 23:08
1

In this version, the iteration through `vals` is done at the level of C-code, not python code. That makes it "fast". – John1024 Sep 02 '16 at 23:12
This is only "fast" for small numbers of keys. The time complexity of this method is O(n * k) where n is the size of keys and k is the number of unique keys. If k is large, this method will be slower than the naive implementation (which has complexity of O(n)). – Bi Rico Sep 02 '16 at 23:49
1

_"This is only 'fast' for small numbers of keys._" The small number of keys was part of the OP's problem specification. – John1024 Sep 03 '16 at 00:00

Bi Rico · Accepted Answer · 2016-09-03T00:00:10.943

The vectorized way to do this is probably going to require you to sort your keys. The basic idea is that you sort the keys and vals to match. Then you can split the val array every time there is a new key in the sorted keys array. The vectorized code looks something like this:

import numpy as np

keys = np.random.randint(0, 10, size=20)
vals1 = np.random.random(keys.shape)
vals2 = np.random.random(keys.shape)

order = keys.argsort()
keys_sorted = keys[order]

# Find uniq keys and key changes
diff = np.ones(keys_sorted.shape, dtype=bool)
diff[1:] = keys_sorted[1:] != keys_sorted[:-1]
key_change = diff.nonzero()[0]
uniq_keys = keys_sorted[key_change]

vals1_split = np.split(vals1[order], key_change[1:])
dict1 = dict(zip(uniq_keys, vals1_split))

vals2_split = np.split(vals2[order], key_change[1:])
dict2 = dict(zip(uniq_keys, vals2_split))

This method has complexity O(n * log(n)) because of the argsort step. In practice, argsort is very fast unless n is very large. You're likely going to run into memory issues with this method before argsort gets noticeably slow.

This is O(n log n), though. I guess you can use a radix sort instead, but that just gets us to O(nk), and I think in that case the other answer has a better constant. — knzhou, Sep 02 '16 at 23:52
I suggest you profile some of these methods, every time I profile I realize I know less than I think. — Bi Rico, Sep 02 '16 at 23:57

score 2 · Answer 3 · answered Sep 03 '16 at 00:34

Some timings:

import numpy as np
import itertools

def john1024(keys, v1, v2):
  d1 = {}; d2 = {};
  for k in set(keys):
    d1[k] = v1[k==keys]
    d2[k] = v2[k==keys]
  return d1,d2

def birico(keys, v1, v2):
  order = keys.argsort()
  keys_sorted = keys[order]
  diff = np.ones(keys_sorted.shape, dtype=bool)
  diff[1:] = keys_sorted[1:] != keys_sorted[:-1]
  key_change = diff.nonzero()[0]
  uniq_keys = keys_sorted[key_change]
  v1_split = np.split(v1[order], key_change[1:])
  d1 = dict(zip(uniq_keys, v1_split))
  v2_split = np.split(v2[order], key_change[1:])
  d2 = dict(zip(uniq_keys, v2_split))
  return d1,d2

def knzhou(keys, v1, v2):
  d1 = {k:[] for k in np.unique(keys)}
  d2 = {k:[] for k in np.unique(keys)}
  for key, val1, val2 in itertools.izip(keys, v1, v2):
    d1[key].append(val1)
    d2[key].append(val2)
  return d1,d2

I used 10 keys, 20 million entries:

import timeit

keys = np.random.randint(0, 10, size=20000000) #10 keys, 20M entries
vals1 = np.random.random(keys.shape)
vals2 = np.random.random(keys.shape)

timeit.timeit("john1024(keys, vals1, vals2)", "from __main__ import john1024, keys, vals1, vals2", number=3)
11.121668815612793
timeit.timeit("birico(keys, vals1, vals2)", "from __main__ import birico, keys, vals1, vals2", number=3)
8.107877969741821
timeit.timeit("knzhou(keys, vals1, vals2)", "from __main__ import knzhou, keys, vals1, vals2", number=3)
51.76217794418335

So, we see than the sorting technique is a bit faster than letting Numpy find the indices corresponding to each key, but of course both are much much faster than looping in Python. Vectorization is great!

This is on Python 2.7.12, Numpy 1.9.2

Well, I'm sold! birico's solution it is. – knzhou Sep 03 '16 at 00:36 — knzhou, Sep 03 '16 at 00:36

score 0 · Answer 4 · answered Sep 03 '16 at 00:16

defaultdict is intended for building dictionaries like this. In particular is streamlines the step of creating a new dictionary entry for a new key.

In [19]: keys = np.random.choice(np.arange(10),100)
In [20]: vals=np.arange(100)
In [21]: from collections import defaultdict
In [22]: dd = defaultdict(list)
In [23]: for k,v in zip(keys, vals):
    ...:     dd[k].append(v)
    ...:     
In [24]: dd
Out[24]: 
defaultdict(list,
            {0: [4, 39, 47, 84, 87],
             1: [0, 25, 41, 46, 55, 58, 74, 77, 89, 92, 95],
             2: [3, 9, 15, 24, 44, 54, 63, 66, 71, 80, 81],
             3: [1, 13, 16, 37, 57, 76, 91, 93],
             ...
             8: [51, 52, 56, 60, 68, 82, 88, 97, 99],
             9: [21, 29, 30, 34, 35, 59, 73, 86]})

But with a small known set of keys you don't need this specialized dictionary, since you can easily create the dictionary key entries ahead of time

dd = {k:[] for k in np.unique(keys)}

But since you are starting with arrays, array operations to sort and collect like values might well be worth it.

Quickly convert numpy arrays with index to dict of numpy arrays keyed on that index

4 Answers4