1

Are numpy structured arrays an alternative to Python dict?

I would like to save memory, and I cannot afford much of a performance decline.

In my case, the keys are str and the values are int.

Can you give a quick conversion line in case they actually are an alternative?

I also don't mind if you can suggest a different alternative.

I need to save memory, because some dictionaries get larger than 50 GB in memory and I need to open multiple at a time with 'only' 192 GB RAM available.

Radio Controlled
  • 825
  • 8
  • 23
  • 2
    What is your use case? How big is your `dict`? How much memory do you need to save, and why? What performance goals are you trying to hit? Do you need key-based lookup? This question is too vague to answer as is. – tzaman Sep 24 '19 at 09:55
  • If you are using your dictionary like an array, then maybe, but if you are relying on fast look-ups by key, then probably no. – juanpa.arrivillaga Sep 24 '19 at 10:42
  • 1
    The structured array that you propose is (in effect) just two arrays of the same size, one with string elements, and other with numeric. – hpaulj Sep 24 '19 at 16:22
  • Is that because I would need O(n) time to search the string array and get the index to address the integer array? – Radio Controlled Sep 24 '19 at 16:31
  • I ask because here (https://stackoverflow.com/a/35485100/1367655) I asked what the time complexity is and someone said it's O(1)... – Radio Controlled Sep 24 '19 at 16:32

3 Answers

4

Maybe a bit late, but in case others have the same question, I did some simple benchmarking:

In [1]: import random

In [2]: import string

In [3]: import pandas as pd

In [4]: import sys

In [5]: size = 10**6

In [6]: d = {''.join(random.choices(string.ascii_letters + string.digits, k=32)): random.randrange(size) for _ in range(size)}

In [7]: s = pd.Series(d)

In [8]: a = s.values.view(list(zip(s.index, ['i8'] * size)))

In [9]: key = s.index[random.randrange(size)]

In [10]: %timeit d[key]
61.5 ns ± 1.46 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [11]: %timeit s[key]
3.54 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [12]: %timeit a[key]
154 ns ± 1.63 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [13]: sys.getsizeof(d)
Out[13]: 41943136

In [14]: sys.getsizeof(s)
Out[14]: 130816632

It seems that the numpy structured array is about 2-3 times slower than dict (which is acceptable), while the performance of pd.Series is much worse than the other two. As for space efficiency, dict also outperforms pd.Series. I didn't find a way to get the memory usage of a numpy structured array (I tried sys.getsizeof and ndarray.nbytes, but it seems both miss the size of the fields).
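
For completeness, a rough estimate can be pieced together by hand. This is only a sketch, assuming the dominant unreported cost is the per-field metadata (one Python string key per field) stored in the dtype; it ignores numpy's internal offset/format bookkeeping:

import sys

# Raw data buffer: 8 bytes per 'i8' field, reported by nbytes.
data_bytes = a.nbytes

# Approximate dtype overhead: one Python str per field name.
field_bytes = sum(sys.getsizeof(name) for name in a.dtype.names)

print(data_bytes + field_bytes)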

H3NT41
  • 81
  • 1
  • 4
1

A numpy array uses a contiguous block of memory and can store only one type of object, such as int, float, string, or some other object, where each item is allocated a fixed number of bytes in memory.

Numpy also provides a set of functions, implemented in C, for operations like traversing an array, arithmetic, and some string operations on the stored items. As these operations don't carry Python's overhead, they are normally more efficient in terms of both memory and processing power.

As you need key-value pairs, you can also store them in a numpy array, similar to a C struct, but it won't have dict features like looking up an item, checking whether a key exists, filtering, etc. You have to do those yourself using array functionality, e.g. as in the sketch below.
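
For example, a key lookup can be emulated with two parallel arrays sorted by key and a binary search via np.searchsorted. This is just a sketch with made-up example data, not a drop-in dict replacement:

import numpy as np

# Hypothetical data: keys sorted, values aligned with the sorted keys.
keys = np.array(['apple', 'kiwi', 'pear'])
values = np.array([3, 7, 1], dtype='i8')

def lookup(k):
    i = np.searchsorted(keys, k)        # O(log n) binary search
    if i < len(keys) and keys[i] == k:  # emulate dict's KeyError behaviour
        return values[i]
    raise KeyError(k)

print(lookup('kiwi'))  # 7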

A better option for you may be a pandas Series, which also uses a numpy array to store its data but provides a lot of functionality on top of it.
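
For instance, assuming a plain str-to-int mapping as in the question:

import pandas as pd

d = {'apple': 3, 'kiwi': 7, 'pear': 1}  # toy example dict
s = pd.Series(d)                        # keys become the index
print(s['kiwi'])                        # dict-like lookup -> 7
print('pear' in s.index)                # membership test -> True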

Dev Khadka
  • 5,142
  • 4
  • 19
  • 33
  • 1
    Yeah, a `pandas.DataFrame` is probably the child of `dict` and `ndarray` that the OP is looking for. – Daniel F Sep 24 '19 at 10:20
  • A clarification on numpy arrays. Numpy has (at least) 3 types of arrays: ndarrays, record arrays, and structured arrays. It's true, ndarrays can only store one object type. However, both record arrays and structured arrays can store objects of different types (as defined with the dtype). You can index objects using the field names, similar to a dictionary. They are worth investigating if you have to handle a large number of data types. – kcw78 Sep 24 '19 at 21:58
0

You can check an object's in-memory size using sys:

import sys
import numpy as np

x = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
    dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])

print(sys.getsizeof(x))

y = {
    1 : {
        'name': 'Rex',
        'age' : 9,
        'weight' : 81.0
    },
    2 : {
        'name': 'Fido',
        'age' : 3,
        'weight' : 27.0
    }
}

print(sys.getsizeof(y))
192
240

These are the results. I believe numpy structured arrays are relatively memory-efficient.
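
Note, however, that sys.getsizeof is shallow: for the nested dict it counts only the outer container, not the inner dicts, keys, or values, so the comparison above understates the dict's footprint. A rough deep measurement, sketched with only the standard library, could look like this:

import sys

def deep_getsizeof(obj, seen=None):
    # Roughly sum getsizeof over an object and, for dicts, its contents.
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    return size

print(deep_getsizeof(y))  # counts the nested dicts and their contents too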

Prateek Jha
  • 136
  • 2
  • 4