
Say we have a list of instances of a class, which all have an attribute that we know is a float -- call the attribute x. At various points in a program, we want to extract a numpy array of all values of x for running some analysis on the distribution of x. This extraction process is done a lot, and it's been identified as a slow part of the program. Here is an extremely simple example to illustrate specifically what I have in mind:

import numpy as np

# Create example object with list of values
class stub_object(object):
    def __init__(self, x):
        self.x = x

# Define a list of these fake objects
stubs = [stub_object(i) for i in range(10)]

# ...much later, want to quickly extract a vector of this particular attribute:
numpy_x_array = np.array([a_stub.x for a_stub in stubs])

Here's the question: is there a clever, faster way to track the "x" attribute across instances of stub_object in the "stubs" list, such that constructing the "numpy_x_array" is faster than the process above?

Here's a rough idea I am trying to hammer out: can I create a "global to the class type" numpy vector, which will update as the set of objects updates, but I can operate on efficiently any time I want?

All I am really looking for is a "nudge in the right direction." Providing keywords I can google / search SO / docs further is exactly what I am looking for.

For what it is worth, I've looked into these, which have gotten me a little further but not completely there:

Others I looked at, which were not as helpful:

(One option, of course, is to "simply" overhaul the structure of the code, such that instead of a "stubs" list of "stub_objects," there is one large object, something like stub_population, which maintains the relevant attributes in lists and/or numpy arrays, and methods that simply act on elements of those arrays. The downside to that is lots of refactoring, and some reduction of the abstraction and flexibility of modeling the "stub_object" as its own thing. I'd like to avoid this if there is a clever way to do so.)
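For concreteness, a minimal sketch of that stub_population alternative (the name and methods here are illustrative, not from any library) could be:

```python
import numpy as np

# Illustrative sketch: one container object holds the attribute array,
# and each "stub" is just an index into it.
class stub_population(object):
    def __init__(self, n):
        # contiguous storage for all x values
        self.x = np.zeros(n, dtype=np.float64)

    def set_x(self, i, value):
        self.x[i] = value

    def get_x(self, i):
        return self.x[i]

pop = stub_population(10)
for i in range(10):
    pop.set_x(i, float(i))

numpy_x_array = pop.x  # no per-extraction list comprehension needed
```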

Edit: I am using Python 2.7.x

Edit 2: @hpaulj, your example has been a big help -- answer accepted.

Here's the extremely simple first-pass version of the example code above that is doing what I want. There are very preliminary indications of a possible order-of-magnitude speedup, without significant rearrangement of the code body. Excellent. Thanks!

import numpy as np

size = 20

# Create example object with list of values
class stub_object(object):
    _x = np.zeros(size, dtype=np.float64)

    def __init__(self, x, i):
        # A quick cop-out for expanding the array:
        if i >= len(self._x):
            raise Exception("Index i = " + str(i) + " is larger than allowable object size of len(self._x) = " + str(len(self._x)))
        self.x = self._x[i:i+1]
        self.set_x(x)

    def get_x(self):
        return self.x[0]

    def set_x(self, x_new):
        self.x[0] = x_new

# Examine:

# Define a list of these fake objects
stubs = [stub_object(x=i**2, i=i) for i in range(size)]

# ...much later, want to quickly extract a vector of this particular attribute:
#numpy_x_array = np.array([a_stub.x for a_stub in stubs])

# Now can do: 
numpy_x_array = stub_object._x  # or
numpy_x_array = stubs[0]._x     # if need to use the list to access

Not using properties yet, but I really like that idea a lot; it should go a long way toward keeping the code very close to unchanged.
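A sketch of what that property version could look like, assuming the same shared class array (illustrative only):

```python
import numpy as np

size = 20

# Illustrative sketch: expose x as a property so existing "a_stub.x"
# call sites stay unchanged while storage lives in the shared array.
class stub_object(object):
    _x = np.zeros(size, dtype=np.float64)

    def __init__(self, x, i):
        self._i = i
        self.x = x  # routed through the property setter below

    @property
    def x(self):
        return self._x[self._i]

    @x.setter
    def x(self, x_new):
        self._x[self._i] = x_new

stubs = [stub_object(x=i**2, i=i) for i in range(size)]
numpy_x_array = stub_object._x  # always reflects current values
```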

CompEcon
  • Mmm, I have an idea that I want to test but one thing isn't quite clear to me from your question; is `x` a variable? You mention wanting methods to act on this array of attributes, and you want the array to be updated as new instances are made, but is it also a requirement that `x` may be modified and the change should be reflected in the array? I'm not sure I can incorporate that last part if so. – roganjosh Apr 20 '17 at 15:41
  • Yes, as you say, it is a requirement that x may be modified and that change should be reflected in the array. That's the part that makes it tricky. Also, yes, x is a variable - a float, in fact (see first line, which may have not been worded clearly enough). – CompEcon Apr 20 '17 at 18:54

1 Answer


The basic problem is that your objects are scattered throughout memory, with the attribute stored in each object's dictionary. But for array work, the values have to be stored in a contiguous data buffer.

I've explored this in other SO questions, but the ones you found are the earlier ones, so I don't have a great deal to add.

np.array([a_stub.x for a_stub in stubs])

The alternatives using itertools or fromiter shouldn't change the speed much, because the time sink is the a_stub.x attribute access, not the iteration mechanism itself. You could verify that by timing something simpler, like

np.array([1 for _ in range(len(stubs))])
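That comparison can be timed directly; a sketch (exact numbers will vary by machine and list size):

```python
import timeit

import numpy as np

# Timing sketch: compare the attribute-access comprehension against a
# constant comprehension to see where the time actually goes.
class stub_object(object):
    def __init__(self, x):
        self.x = x

stubs = [stub_object(float(i)) for i in range(10000)]

t_attr = timeit.timeit(lambda: np.array([s.x for s in stubs]), number=50)
t_const = timeit.timeit(lambda: np.array([1 for _ in range(len(stubs))]), number=50)

print("attribute access: %.4fs, constant: %.4fs" % (t_attr, t_const))
```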

I suspect the best option is to use one or more arrays as the primary storage, and refactor your class so that the attribute is fetched from that storage.

If you know you'll have 10 objects, make an empty array of that size. When you create an object, assign it a unique index. The x attribute can then be a property whose getter/setter accesses the data[i] element of that array. By making x a property instead of a primary attribute, you should be able to keep most of the object machinery, and you can experiment with different storage methods by changing just a couple of methods.

I was trying to sketch this out using a class attribute as the primary array storage, but I still have some bugs.


Class with x property that accesses an array:

import numpy as np

class MyObj(object):
    xdata = np.zeros(10)
    def __init__(self, idx, x):
        self._idx = idx
        self.set_x(x)
    def set_x(self, x):
        self.xdata[self._idx] = x
    def get_x(self):
        return self.xdata[self._idx]
    def __repr__(self):
        return "<obj>x=%s" % self.get_x()
    x = property(get_x, set_x)

In [67]: objs = [MyObj(i, 3*i) for i in range(10)]
In [68]: objs
Out[68]: 
[<obj>x=0.0,
 <obj>x=3.0,
 <obj>x=6.0,
 ...
 <obj>x=27.0]
In [69]: objs[3].x
Out[69]: 9.0
In [70]: objs[3].xdata
Out[70]: array([  0.,   3.,   6.,   9.,  12.,  15.,  18.,  21.,  24.,  27.])
In [71]: objs[3].xdata += 3
In [72]: [o.x for o in objs]
Out[72]: [3.0, 6.0, 9.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0]

In-place changes to the array are easiest, but it is also possible to replace the array itself (and thus 'grow' the class set):

In [79]: MyObj.xdata=np.ones((20,))    
In [80]: a = MyObj(11,25)
In [81]: a
Out[81]: <obj>x=25.0
In [82]: MyObj.xdata
Out[82]: 
array([  1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,
        25.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.])
In [83]: [o.x for o in objs]
Out[83]: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
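Replacing xdata with a fresh array, as above, discards the old values. A growth helper that copies them over might look like this sketch (the grow classmethod is an illustrative addition, not part of the original code):

```python
import numpy as np

class MyObj(object):
    xdata = np.zeros(10)

    def __init__(self, idx, x):
        self._idx = idx
        self.set_x(x)

    def set_x(self, x):
        self.xdata[self._idx] = x

    def get_x(self):
        return self.xdata[self._idx]

    x = property(get_x, set_x)

    @classmethod
    def grow(cls, new_size):
        # Only enlarging is handled here: copy old values into the new
        # array, then rebind the class attribute so all instances see it.
        old = cls.xdata
        cls.xdata = np.zeros(new_size)
        cls.xdata[:len(old)] = old

objs = [MyObj(i, 3 * i) for i in range(10)]
MyObj.grow(20)
a = MyObj(15, 25.0)  # new slot available, old values intact
```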

We have to be careful about modifying attributes. For example, I tried

objs[3].xdata += 3

intending to change xdata for the whole class, but this ended up binding xdata as an instance attribute on just that object. We should also be able to auto-increment the object index (these days I'm more familiar with numpy methods than with Python class structures).
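The mechanism can be seen directly in a small sketch: the augmented assignment does the in-place add, then rebinds the name in the instance's __dict__, shadowing the class attribute once the class array is later replaced.

```python
import numpy as np

class MyObj(object):
    xdata = np.zeros(3)

o = MyObj()
o.xdata += 3                       # in-place add, then "o.xdata = result"

print('xdata' in o.__dict__)       # True: instance attribute now exists
print(o.xdata is MyObj.xdata)      # True: still the same array object

MyObj.xdata = np.ones(3)           # later rebind the class attribute...
print(o.xdata is MyObj.xdata)      # False: o keeps the shadowed old array

# To update the whole class safely, operate on the class directly:
MyObj.xdata += 3
```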


If I replace the getter with one that fetches a slice:

 def get_x(self):
     return self.xdata[self._idx:self._idx+1]

In [107]: objs=[MyObj(i,i*3) for i in range(10)]
In [109]: objs
Out[109]: 
[<obj>x=[ 0.],
 <obj>x=[ 3.],
 ...
 <obj>x=[ 27.]]

np.info (or the __array_interface__ attribute) gives information about the xdata array, including its data buffer pointer:

In [110]: np.info(MyObj.xdata)
class:  ndarray
shape:  (10,)
strides:  (8,)
itemsize:  8
aligned:  True
contiguous:  True
fortran:  True
data pointer: 0xabf0a70
byteorder:  little
byteswap:  False
type: float64

The slice for the 1st object points to the same place:

In [111]: np.info(objs[0].x)
class:  ndarray
shape:  (1,)
strides:  (8,)
itemsize:  8
....
data pointer: 0xabf0a70
...

The next object points to the next float (8 bytes further):

In [112]: np.info(objs[1].x)
class:  ndarray
shape:  (1,)
...
data pointer: 0xabf0a78
....

I'm not sure whether access by slice/view is worth it.
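np.shares_memory (available in modern numpy) offers a way to confirm the overlap without reading data pointers by hand; a sketch using the slice-based getter:

```python
import numpy as np

class MyObj(object):
    xdata = np.zeros(10)

    def __init__(self, idx, x):
        self._idx = idx
        self.x[0] = x  # write through the view returned by the getter

    @property
    def x(self):
        # slice of length 1: a view into the shared class array
        return self.xdata[self._idx:self._idx + 1]

objs = [MyObj(i, 3 * i) for i in range(10)]

print(np.shares_memory(objs[0].x, MyObj.xdata))  # same buffer
objs[2].x[0] = 99.0                              # writing through the view...
print(MyObj.xdata[2])                            # ...updates the class array
```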

hpaulj
  • "I was trying to sketch this out using a class attribute as the primary array storage, but I still have some bugs." is exactly what I was thinking of trying, with another as a counter for the index in the array. However, I didn't consider the pre-determined array size as an issue, I was just going to add to the array; would that crumble the approach? – roganjosh Apr 20 '17 at 16:14
  • Actually, am I right in thinking that adding to an array of non-pre-determined size would just incur the overhead of creating a copy? In the long-run, that would still be faster than using a list comp to create a list of the attributes every time it was needed? – roganjosh Apr 20 '17 at 16:23
  • 1
    I reworked the sample so the array is a class attribute, and can be changed, copied or grown without breaking the property mechanism. – hpaulj Apr 20 '17 at 16:47
  • Appreciated, your approach is more comprehensive than I had envisioned in my own approach. Unfortunately, I cannot upvote a second time but this is a really interesting crossover between python and numpy that I think I can use elsewhere as an API bridge directly to numpy. – roganjosh Apr 20 '17 at 16:52
  • Just out of curiosity, did you try to vectorize `operator.attrgetter`? – hilberts_drinking_problem Apr 20 '17 at 19:12
  • 1
    `attrgetter` is a Python class that lets you fetch several attributes at once from a single object. It doesn't use any compiled speedups, and doesn't work on a list of objects. – hpaulj Apr 20 '17 at 19:30
  • To add one comment -- I *will* know the size that xdata needs to be when I write the class. I won't need to change that. – CompEcon Apr 20 '17 at 20:29
  • And one more note: this answer, which uses pointers to spots in an ndarray, makes me feel like I am close: http://stackoverflow.com/questions/6480310/ctypes-pointer-into-the-middle-of-a-numpy-array – CompEcon Apr 20 '17 at 20:36
  • I added an example of accessing elements of `xdata` by slice/view. – hpaulj Apr 20 '17 at 22:16