
I'm working with a very significant amount of data in Python and I'm running for loops on very large lists of class objects. This takes forever, obviously, and I'm realizing the best solution is to vectorize my lists using numpy arrays. However, I've been unable to come across a method that would allow me to convert my object lists into the vectors I require.

If I have a list with, say, 5 instances of a "sentence" class, and these objects have attributes such that each instance in the list looks something like this:

{
    text: "I liked this phone.",
    rating: 5.0,
    positive: True
}

is there a way to turn this into a 5x3 numpy array, where row[0] of each row would give me the object's text?

MP12389

2 Answers


So your object is a dictionary like:

In [49]: dd = {
    ...:     'text': "I liked this phone.",
    ...:     'rating': 5.0,
    ...:     'positive': True
    ...: }

Assuming numpy is imported as np, I could make an object dtype array that contains 5 copies of this dictionary (or similar objects):

In [50]: arrO = np.empty((5,), object)
In [51]: dict(dd)
Out[51]: {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True}
In [52]: for i in range(5):
    ...:     arrO[i] = dict(dd)
    ...:     
In [53]: arrO
Out[53]: 
array([{'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
       {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
       {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
       {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
       {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True}],
      dtype=object)

But such an object array is much like a list - both contain pointers to objects elsewhere in memory:

In [54]: [dict(dd) for _ in range(5)]
Out[54]: 
[{'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
 {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
 {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
 {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True},
 {'text': 'I liked this phone.', 'rating': 5.0, 'positive': True}]

Iteration on the list is faster. Most operations on the object array involve iteration, with the exception of things like reshape that don't require access to individual elements.
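To make the "pointers to objects" point concrete, here is a minimal sketch. The names records, arr_obj and grid are only illustrative and are not from the session above. It checks that the object array holds the very same Python objects as the list, that reshape works without touching the elements, and that pulling out a field still needs a Python-level loop:

```python
import numpy as np

# Five independent dictionaries, as in the list example above.
records = [{"text": "I liked this phone.", "rating": 5.0, "positive": True}
           for _ in range(5)]

# Object array filled element by element, so each slot holds one dict.
arr_obj = np.empty(len(records), dtype=object)
for i, rec in enumerate(records):
    arr_obj[i] = rec

# The array stores references: each element is the same object as in the list.
same_objects = all(a is b for a, b in zip(arr_obj, records))

# reshape only rearranges the references; no element is inspected or copied.
grid = arr_obj.reshape(5, 1)

# Extracting a field is still an ordinary Python loop over the elements.
ratings = [rec["rating"] for rec in arr_obj]
```

So the object array buys array-style indexing and reshaping, but no vectorized access to the values inside each object.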

Another option is to make a structured array.

The key to making a structured array is defining a compound dtype and providing the data as a list of tuples.

Dictionaries preserve insertion order (guaranteed from Python 3.7; an implementation detail of CPython 3.6), so values() gives the fields in the desired order:

In [55]: tuple(dd.values())
Out[55]: ('I liked this phone.', 5.0, True)

In [56]: dt = np.dtype([('text','U30'),('rating',float),('positive',bool)])
In [57]: dt
Out[57]: dtype([('text', '<U30'), ('rating', '<f8'), ('positive', '?')])

Make the array with a list of tuples:

In [58]: arrS = np.array([tuple(dd.values()) for _ in range(5)],dtype=dt)
In [59]: arrS
Out[59]: 
array([('I liked this phone.', 5.,  True),
       ('I liked this phone.', 5.,  True),
       ('I liked this phone.', 5.,  True),
       ('I liked this phone.', 5.,  True),
       ('I liked this phone.', 5.,  True)],
      dtype=[('text', '<U30'), ('rating', '<f8'), ('positive', '?')])
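The question is about class instances rather than dictionaries, so here is a sketch of the same construction starting from objects. The Sentence class and its sample data are hypothetical stand-ins for the asker's class; the dtype matches the one defined above. Note that 'U30' silently truncates any text longer than 30 characters, so the width would need to fit the real data:

```python
import numpy as np


class Sentence:
    """Hypothetical stand-in for the asker's class."""

    def __init__(self, text, rating, positive):
        self.text = text
        self.rating = rating
        self.positive = positive


# Made-up sample objects for illustration only.
sentences = [
    Sentence("I liked this phone.", 5.0, True),
    Sentence("Battery died quickly.", 2.0, False),
    Sentence("Decent for the price.", 3.5, True),
]

# Same compound dtype as above: one named field per attribute.
dt = np.dtype([("text", "U30"), ("rating", float), ("positive", bool)])

# One tuple per object, with attributes listed in dtype field order.
arr = np.array([(s.text, s.rating, s.positive) for s in sentences], dtype=dt)
```

The list comprehension is still a Python loop, but it runs once; after that, arr['rating'] and arr['positive'] are ordinary numeric and boolean arrays.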

Access fields by name. Note that this is a 1d array of shape (5,) with 3 fields, not a (5,3) array:

In [60]: arrS['rating']
Out[60]: array([5., 5., 5., 5., 5.])
In [61]: arrS['positive']
Out[61]: array([ True,  True,  True,  True,  True])

Modifying the values of the fields:

In [62]: arrS['positive'] = [1,0,0,1,0]
In [63]: arrS['rating'] = np.arange(5)
In [64]: arrS
Out[64]: 
array([('I liked this phone.', 0.,  True),
       ('I liked this phone.', 1., False),
       ('I liked this phone.', 2., False),
       ('I liked this phone.', 3.,  True),
       ('I liked this phone.', 4., False)],
      dtype=[('text', '<U30'), ('rating', '<f8'), ('positive', '?')])

We can do math on the numeric fields:

In [65]: np.sum(arrS['rating'])
Out[65]: 10.0

Using the boolean field as a mask:

In [66]: arrS[arrS['positive']]
Out[66]: 
array([('I liked this phone.', 0.,  True),
       ('I liked this phone.', 3.,  True)],
      dtype=[('text', '<U30'), ('rating', '<f8'), ('positive', '?')])
In [67]: arrS[~arrS['positive']]
Out[67]: 
array([('I liked this phone.', 1., False),
       ('I liked this phone.', 2., False),
       ('I liked this phone.', 4., False)],
      dtype=[('text', '<U30'), ('rating', '<f8'), ('positive', '?')])

Operations on a structured array are faster than on the object dtype array, though a bit slower than the same operations on a standalone or fully numeric array.
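One plausible contributor to that gap is memory layout: a field of a structured array is a strided view into the records, not a packed block. The sketch below rebuilds a small array like arrS (the variable names rating_view, ratings, positive and mean_positive are illustrative) and copies the fields into standalone arrays for repeated numeric work. Whether the copy pays off depends on the workload, so treat it as something to measure rather than a rule:

```python
import numpy as np

dt = np.dtype([("text", "U30"), ("rating", float), ("positive", bool)])

# Ratings 0..4, with 'positive' set on the even-numbered rows.
arrS = np.array(
    [("I liked this phone.", float(i), i % 2 == 0) for i in range(5)],
    dtype=dt,
)

# Field access is a view: it shares memory with arrS and steps through it
# one whole record (arrS.itemsize bytes) at a time.
rating_view = arrS["rating"]

# Standalone copies are densely packed, one value right after the other.
ratings = np.ascontiguousarray(arrS["rating"])
positive = arrS["positive"].copy()

# Vectorized work on the standalone arrays: mean rating of positive rows.
mean_positive = ratings[positive].mean()
```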

hpaulj

In my object class, I created an as_dict() method that returns the object as a dictionary. From there, I loaded the dictionary version of each of my objects into a pandas DataFrame, then called as_matrix() to get it as a numpy array. Seemed to do the trick! (Note that as_matrix() was deprecated and then removed in pandas 1.0; to_numpy() is the replacement.)
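A rough sketch of this approach, written with DataFrame.to_numpy(). The Sentence class, its as_dict() body and the sample data are assumptions reconstructed from the description, not the original code:

```python
import pandas as pd


class Sentence:
    """Hypothetical version of the class described above."""

    def __init__(self, text, rating, positive):
        self.text = text
        self.rating = rating
        self.positive = positive

    def as_dict(self):
        # Return the attributes as a plain dictionary.
        return {"text": self.text, "rating": self.rating,
                "positive": self.positive}


# Made-up sample objects for illustration only.
sentences = [
    Sentence("I liked this phone.", 5.0, True),
    Sentence("Battery died quickly.", 2.0, False),
    Sentence("Decent for the price.", 3.5, True),
]

# One row per object; columns are listed explicitly to pin their order.
df = pd.DataFrame([s.as_dict() for s in sentences],
                  columns=["text", "rating", "positive"])

# Mixed column types collapse to a 2d object array, so row[0] is the text.
arr = df.to_numpy()

# A single numeric column keeps its own dtype and stays vectorizable.
ratings = df["rating"].to_numpy()
```

Because the columns have mixed types, arr has dtype object and the shape the question asked for, but numeric work is faster on per-column arrays such as ratings.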

MP12389