I have two NumPy arrays, users and dat. For each user in users, I need to find that user's rows in dat and count the number of unique group values. The real case has len(users) = 200000 and len(dat) = 2800000. My current method does not exploit the fact that dat is sorted by id, which makes it very slow. How can I speed this up?

The 'other' field in dat is only there to show that the structured array contains additional fields as well.

import numpy as np

users = np.array([111, 222, 333])
info = np.zeros(len(users))
dt = [('id', np.int32), ('group', np.int16), ('other', np.float64)]
dat = np.array([(111, 1, 0.0), (111, 3, 0.0), (111, 2, 0.0), (111, 1, 0.0),
               (222, 1, 0.0), (222, 1, 0.0), (222, 4, 0.0),
               (333, 2, 0.0), (333, 1, 0.0), (333, 2, 0.0)],
               dtype=dt)

for i, u in enumerate(users):
    u_dat = dat[np.in1d(dat['id'], u)]
    uniq = set(u_dat['group'])
    info[i] = int(len(uniq))

print(info)
pir
  • In C, you loop and increment your counter when current value != previous value. This probably isn't helpful here, since looping array elements in python is generally not how you write fast numpy code. – Peter Cordes Aug 31 '15 at 04:33
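The "compare with the previous value" idea from the comment above can also be expressed without a Python loop: sort the rows by (id, group), flag every row that differs from its predecessor, and count the flags per user. This is only a minimal sketch, assuming the id and group columns have been pulled out into plain arrays; the variable names are illustrative and not from the question.

```python
import numpy as np

# Sample columns taken from the question's data (plain arrays for brevity)
ids = np.array([111, 111, 111, 111, 222, 222, 222, 333, 333, 333])
groups = np.array([1, 3, 2, 1, 1, 1, 4, 2, 1, 2])
users = np.array([111, 222, 333])

# lexsort uses the LAST key as the primary one: sort by id, then by group
order = np.lexsort((groups, ids))
s_ids, s_groups = ids[order], groups[order]

# True where a row starts a new (id, group) pair, i.e. current != previous
is_new = np.ones(len(s_ids), dtype=bool)
is_new[1:] = (s_ids[1:] != s_ids[:-1]) | (s_groups[1:] != s_groups[:-1])

# Prefix sums let us count the flags inside each user's slice
cum = np.concatenate(([0], np.cumsum(is_new)))
first = np.searchsorted(s_ids, users, side='left')
last = np.searchsorted(s_ids, users, side='right')
counts = cum[last] - cum[first]

print(counts)  # [3 2 2]
```

The sort costs O(n log n) once; after that, every user is answered with two binary searches and one subtraction.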

1 Answer

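If you want to profit from numpy's vectorization, it helps greatly to remove all duplicates from dat beforehand. np.unique also returns the records sorted (by id first), so the number of distinct entries for a user is the distance between the first and the one-past-the-last occurrence of that id, which you can find with two calls to searchsorted: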

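# np.unique compares whole records and returns them sorted by id, then group.
# If 'other' can differ within an (id, group) pair, deduplicate on those
# two fields only.
dat_unq = np.unique(dat)
first = dat_unq['id'].searchsorted(users, side='left')
last = dat_unq['id'].searchsorted(users, side='right')
info = last - first  # number of distinct records per user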
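To check the idea on the question's sample data, here is a self-contained sketch. It makes one assumption beyond the answer: the id and group fields are first copied into a compact array (called keys here, a name chosen for illustration) so that differing 'other' values cannot make two rows with the same (id, group) look distinct.

```python
import numpy as np

users = np.array([111, 222, 333])
dt = [('id', np.int32), ('group', np.int16), ('other', np.float64)]
dat = np.array([(111, 1, 0.0), (111, 3, 0.0), (111, 2, 0.0), (111, 1, 0.0),
                (222, 1, 0.0), (222, 1, 0.0), (222, 4, 0.0),
                (333, 2, 0.0), (333, 1, 0.0), (333, 2, 0.0)],
               dtype=dt)

# Assumption: only (id, group) defines a duplicate, so copy just those fields
keys = np.empty(len(dat), dtype=[('id', np.int32), ('group', np.int16)])
keys['id'] = dat['id']
keys['group'] = dat['group']

# Unique records come back sorted by id, then group
keys_unq = np.unique(keys)

# Each user's distinct groups now occupy one contiguous run of rows
first = keys_unq['id'].searchsorted(users, side='left')
last = keys_unq['id'].searchsorted(users, side='right')
info = last - first

print(info)  # [3 2 2]
```

After deduplication, the seven remaining rows are (111,1), (111,2), (111,3), (222,1), (222,4), (333,1), (333,2), so the run lengths are 3, 2 and 2.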

This is only advantageous if you are going to look up a large fraction of the ids in dat. If you need only a small fraction, you can still use the two calls to searchsorted to find each user's slice of dat, and call unique on just that slice:

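info = np.empty_like(users, dtype=np.intp)
# dat['id'] must already be sorted for searchsorted to give valid slices
first = dat['id'].searchsorted(users, side='left')
last = dat['id'].searchsorted(users, side='right')
for idx, (start, stop) in enumerate(zip(first, last)):
    # count the distinct 'group' values among this user's rows
    info[idx] = len(np.unique(dat['group'][start:stop]))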
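The per-slice variant can be wrapped in a small helper to see how it behaves at the edges, for example for a user that does not occur in dat at all. This is a sketch under stated assumptions: the function name count_unique_groups and the extra user 444 are hypothetical, and dat is assumed to be sorted by id as in the question.

```python
import numpy as np


def count_unique_groups(dat, users):
    """Count distinct 'group' values per user; dat must be sorted by 'id'."""
    ids = dat['id']
    groups = dat['group']
    # Binary-search the start and one-past-the-end row for every user
    first = ids.searchsorted(users, side='left')
    last = ids.searchsorted(users, side='right')
    info = np.empty(len(users), dtype=np.intp)
    for idx, (start, stop) in enumerate(zip(first, last)):
        # An absent user yields an empty slice, hence a count of 0
        info[idx] = len(np.unique(groups[start:stop]))
    return info


dt = [('id', np.int32), ('group', np.int16), ('other', np.float64)]
dat = np.array([(111, 1, 0.0), (111, 3, 0.0), (111, 2, 0.0), (111, 1, 0.0),
                (222, 1, 0.0), (222, 1, 0.0), (222, 4, 0.0),
                (333, 2, 0.0), (333, 1, 0.0), (333, 2, 0.0)],
               dtype=dt)

# 444 is a made-up id that is not present in dat
users = np.array([111, 222, 333, 444])
result = count_unique_groups(dat, users)
print(result)  # [3 2 2 0]
```

The Python loop runs once per requested user, not once per row of dat, which is why this form suits the case where only a few users are looked up.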
Jaime