
I have a very large 1D Python array x of somewhat repeating numbers, and along with it some data d of the same size.

x = np.array([48531, 62312, 23345, 62312, 1567, ..., 23345, 23345])
d = np.array([0    , 1    , 2    , 3    , 4   , ..., 99998, 99999])

In my context "very large" refers to 10k...100k entries. Since some of them repeat, the number of unique entries is only about 5k...15k.

I would like to group them into bins. This should be done by creating two objects. One is a matrix buffer b of data items taken from d. The other is a vector v of unique x values that each of the buffer columns refers to. Here's an example:

v =  [48531, 62312, 23345, 1567, ...]
b = [[0    , 1    , 2    , 4   , ...]
     [X    , 3    , ....., ...., ...]
     [ ...., ....., ....., ...., ...]
     [X    , X    , 99998, X   , ...]
     [X    , X    , 99999, X   , ...] ]

Since the number of occurrences of each unique number in x varies, some of the values in the buffer b are invalid (indicated by the capital X, i.e. "don't care").


It's very easy to derive v in numpy:

v, n = np.unique(x, return_counts=True)  # yay, just 5ms

and we even get n, which is the number of valid entries within each column of b. Moreover, (np.max(n), v.shape[0]) gives the shape of the matrix b that needs to be allocated.

But how to efficiently generate b? A for loop could help:

b = np.zeros((np.max(n), v.shape[0]))
for i in range(v.shape[0]):
    idx = np.flatnonzero(x == v[i])
    b[0:n[i], i] = d[idx]

This loop iterates over all columns of b and extracts the indices idx by identifying all the locations where x == v[i].

However, I don't like this solution because of the rather slow for loop (it takes about 50x longer than the unique command). I'd rather have the operation vectorized.


So one vectorized approach would be to create a matrix of matches where x == v and then run the nonzero() command on it along the columns. However, this matrix would require memory in the range of 150k x 15k entries, so about 8 GB on a 32-bit system.
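For illustration, a minimal sketch of that broadcast idea (with toy-sized inputs, not the real 100k-element arrays) shows where the memory goes: the intermediate boolean mask has len(x) * len(v) entries.

import numpy as np

# toy stand-ins for the real x and d
x = np.array([48531, 62312, 23345, 62312, 1567, 23345, 23345])
d = np.arange(x.size)
v, n = np.unique(x, return_counts=True)

# boolean match matrix of shape (len(x), len(v)); at ~100k x ~15k entries
# this intermediate alone is the multi-GB problem mentioned above
mask = x[:, None] == v[None, :]

# nonzero() scans the mask row by row, so within each column the matching
# positions come out in original x-order; a per-column write position is
# still needed to scatter d into b
rows, cols = np.nonzero(mask)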

To me it sounds rather silly that the np.unique operation can even efficiently return the inverse indices so that x = v[inv_indices], but that there is no way to get the v-to-x assignment lists for each bin in v. These should come almost for free while the function is scanning through x. Implementation-wise, the only challenge would be the unknown size of the resulting index matrix.
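For reference, the inverse indices mentioned here are what np.unique(..., return_inverse=True) returns; a quick round-trip check:

import numpy as np

x = np.array([48531, 62312, 23345, 62312, 1567, 23345, 23345])
v, inv_indices, n = np.unique(x, return_inverse=True, return_counts=True)

assert np.array_equal(x, v[inv_indices])  # x is fully reconstructed from v
# inv_indices[j] is the bin (column of b) that x[j] and d[j] belong to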


Another way of phrasing this problem, assuming that the np.unique command is the method to use for binning:

Given the three arrays x, v, inv_indices, where v contains the unique elements in x and x = v[inv_indices]: is there an efficient way of generating the index vectors v_to_x[i] such that all(v[i] == x[v_to_x[i]]) for all bins i?

I shouldn't have to spend more time than for the np.unique command itself. And I'm happy to provide an upper bound for the number of items in each bin (e.g. 50).
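To make the requested property concrete, here is what v_to_x would look like for a toy x (the index vectors are written out by hand here, not computed by any particular method):

import numpy as np

x = np.array([48531, 62312, 23345, 62312, 1567, 23345, 23345])
v, inv_indices = np.unique(x, return_inverse=True)
# v comes out sorted: [1567, 23345, 48531, 62312]

# the desired per-bin index vectors for this toy x
v_to_x = [np.array([4]), np.array([2, 5, 6]), np.array([0]), np.array([1, 3])]

# the property asked for above
assert all(np.all(v[i] == x[v_to_x[i]]) for i in range(v.size))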

rava
    Does `itertools.groupby` help? I'm not sure what you're looking for... – user202729 Feb 06 '18 at 15:52
    If you define in Pandas your numpy arrays as `df = pd.DataFrame({"x": x, "d": d})`, then you can collate the data for each unique value with `unik = df.groupby(["x"])["d"].unique().reset_index()`. This is not an array, like you want, but a column contains numpy arrays of all d values for each unique x value. Idk, what you want to do next with your data, but maybe you don't need a full b array with all those NaN values. – Mr. T Feb 06 '18 at 20:32
  • @user202729 I looked into this groupby business and the problem seems to be that groupby only groups the values themselves, not their indices in the list. So if you invoke `for key,value in groupby(x)` you can turn the group into a list `l = list(value)`. This will leave you with `True == (li == key) for all li in l`. I would need the indices, i.e. `True == (x[li] == key) for all li in l` – rava Feb 07 '18 at 10:38
    What about `enumerate`? – user202729 Feb 07 '18 at 10:39
  • @Piinthesky even though your suggestion works, the grouping command `unik = df.groupby(["x"])["d"].unique().reset_index()` takes ~900 ms to compute for the kind of problems described above. This is not an option unless it can somehow be sped up. – rava Feb 07 '18 at 11:47
  • @Piinthesky Only if it gets heavily rewritten. As is, the question is off-topic there. See https://codereview.meta.stackexchange.com/questions/5777/a-guide-to-code-review-for-stack-overflow-users. It's perfectly fine here on SO. – Zeta Feb 07 '18 at 12:17

2 Answers


Based on the suggestion from @user202729 I wrote this code:

import numpy as np
from itertools import groupby

T = np.unique(x).shape[0]  # number of unique values in x (number of columns of b)
K = 50                     # upper bound on the number of items per bin (number of rows of b)

x_sorted_args = np.argsort(x)
x_sorted = x[x_sorted_args]

i = 0
v = -np.ones(T)
b = np.zeros((K, T))

# groupby() groups consecutive equal values, hence the sort above;
# enumerate() tags each sorted value with its position in x_sorted
for k, g in groupby(enumerate(x_sorted), lambda tup: tup[1]):
    groups = np.array(list(g))[:, 0]  # keep only the positions, drop the values
    size = groups.shape[0]

    v[i] = k
    b[0:size, i] = d[x_sorted_args[groups]]  # map positions back to indices into d
    i += 1

It runs in about 100 ms, which is a considerable speedup w.r.t. the original code posted above.

It first enumerates the values in x_sorted, adding the corresponding index information. Then the enumeration is grouped by the actual x value, which is the second element of the tuples generated by enumerate().

The for loop iterates over all the groups, turning each iterator of tuples g into a (size x 2) matrix and then throwing away the second column, i.e. the x values, keeping only the indices. This leaves groups as just a 1D array.

Note that groupby() only groups consecutive equal elements, which is why the array has to be sorted first.
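To illustrate what the loop iterates over: on a tiny sorted array, groupby() yields one (key, group) pair per run of equal values, and each item inside a group is an (index, value) tuple coming from enumerate():

from itertools import groupby

x_sorted = [1567, 23345, 23345, 23345, 48531, 62312, 62312]
for k, g in groupby(enumerate(x_sorted), lambda tup: tup[1]):
    print(k, list(g))

# 1567  [(0, 1567)]
# 23345 [(1, 23345), (2, 23345), (3, 23345)]
# 48531 [(4, 48531)]
# 62312 [(5, 62312), (6, 62312)]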


Good work. I'm just wondering if we can do even better. A lot of unnecessary data copying still seems to happen: creating a list of tuples and then turning it into a 2D matrix just to throw away half of it feels a bit suboptimal.

rava

I received the answer I was looking for by rephrasing the question, see here: python: vectorized cumulative counting

by "cumulative counting" the inv_indices returned by np.unique() we receive the array indices of the sparse matrix so that

c = cumcount(inv_indices)  # running count of prior occurrences of each bin
b[c, inv_indices] = d      # row = position within the bin, column = bin index

Cumulative counting as proposed in the thread linked above is very efficient; run times below 20 ms are realistic.
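The exact implementation is in the linked thread; purely as an illustration, one possible vectorized cumcount (an assumption on my part, not necessarily the version from that thread) could look like this:

import numpy as np

# toy stand-ins for the x and d from the question
x = np.array([48531, 62312, 23345, 62312, 1567, 23345, 23345])
d = np.arange(x.size)

def cumcount(a):
    """For each element of a, the number of earlier occurrences of the same value."""
    counts = np.bincount(a)                                  # occurrences per bin
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))   # start of each bin's block
    order = np.argsort(a, kind='stable')                     # group equal values, keep original order
    ranks = np.arange(a.size) - np.repeat(starts, counts)    # 0..count-1 within each block
    out = np.empty(a.size, dtype=np.intp)
    out[order] = ranks                                       # scatter back to original order
    return out

v, inv_indices, n = np.unique(x, return_inverse=True, return_counts=True)
b = np.zeros((n.max(), v.size))
b[cumcount(inv_indices), inv_indices] = d  # row = running count, column = bin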

rava