
I have a dataset representing a directed graph. The first column is the source node, the second column is the target node, and we can ignore the third column (essentially a weight). So for example:

0 1 3
0 13 1
0 37 1
0 51 1
0 438481 1
1 0 3
1 4 354
1 10 2602
1 11 2689
1 12 1
1 18 345
1 19 311
1 23 1
1 24 366
...

What I would like to do is append the out-degree for each node. For example, if I just added the out-degree for node 0, I would have:

0 1 3 5
0 13 1 5
0 37 1 5
0 51 1 5
0 438481 1 5
1 0 3
...

I have some code that does this, but it is extremely slow because I am using a for loop:

import numpy as np

def save_degrees(X):
    new_col = np.zeros(X.shape[0], dtype=int)
    X = np.column_stack((X, new_col))
    node_ids, degrees = np.unique(X[:, 0], return_counts=True)
    # This is the slow part.
    for node_id, deg in zip(node_ids, degrees):
        indices = X[:, 0] == node_id
        X[indices, -1] = deg
    return X

train_X = np.load('data/train_X.npy')
train_X = save_degrees(train_X)
np.save('data/train_X_degrees.npy', train_X)

Is there a more efficient way to build this data structure?

jds

4 Answers


You can use numpy.unique.

Suppose your input data is in the array data:

In [245]: data
Out[245]: 
array([[     0,      1,      3],
       [     0,     13,      1],
       [     0,     37,      1],
       [     0,     51,      1],
       [     0, 438481,      1],
       [     1,      0,      3],
       [     1,      4,    354],
       [     1,     10,   2602],
       [     1,     11,   2689],
       [     1,     12,      1],
       [     1,     18,    345],
       [     1,     19,    311],
       [     1,     23,      1],
       [     1,     24,    366],
       [     2,     10,      1],
       [     2,     13,      3],
       [     2,     99,      5],
       [     3,     25,     13],
       [     3,     99,     15]])

Find the unique values in the first column, along with the "inverse" array and the counts of the occurrences of each unique value:

In [246]: nodes, inv, counts = np.unique(data[:,0], return_inverse=True, return_counts=True)

Your column of out-degrees is counts[inv]:

In [247]: out_degrees = counts[inv]

In [248]: out_degrees
Out[248]: array([5, 5, 5, 5, 5, 9, 9, 9, 9, 9, 9, 9, 9, 9, 3, 3, 3, 2, 2])

This assumes that a pair (source_node, target_node) does not occur more than once in the data array.
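
To build the final array the question asks for, stack the new column onto the original data; a minimal sketch reusing the data and out_degrees arrays from above:

In [249]: result = np.column_stack((data, out_degrees))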

Warren Weckesser

np.unique indeed does a fine job here, as explained in some of the other answers.

Still, you might want to take a look at numpy_indexed (disclaimer: I am its author); it can do the same thing with the same efficiency, but it also supports a lot of other functionality that tends to be very useful when working with graphs, or with sparse/jagged data structures in general.

It also has a clean one-line solution to your problem specifically:

import numpy as np
import numpy_indexed as npi

X = np.column_stack((X, npi.multiplicity(X[:, 0])))
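
For completeness, a minimal self-contained sketch (assuming the package is installed, e.g. via pip install numpy-indexed); the small array here is illustrative, not the question's data:

import numpy as np
import numpy_indexed as npi

# three edges: node 0 has out-degree 2, node 1 has out-degree 1
X = np.array([[0,  1, 3],
              [0, 13, 1],
              [1,  0, 3]])
X = np.column_stack((X, npi.multiplicity(X[:, 0])))
# array([[ 0,  1,  3,  2],
#        [ 0, 13,  1,  2],
#        [ 1,  0,  3,  1]])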
Eelco Hoogendoorn

You can try this. The mask X[:, 0] == node_id is time-consuming when you have many distinct nodes, because it scans the entire column once per node. Instead, sort the data by the first column first, then create the new count column by repeating each count by itself, so each node's degree appears once per outgoing edge:

train_X = train_X[train_X[:, 0].argsort()]
_, counts = np.unique(train_X[:,0], return_counts=True)
np.hstack((train_X, np.repeat(counts, counts)[:, None]))

# array([[     0,      1,      3,      5],
#        [     0,     13,      1,      5],
#        [     0,     37,      1,      5],
#        [     0,     51,      1,      5],
#        [     0, 438481,      1,      5],
#        [     1,      0,      3,      9],
#        [     1,      4,    354,      9],
#        [     1,     10,   2602,      9],
#        [     1,     11,   2689,      9],
#        [     1,     12,      1,      9],
#        [     1,     18,    345,      9],
#        [     1,     19,    311,      9],
#        [     1,     23,      1,      9],
#        [     1,     24,    366,      9]])

Or you can use pandas groupby:

import pandas as pd

df = pd.DataFrame(train_X)
df['size'] = df.groupby(0)[0].transform('size')
df.values

#array([[     0,      1,      3,      5],
#       [     0,     13,      1,      5],
#       [     0,     37,      1,      5],
#       [     0,     51,      1,      5],
#       [     0, 438481,      1,      5],
#       [     1,      0,      3,      9],
#       [     1,      4,    354,      9],
#       [     1,     10,   2602,      9],
#       [     1,     11,   2689,      9],
#       [     1,     12,      1,      9],
#       [     1,     18,    345,      9],
#       [     1,     19,    311,      9],
#       [     1,     23,      1,      9],
#       [     1,     24,    366,      9]])
Psidom

Here's one vectorized approach with a focus on performance -

import numpy as np

def argsort_unique(idx):
    # Original idea : http://stackoverflow.com/a/41242285/3293881
    n = idx.size
    sidx = np.empty(n, dtype=int)
    sidx[idx] = np.arange(n)
    return sidx

def count_and_append(a): # For sorted arrays
    a0 = a[:,0]
    sf0 = np.flatnonzero(a0[1:] != a0[:-1])+1
    shift_idx = np.concatenate(( [0] , sf0, [a0.size] ))
    c = shift_idx[1:] - shift_idx[:-1]
    out_col = np.repeat(c,c)
    return np.column_stack((a, out_col))

def count_and_append_generic(a): # For generic (not necessarily sorted) arrays
    sidx = a[:,0].argsort()
    b = a[sidx]
    return count_and_append(b)[argsort_unique(sidx)]

Sample run -

In [70]: a # Not sorted case
Out[70]: 
array([[     1,     18,    345],
       [     1,     23,      1],
       [     0,     13,      1],
       [     0,     37,      1],
       [     2,     99,      5],
       [     0,      1,      3],
       [     2,     13,      3],
       [     1,      4,    354],
       [     1,     24,    366],
       [     0, 438481,      1],
       [     1,     12,      1],
       [     1,     11,   2689],
       [     1,     19,    311],
       [     2,     10,      1],
       [     3,     99,     15],
       [     0,     51,      1],
       [     3,     25,     13],
       [     1,      0,      3],
       [     1,     10,   2602]])

In [71]: np.allclose(count_and_append_generic(a), save_degrees(a))
Out[71]: True

If the input array is already sorted by the first column, simply use count_and_append(a).
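
For example, a minimal usage sketch of the sorted path (assuming a holds the sample array shown above):

a_sorted = a[a[:, 0].argsort()]   # sort rows by source node
out = count_and_append(a_sorted)  # same rows and degrees, in sorted order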

Divakar
  • This works and is also fast. The accepted answer requires a little less code, but this is a complete solution if anyone reading this answer wants that. – jds Apr 12 '17 at 18:54