2

I have a numpy array of strings, some duplicated, and I'd like to compare every element with every other element to produce a new vector of 1's and 0's indicating whether each pair (i,j) is the same or different.

e.g. ["a","b","a","c"] -> 16-element (4*4) vector [1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1]

Is there a way to do this quickly in numpy without a double loop through all pairs of elements? My array has ~240,000 elements, so it's taking a terribly long time to do it the naive way.

I'm aware of numpy.equal.outer, but apparently numpy.equal is not implemented on strings, so it seems like I'll need some more clever way to compare them.

Jess
  • How did you get `[1,0,1,0,0,0,0,0,0,0,0,0]`? Could you explain? Maybe implement a double loopy version that we could try to vectorize? – Divakar Mar 10 '17 at 17:57
  • you really need to explain your transformation from `[a,b,c,d]=> other thing` better – Joran Beasley Mar 10 '17 at 17:57

2 Answers

3

Build an array containing the hash values of the strings (using the built-in hash() function), then compare the hashes instead of the strings.

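import numpy as np
eg = ['a', 'b', 'c', 'a']
# Equal strings always hash equal; distinct strings could in principle collide.
hashed = np.array([hash(s) for s in eg])
result = np.equal.outer(hashed, hashed)
print(result)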

outputs:

[[ True False False  True]
 [False  True False False]
 [False False  True False]
 [ True False False  True]]

If there are only 1-character-long strings, you can use ord() instead of hash():

Given a string of length one, return an integer representing the Unicode code point of the character when the argument is a unicode object, or the value of the byte when the argument is an 8-bit string. For example, ord('a') returns the integer 97, ord(u'\u2020') returns 8224.
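Since two different strings can in principle share a hash value, one possible variation is to map each distinct string to a small integer code and compare the codes. The sketch below uses np.unique with return_inverse=True for that; it is a swapped-in technique rather than part of the answer above, and the names codes and same are only illustrative.

```python
import numpy as np

eg = np.array(['a', 'b', 'c', 'a'])

# return_inverse yields, for every element, the index of its value
# in the sorted array of unique values: equal strings get equal codes.
_, codes = np.unique(eg, return_inverse=True)

# Integer codes can be compared pairwise with the ufunc's outer method.
same = np.equal.outer(codes, codes)
print(same.astype(int))
```

The result is still an n-by-n matrix, so this only helps with the string comparison, not with the size of the output.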

1

tl;dr

You don't want that.

Details

First let's note that you're actually building a triangular matrix: compare the first element to the rest of the elements, then repeat recursively on the rest. You don't use the triangularity, though. You just cut off the diagonal (each element is always equal to itself) and merge the rows into one list in your example.

If you sort your source list, you won't need to compare each element to the rest of the elements, only to the next element. You'd have to store each element's original position alongside it, e.g. in a tuple, to keep track of it after sorting.

You would sort the list of pairs in O(n log n) time, then scan it and find all the matches in O(n) time. Both sorting and finding the matches are simple and quick in your case.
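The sort-then-scan idea could look roughly like the sketch below. It uses np.argsort in place of explicit (element, position) tuples, which is an assumption about how one might implement it in NumPy; the function name equal_groups is hypothetical.

```python
import numpy as np

def equal_groups(values):
    """Return lists of original positions holding equal values (groups of 2+)."""
    values = np.asarray(values)
    # O(n log n): order[k] is the original position of the k-th smallest value.
    order = np.argsort(values, kind="stable")
    sorted_vals = values[order]
    # O(n) scan: a new run starts wherever a value differs from its predecessor.
    breaks = np.flatnonzero(sorted_vals[1:] != sorted_vals[:-1]) + 1
    runs = np.split(order, breaks)
    return [run.tolist() for run in runs if len(run) > 1]

groups = equal_groups(["a", "b", "a", "c"])
print(groups)  # [[0, 2]]
```

Each group lists the positions that all compare equal, which carries the same information as the full 0/1 matrix in far less space.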

After that, you'd have to create your 'bit vector', which is O(n^2) long. It would contain len(your vector) ** 2 elements, or 57,600 million elements for a 240k-element vector. Even if you represented each element as one bit, it would take about 53.6 Gibit, or 6.7 GiB of memory.
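The size estimate can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using binary units (2**30), and the variable names are illustrative.

```python
n = 240_000
entries = n ** 2                 # cells in the full n-by-n comparison
gibit = entries / 2**30          # at one bit per cell
gib = entries / 8 / 2**30        # the same amount expressed in bytes
print(entries, round(gibit, 1), round(gib, 1))  # 57600000000 53.6 6.7
```

A NumPy boolean array stores one byte per element rather than one bit, so an unpacked result would need eight times as much.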

Likely you don't want that. I suggest that you find a list of matching pairs in O(n log n) time, sort it by first and then second position, also in O(n log n) time, and recreate any portion of your desired bitmap by looking at that list of pairs; binary search would really help. Provided that you have far fewer matches than pairs of elements, the result may even fit in RAM.
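A minimal sketch of that suggestion follows, assuming the matches are sparse enough to list explicitly. The helper names matching_pairs and is_match are hypothetical, and the lookup uses the standard-library bisect module.

```python
import bisect
from itertools import combinations

import numpy as np

def matching_pairs(values):
    """Sorted list of (i, j) with i < j and values[i] == values[j]."""
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")
    sorted_vals = values[order]
    breaks = np.flatnonzero(sorted_vals[1:] != sorted_vals[:-1]) + 1
    pairs = []
    for run in np.split(order, breaks):
        # Every two positions inside one run of equal values form a match.
        pairs.extend(combinations(sorted(run.tolist()), 2))
    pairs.sort()  # ordered by first position, then second
    return pairs

def is_match(pairs, i, j):
    """Recreate one cell of the bitmap by binary search in the pair list."""
    if i == j:
        return True
    key = (min(i, j), max(i, j))
    k = bisect.bisect_left(pairs, key)
    return k < len(pairs) and pairs[k] == key

pairs = matching_pairs(["a", "b", "a", "c"])
print(pairs, is_match(pairs, 2, 0), is_match(pairs, 1, 3))
```

The pair list grows quadratically with the size of each group of duplicates, so this only pays off under the condition stated above: matches must be rare compared to all possible pairs.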

Community
9000