
So, I have lists of words and I need to know how often each word appears on each list. Using ".count(word)" works, but it's too slow (each list has thousands of words and I have thousands of lists).

I've been trying to speed things up with numpy. I generated a unique numerical code for each word, so I could use numpy.bincount (since it only works with integers, not strings). But I get "ValueError: array is too big".

So now I'm trying to tweak the "bins" argument of the numpy.histogram function to make it return the frequency counts I need (somehow numpy.histogram seems to have no trouble with big arrays). But so far no good. Has anyone out there done this before? Is it even possible? Is there some simpler solution that I'm failing to see?

Parzival
  • How big is your array!? I just used bincount with a 10000000 length array of ints and it worked just fine. I run out of memory before I get the error you do. – Henry Gomersall Jun 04 '13 at 22:05
  • I think the issue here involves your unique numerical code system, not the size of the initial arrays. np.bincount will create an array of length equal to 1 + the largest integer in your array, which, if you're using some sort of coding with ridiculously large numbers, might cause a problem. Still, I had no problem with np.bincount([1000000000]). What is your numerical coding scheme? – cge Jun 04 '13 at 22:13
  • Ah, it seems that error occurs when the integers you're trying to bin are huge. You can emulate it with `foo = numpy.random.randint(2**62, size=1000); numpy.bincount(foo)`. I guess it's trying to create a huge unindexable array to store all the bins and numpy is saying no (that error is in `multiarray/ctors.c`). How many words do you have? – Henry Gomersall Jun 04 '13 at 22:17
  • Henry and cge, I think you're both spot on. To create the numerical codes I'm using the binascii.crc32 function (from the binascii module). Those are big numbers. Worse: to ensure that all numerical codes are positive, I'm squaring those numbers. I'll find some way to produce smaller numerical codes -- I'm guessing that should do the trick. Many thanks! – Parzival Jun 04 '13 at 22:23

3 Answers


Don't use numpy for this. Use collections.Counter instead. It's designed for this use case.
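For example, a minimal sketch (the word lists here are made up to stand in for your data):

```python
from collections import Counter

# Hypothetical data standing in for the thousands of word lists.
word_lists = [
    ['apple', 'banana', 'apple'],
    ['banana', 'cherry', 'banana', 'banana'],
]

# One Counter per list; each is built in a single O(n) pass,
# unlike calling .count(word) once per word (O(n) per call).
counts = [Counter(words) for words in word_lists]

print(counts[0]['apple'])   # 2
print(counts[1]['banana'])  # 3
```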

Robert Kern

Why not reduce your integers to the minimum set using numpy.unique:

original_keys, lookup_vals = numpy.unique(big_int_string_array, return_inverse=True)

You can then just use numpy.bincount on lookup_vals, and if you need to get back the original unique integer, you can use the values of lookup_vals as indices into original_keys.

So, something like:

import binascii
import numpy

string_list = ['a', 'b', 'c', 'a', 'b', 'd', 'c']
# crc32 already returns a non-negative int in Python 3, so no squaring is needed
int_list = [binascii.crc32(s.encode()) for s in string_list]

original_keys, lookup_vals = numpy.unique(int_list, return_inverse=True)

bins = numpy.bincount(lookup_vals)

Also, it avoids the need to square your integers.
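Putting the pieces together, here is a sketch of recovering per-code counts (the integer codes below are hypothetical stand-ins for hashed words):

```python
import numpy

# Hypothetical integer codes, as some hash function might produce them.
int_list = [97, 98, 99, 97, 98, 100, 99]

# original_keys holds the sorted unique codes; lookup_vals maps each
# element of int_list to its index in original_keys.
original_keys, lookup_vals = numpy.unique(int_list, return_inverse=True)

# bincount now only needs an array as long as the number of unique codes.
bins = numpy.bincount(lookup_vals)

# original_keys[i] occurred bins[i] times.
freq = dict(zip(original_keys.tolist(), bins.tolist()))
print(freq)  # {97: 2, 98: 2, 99: 2, 100: 1}
```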

Henry Gomersall

Thiago, you can also try it directly on the categorical variables with scipy's itemfreq function. Here's an example:

>>> import scipy as sp
>>> import scipy.stats
>>> rv = ['do', 're', 'do', 're', 'do', 'mi']
>>> note_frequency = sp.stats.itemfreq(rv)
>>> note_frequency
array([['do', '3'],
       ['mi', '1'],
       ['re', '2']],
      dtype='|S2')
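Note that newer SciPy releases removed itemfreq; if it's unavailable in your version, numpy.unique with return_counts=True gives the same information:

```python
import numpy as np

rv = ['do', 're', 'do', 're', 'do', 'mi']

# values are the sorted unique items; counts[i] is how often values[i] appears
values, counts = np.unique(rv, return_counts=True)
print(values)  # ['do' 'mi' 're']
print(counts)  # [3 1 2]
```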
rafaelvalle