This must be easy, but I'm very new to PyTables. My application's datasets are too large to hold in memory, so I use PyTables CArrays. However, I need to find the maximum element in an array that is not infinity. Naively, in NumPy, I'd do this:

max_element = numpy.max(array[array != numpy.inf])

Obviously that won't work in PyTables without loading the whole array into memory. I could loop through the CArray in windows that fit in memory, but I'd be surprised if there weren't a max/min reduction operation. Is there an elegant mechanism to get the conditional maximum element of that array?

falsetru
Rich
  • Just out of interest: what dataset do you have that can't fit in memory? Or are you working on a 32-bit system? – usethedeathstar Jan 03 '14 at 06:57
  • Yes, I'm working on GIS data that exceeds 4GB and needs to run on a 32 bit machine. – Rich Jan 03 '14 at 07:05
  • http://stackoverflow.com/questions/9953174/what-is-the-equivalent-of-select-maxcolumn-from-table-in-pytables does this solve your problem? See the answer from Markus Jarderot – M4rtini Jan 03 '14 at 07:08
  • I believe PyTables in-kernel computations are done with numexpr, and as far as I know, numexpr only supports sum and prod reductions. So looping over chunks that fit in memory might be your best bet. – M4rtini Jan 03 '14 at 07:11
  • I guess the lazy solution would be getting 16 GB of RAM on a 64-bit system, though that just postpones the problem to the moment when your data has grown larger again – usethedeathstar Jan 03 '14 at 08:40
  • @usethedeathstar Unfortunately we have a wide userbase of 32 bit machines. We even have to support XP. :) – Rich Jan 04 '14 at 04:29

1 Answer

If your CArray is one dimensional, it is probably easier to stick it in a single-column Table. Then you have access to the where() method and can easily evaluate expressions like the following.

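import numpy as np
from itertools import imap  # Python 2 only
# pass infinity in via condvars rather than writing np.inf inside the condition string
max(imap(lambda r: r['col'], tab.where('col != inf', condvars={'inf': np.inf})))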

This works because where() returns an iterator and never reads in all the data at once; that iterator is handed off to imap(), whose result is handed off to max(). Note that Python 3 has no itertools.imap(); use the builtin map() instead, which is already lazy.

Not using a table means that you need to use the Expr class and do more of the wiring yourself.
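For comparison, the windowed loop mentioned in the question can be sketched roughly as follows. This is only an illustration, not a PyTables feature: the function name chunked_max_excluding_inf and the chunk_len parameter are made up here, and the sketch assumes a one-dimensional array-like that supports len() and slicing (a CArray slice reads just that slice from disk). It is written in plain Python so it runs anywhere, and NaN values are not handled specially.

```python
import math


def chunked_max_excluding_inf(arr, chunk_len=1000000):
    """Return the largest element of a 1-D sliceable `arr` that is not +inf.

    Only one window of `chunk_len` elements is materialized at a time.
    Returns None if the array is empty or every element is +inf.
    (Hypothetical helper for illustration; not part of PyTables.)
    """
    best = None
    for start in range(0, len(arr), chunk_len):
        # Slicing pulls in only this window of the array.
        window = arr[start:start + chunk_len]
        candidates = [v for v in window if v != math.inf]
        if candidates:
            local_max = max(candidates)
            if best is None or local_max > best:
                best = local_max
    return best
```

If the windows come back as NumPy arrays, the list comprehension and max() inside the loop could be swapped for the question's own numpy.max(window[window != numpy.inf]), guarded against an empty selection.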

Anthony Scopatz