
I have several large tables.CArray data structures of the same shape (300000x300000). I want to add them all element-wise and store the result in a master matrix.

Right now, I create a new carray and fill it with a simple loop:

shape = (300000, 300000)
# ... open all HDF5 files of the existing matrices and create a new one
matrix = h5f.createCArray(h5f.root, 'carray', atom, shape, filters=filters)

for i in range(shape[0]):
    for j in range(shape[1]):
        for m in single_matrices:
            # read one scalar from each source matrix ...
            value = m[i, j]
            # ... and accumulate it into the output
            matrix[i, j] += value

But it is very slow (>12 hours). Is there a better way?

haehn
  • `single_matrices` is the list of all of the large carrays I want to add. moving the `for m in single_matrices` to the top made it even slower. – haehn Jun 11 '13 at 13:47
  • I'm not familiar with pytables, but I suppose you would be able to read whole rows into NumPy arrays and add them up. – Janne Karila Jun 12 '13 at 11:07
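The row-wise idea from the comments can be sketched as follows. This is a minimal illustration with small in-memory NumPy arrays standing in for the on-disk CArrays (the names `single_matrices` and `matrix` mirror the question's variables); with real CArrays, each `m[i, :]` would be a single chunked HDF5 read instead of 300000 scalar reads:

```python
import numpy as np

shape = (4, 4)  # tiny stand-in for (300000, 300000)
# small arrays standing in for the source CArrays
single_matrices = [np.full(shape, 1.0), np.full(shape, 2.0)]
matrix = np.zeros(shape)

for i in range(shape[0]):
    row = np.zeros(shape[1])
    for m in single_matrices:
        row += m[i, :]      # one whole-row read per matrix, amortizing I/O
    matrix[i, :] = row      # one whole-row write

print(matrix[0, 0])  # 3.0
```

This reduces the number of HDF5 read/write operations from one per cell to one per row, which is usually the dominant cost in the nested-scalar loop.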

1 Answer


You really should be using the Expr() class to evaluate this [1]. It uses numexpr under the hood to compute the desired operations in parallel on chunks. Using the out argument will even write the result back out to disk as it computes. This ensures that the full array is never in memory.

  1. http://pytables.github.io/usersguide/libref/expr_class.html
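A minimal, self-contained sketch of this approach (using the modern snake_case PyTables API, e.g. `create_carray`, rather than the `createCArray` spelling in the question; tiny 4x4 arrays stand in for the real 300000x300000 ones, and the file name is arbitrary):

```python
import numpy as np
import tables as tb

h5f = tb.open_file("sum_demo.h5", "w")
atom = tb.Float64Atom()
shape = (4, 4)

# two source CArrays plus the output CArray, all on disk
a = h5f.create_carray(h5f.root, "a", atom, shape)
b = h5f.create_carray(h5f.root, "b", atom, shape)
out = h5f.create_carray(h5f.root, "result", atom, shape)
a[:] = np.ones(shape)
b[:] = 2 * np.ones(shape)

# numexpr evaluates "a + b" chunk by chunk; set_output() streams the
# result straight to the on-disk CArray, so no full array is in memory
expr = tb.Expr("a + b")
expr.set_output(out)
expr.eval()

result_value = out[0, 0]
print(result_value)  # 3.0
h5f.close()
```

`Expr` picks up the variables `a` and `b` from the local scope; for more matrices the expression string would simply grow (e.g. `"a + b + c"`).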
Anthony Scopatz