
I'm new to Python, coming from Matlab. I have a large sparse matrix saved in Matlab v7.3 (HDF5) format. So far I've found two ways of loading the file, using h5py and tables. However, operating on the matrix is extremely slow with either. For example, in Matlab:

>> whos     
  Name           Size                   Bytes  Class     Attributes

  M      11337x133338            77124408  double    sparse    

>> tic, sum(M(:)); toc
Elapsed time is 0.086233 seconds.

Using tables:

t = time.time()
sum(f.root.M.data)
elapsed = time.time() - t
print elapsed
35.929461956

Using h5py:

t = time.time()
sum(f["M"]["data"])
elapsed = time.time() - t
print elapsed

(I gave up waiting ...)

[EDIT]

Based on the comments from @bpgergo, I should add that I've tried converting the result loaded in by h5py (f) into a numpy array or a scipy sparse array in the following two ways:

from scipy import sparse
A = sparse.csc_matrix((f["M"]["data"], f["M"]["ir"], f["M"]["jc"]))

or

data = numpy.asarray(f["M"]["data"])
ir = numpy.asarray(f["M"]["ir"])
jc = numpy.asarray(f["M"]["jc"])    
A = sparse.coo_matrix(data, (ir, jc))

but both of these operations are extremely slow as well.

Is there something I'm missing here?

tdc

3 Answers


Most of your problem is that you're using Python's built-in `sum` on what's effectively a memory-mapped array (i.e. it's on disk, not in memory).

First off, you're comparing the time it takes to read things from disk to the time it takes to read things in memory. Load the array into memory first, if you want to compare to what you're doing in matlab.

Secondly, Python's builtin sum is very inefficient for numpy arrays. (Or, rather, iterating through every item of a numpy array independently is very slow, which is what Python's builtin sum does.) Use numpy.sum(yourarray) or yourarray.sum() instead for numpy arrays.

As an example:

(Using h5py, because I'm more familiar with it.)

import h5py
import numpy as np

f = h5py.File('yourfile.hdf', 'r')
dataset = f['/M/data']

# Load the entire array into memory, like you're doing for matlab...
data = np.empty(dataset.shape, dataset.dtype)
dataset.read_direct(data)

print data.sum() #Or alternately, "np.sum(data)"
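If the array is too large to load wholesale, a variation on the same idea is to sum the dataset slice by slice, so only one chunk is in memory at a time. (A minimal sketch under the same assumed `/M/data` layout; `chunked_sum` is a hypothetical helper, not part of h5py. The demo uses h5py's in-memory `core` driver so it runs without an actual file.)

```python
import h5py
import numpy as np

def chunked_sum(dataset, chunk=1000000):
    """Sum an on-disk h5py dataset one slice at a time."""
    total = 0.0
    for start in range(0, dataset.shape[0], chunk):
        # Slicing an h5py Dataset reads only that range from disk
        total += np.sum(dataset[start:start + chunk])
    return total

# Self-contained demo with an in-memory HDF5 file (no disk I/O)
with h5py.File('demo.h5', 'w', driver='core', backing_store=False) as f:
    d = f.create_dataset('/M/data', data=np.arange(10.0))
    print(chunked_sum(d, chunk=3))  # 45.0
```

Each `dataset[start:start + chunk]` slice comes back as a regular numpy array, so `np.sum` on it is fast; only the disk reads are serialized.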
Joe Kington
  • Loading the file in was almost instantaneous in Matlab (<1sec) so I think the comparison was fair, but I take your point about the built-in sum function. I think more and more people will be doing what I'm doing (moving from Matlab to Python) so it would be good if there was a little bit more support for loading in Matlab files IMHO ... – tdc Dec 16 '11 at 09:44
  • Well, I can't test it without your file, but actually loading the array in python should be very quick as well. What you're currently doing isn't actually loading it. It returns what's effectively a memory-mapped array. Accessing it independently will be very slow in any language, as it's mostly disk seeks. Is the example code above still slow? Also, have a look at `scipy.io.loadmat` http://docs.scipy.org/doc/scipy/reference/generated/scipy.io.loadmat.html#scipy.io.loadmat , though I'm not sure if it supports sparse arrays. – Joe Kington Dec 16 '11 at 16:50

The final answer for posterity:

import tables, warnings
from scipy import sparse

def load_sparse_matrix(fname):
    warnings.simplefilter("ignore", UserWarning)
    f = tables.open_file(fname)  # tables.openFile() on PyTables < 3.0
    # Read each dataset fully into memory with [...] before building the matrix
    M = sparse.csc_matrix((f.root.M.data[...], f.root.M.ir[...], f.root.M.jc[...]))
    f.close()
    return M
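For anyone unsure how Matlab's on-disk arrays line up with SciPy's arguments: the `data`, `ir`, `jc` triple is exactly SciPy's CSC `(data, indices, indptr)` form. A tiny self-contained illustration with toy values (not taken from the original file):

```python
import numpy as np
from scipy import sparse

# Toy CSC triple describing the 2x2 matrix [[1, 0], [0, 2]]
data = np.array([1.0, 2.0])  # nonzero values (Matlab "data")
ir = np.array([0, 1])        # row index of each value (Matlab "ir")
jc = np.array([0, 1, 2])     # per-column start offsets into data (Matlab "jc")

A = sparse.csc_matrix((data, ir, jc))
print(A.toarray())  # dense form: [[1, 0], [0, 2]]
print(A.sum())      # 3.0
```

Because the `jc` pointer array determines the number of columns, no explicit `shape` argument is needed here.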
Danica
tdc

You're missing NumPy; here is a guide for Matlab users.

Glorfindel
bpgergo
  • Any more clues? If I do `M = numpy.asarray(f["M"]["data"])` this seems to take forever ... – tdc Dec 06 '11 at 16:27
  • @tdc, I don't even know what `f` is in your code. Try consulting this page: http://docs.scipy.org/doc/scipy/reference/generated/scipy.io.loadmat.html. That said, from what I've read you'll still need an HDF5 Python lib to load v7.3 Matlab files. – bpgergo Dec 06 '11 at 16:38
  • Also there's nothing on there about sparse matrices – tdc Dec 06 '11 at 16:45
  • Sorry `f` was loaded in using h5py: `f = h5py.File('filename.mat')` – tdc Dec 06 '11 at 16:46
  • Numpy can handle [sparse](http://docs.scipy.org/doc/scipy/reference/sparse.html) matrices as well. If I understand correctly, you cannot even load the Matlab format file into a numpy matrix. In this case I really suggest to start a new question on this specific issue (at least I cannot help on this one). I do hope you'll be fine after you accomplished that. – bpgergo Dec 06 '11 at 16:58
  • I think the title of the question is appropriate: "Loading Matlab sparse matrix saved with -v7.3 (HDF5) into Python and operating on it". Once the file is loaded (whether through `h5py` or `tables`), performing any operation on it seems to take forever: either operating directly on the objects, or converting them using `numpy.asarray` or `scipy.sparse.coo_matrix`. I'm presuming someone has encountered this specific problem before; files saved in that format are quite common. – tdc Dec 06 '11 at 17:13
  • Well, yes, you're right. Sorry, in the meantime I got involved in some other stuff. I hope that someone who has encountered this problem will see and answer this question. – bpgergo Dec 06 '11 at 17:27