To store big matrices on disk I use numpy.memmap.

Here is some sample code to test big-matrix multiplication:

    import numpy as np
    import time

    rows = 10000  # it can be much larger, for example 1,000,000
    cols = 1000

    # create some data in memory
    data = np.arange(rows * cols, dtype='float32')
    data.resize((rows, cols))

    # create files on disk
    fp0 = np.memmap('C:/data_0', dtype='float32', mode='w+', shape=(rows, cols))
    fp1 = np.memmap('C:/data_1', dtype='float32', mode='w+', shape=(rows, cols))

    fp0[:] = data[:]
    fp1[:] = data[:]

    # matrix transpose test
    tr = np.memmap('C:/data_tr', dtype='float32', mode='w+', shape=(cols, rows))
    tr = np.transpose(fp1)  # memory consumption? (this rebinds tr to a view of fp1)
    print(fp1.shape)
    print(tr.shape)

    res = np.memmap('C:/data_res', dtype='float32', mode='w+', shape=(rows, rows))
    t0 = time.time()
    # redefinition? res = np.dot(fp0, tr)  # takes 342 seconds on my machine; if I multiply the matrices in RAM it takes 345 seconds (I think that's a strange result)
    res[:] = np.dot(fp0, tr)  # assignment?
    print(res.shape)
    print(time.time() - t0)

So my questions are:

  1. How can I restrict the memory consumption of an application that uses this procedure to some value, for example 100 MB (or 1 GB, or something else)? Also, I don't understand how to estimate the memory consumption of the procedure (I think memory is only allocated when the "data" variable is created, but how much memory is used when we use memmap files?)
  2. Maybe there is some optimal solution for multiplying big matrices stored on disk? For example, maybe the data is not optimally stored on disk or read from disk, or is not properly cached, and the dot product uses only one core. Maybe I should use something like PyTables?

I am also interested in algorithms for solving linear systems of equations (SVD and others) with restricted memory usage. Maybe these algorithms are called out-of-core or iterative, and I think there is some analogy like hard drive<->RAM, GPU RAM<->CPU RAM, CPU RAM<->CPU cache.

Also, here I found some info about matrix multiplication in PyTables.

I also found this for R, but I need it for Python or MATLAB.

mrgloom
  • "How to restrict memory consumtion of aplication which is using this procedure to some value for example to 100Mb" You mean that if the application tries to use more memory it should fail? Using `psutil.set_rlimit` it's easy, but AFAIK it works only on linux. – Bakuriu Oct 14 '13 at 11:39
  • No, I mean that application must work as normal but use less than declared memory(generally speaking it will be more slowly with less memory, but it usefull when we want to restrict application memory usage or if we haven't enough memory to fit whole matrix). And I work on Windows. – mrgloom Oct 14 '13 at 11:53
  • Your `res` line does not make sense (and res is the largest array...). Reread the `np.dot` docstring, you will find something useful... – seberg Oct 15 '13 at 09:03
  • what do you mean? output value must be "C-contiguous" ? and np.memmap is not suitable for this? or what? – mrgloom Oct 15 '13 at 10:10
  • or you talking about redefinition? and it must be res[:]= ? – mrgloom Oct 15 '13 at 10:12
  • 1
    "Maybe there is some optimal solution for multiplication of big matrices stored on disk?" Hadoop http://es.wikipedia.org/wiki/MapReduce http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ – inigoD Oct 15 '13 at 11:44
  • I mean that you are **not** using the memmap, because to do that you would have to use the `out` argument. So no `res[...] = ...` would *not* do it. – seberg Oct 15 '13 at 12:26
  • Oh, and while this is better in newer versions of numpy, while the first transpose does not copy the data, dot may expect C-contiguous arrays also for the input, so that you should probably not transpose it (or safe in fortran order, and then transpose to c order). Of course there may be better options generally. – seberg Oct 15 '13 at 13:33
  • I don't understand, it must be np.dot(fp0,tr,res) ? – mrgloom Oct 18 '13 at 06:08
  • You can limit the resources available to your program using the [AppVerifier](https://msdn.microsoft.com/en-us/library/aa480483.aspx) tool. Set the private working set size. – RolKau Sep 15 '15 at 08:17
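
To make the comment thread concrete, here is a minimal sketch of what it converges on: passing the result memmap as `np.dot`'s `out` argument, so the product is written directly into the file-backed array instead of into a freshly allocated in-memory array (file names and shapes are the ones from the question):

    import numpy as np

    rows, cols = 10000, 1000
    fp0 = np.memmap('C:/data_0', dtype='float32', mode='r', shape=(rows, cols))
    fp1 = np.memmap('C:/data_1', dtype='float32', mode='r', shape=(rows, cols))
    res = np.memmap('C:/data_res', dtype='float32', mode='w+', shape=(rows, rows))

    # out must already have the matching shape and dtype and be C-contiguous,
    # otherwise np.dot raises a ValueError
    np.dot(fp0, fp1.T, out=res)
    res.flush()  # make sure the result actually reaches the file

Note that `fp1.T` is only a view; as seberg's comment points out, `dot` may still buffer the non-contiguous input internally.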

2 Answers


Dask.array provides a numpy interface to large on-disk arrays using blocked algorithms and task scheduling. It can easily do out-of-core matrix multiplies and other simple-ish numpy operations.
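
A minimal sketch of an out-of-core multiply using the question's memmaps (the chunk sizes here are illustrative, not tuned):

    import numpy as np
    import dask.array as da

    rows, cols = 10000, 1000
    fp0 = np.memmap('C:/data_0', dtype='float32', mode='r', shape=(rows, cols))
    fp1 = np.memmap('C:/data_1', dtype='float32', mode='r', shape=(rows, cols))

    # wrap the memmaps in dask arrays; work proceeds one chunk at a time
    a = da.from_array(fp0, chunks=(1000, 1000))
    b = da.from_array(fp1, chunks=(1000, 1000))

    product = a.dot(b.T)  # lazy: builds a task graph, computes nothing yet

    # stream the blocked result into another memmap, block by block
    res = np.memmap('C:/data_res', dtype='float32', mode='w+', shape=(rows, rows))
    da.store(product, res)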

Blocked linear algebra is harder and you might want to check out some of the academic work on this topic. Dask does support QR and SVD factorizations on tall-and-skinny matrices.
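
For instance, a tall-and-skinny SVD might look like this (a rough sketch; the sizes are placeholders, and this algorithm wants the matrix chunked along a single axis):

    import dask.array as da

    # tall-and-skinny: many rows, few columns; chunk only along the rows
    x = da.random.random((1000000, 100), chunks=(10000, 100))
    u, s, v = da.linalg.svd(x)
    print(s.compute())  # evaluates the blocked algorithm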

Regardless, for large arrays you really want blocked algorithms, not naive traversals, which will hit disk in unpleasant ways.

MRocklin

Consider using NumExpr for your processing: https://github.com/pydata/numexpr

... internally, NumExpr employs its own vectorized virtual machine that is designed around a chunked-read strategy, in order to efficiently operate on optimally-sized blocks of data in memory. It can handily beat naïve NumPy operations if tuned properly.
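
For instance (a minimal sketch; note that NumExpr accelerates elementwise expressions, not the dot product itself):

    import numpy as np
    import numexpr as ne

    a = np.arange(1e6, dtype='float32')
    b = np.arange(1e6, dtype='float32')

    # evaluated in cache-sized blocks by NumExpr's virtual machine,
    # without materializing large intermediate temporaries
    result = ne.evaluate('2 * a * b + 1')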

NumExpr may cover #2 in your breakdown of the issue. If you address #1 by using a streamable binary format, you can then use a chunked-read approach when loading your data files – like so:

    with open('path/to/your-data.bin', 'rb') as binary:
        while True:
            chunk = binary.read(4096)  # read one fixed-size block, or what have you
            if not chunk:
                break
            # ... process the chunk here ...
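
Inside that loop, each raw chunk can be viewed as numbers without copying – assuming, as in the question, raw float32 values and a chunk size that is a multiple of 4 bytes:

    values = np.frombuffer(chunk, dtype='float32')  # zero-copy view of the bytes (numpy imported as np)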

If that is too low-level for you, I would recommend you look at the HDF5 library and format: http://www.h5py.org – it’s the best solution for the binary serialization of NumPy-based structures that I know of. The h5py module supports compression, chunked reading, dtypes, metadata… you name it.
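
A minimal h5py sketch (the file name and chunk shape are illustrative) that writes the question's matrix into a chunked, compressed dataset and reads one block back:

    import numpy as np
    import h5py

    data = np.arange(10000 * 1000, dtype='float32').reshape(10000, 1000)

    with h5py.File('data.h5', 'w') as f:
        f.create_dataset('data', data=data,
                         chunks=(1000, 1000), compression='gzip')

    with h5py.File('data.h5', 'r') as f:
        block = f['data'][:1000, :1000]  # reads only the chunks it needs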

Good luck!

fish2000
  • I don't quite understand where numexpr can be used in matrix multiplication; can you expand your answer and provide an example of usage? Yes, I know about HDF5 and PyTables, but numpy.memmap is more convenient because it can be used like a usual numpy array. – mrgloom May 08 '14 at 12:01
  • `numpy.memmap` is an all-or-none operation, yes?… I was suggesting HDF5 as it supports chunking and streaming I/O: http://www.hdfgroup.org/HDF5/doc/Advanced/Chunking/index.html – while simply loading the same data with HDF5 may be slower than memory-mapped files, feeding chunked reads into a chunk-based processing loop may be more efficient in the manner you're looking for. – fish2000 May 08 '14 at 12:09
  • Regarding NumExpr, I don’t know if it offers a dot-product operator but I do know that it will give you faster view operations – if you were to redo your array ops using NumExpr, you might not need e.g. the memory-consuming transposition. – fish2000 May 08 '14 at 12:15