
My code is shown below:

from sklearn.datasets import load_svmlight_files
import numpy as np

perm1 = np.random.permutation(25000)
perm2 = np.random.permutation(25000)

X_tr, y_tr, X_te, y_te = load_svmlight_files(("dir/file.feat", "dir/file.feat"))

# randomly shuffle the data
X_train = X_tr[perm1, :].toarray()[:, 0:2000]
y_train = y_tr[perm1] > 5  # turn into a binary problem

The code works fine up to this point, but when I try to convert one more sparse matrix to a dense array, my program raises a MemoryError.

Code:

X_test = X_te[perm2, :].toarray()[:, 0:2000]

Error:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-7-31f5e4f6b00c> in <module>()
----> 1 X_test = X_test.toarray()

C:\Users\Asq\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\compressed.pyc in toarray(self, order, out)
    788     def toarray(self, order=None, out=None):
    789         """See the docstring for `spmatrix.toarray`."""
--> 790         return self.tocoo(copy=False).toarray(order=order, out=out)
    791 
    792     ##############################################################

C:\Users\Asq\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\coo.pyc in toarray(self, order, out)
    237     def toarray(self, order=None, out=None):
    238         """See the docstring for `spmatrix.toarray`."""
--> 239         B = self._process_toarray_args(order, out)
    240         fortran = int(B.flags.f_contiguous)
    241         if not fortran and not B.flags.c_contiguous:

C:\Users\Asq\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\base.pyc in _process_toarray_args(self, order, out)
    697             return out
    698         else:
--> 699             return np.zeros(self.shape, dtype=self.dtype, order=order)
    700 
    701 

MemoryError: 

I'm new to Python, and I don't know whether I need to manage memory manually to fix this.

Other parts of my code raise the same error (e.g. training with kNN or an ANN).

How can I fix this?

Asqan
  • You probably exhausted your system's available memory. Buy more or allocate more (swap/paging). – Brian Cain May 26 '14 at 23:55
  • I use Windows and swap memory is now extended to 4 GB. My RAM is 8 GB, and Python currently uses about 2.5 GB of it (with just the code up to this point having run). – Asqan May 26 '14 at 23:58
  • It would be helpful if you could replace the line in your code that loads the svm data by setting these variables to something random with the same shape and matrix type so that one can try to reproduce the problem by copying and pasting. If you are unable to do this, at least provide the shapes of the arrays. – eickenberg May 27 '14 at 06:20

2 Answers


In cases like these, it's often possible to avoid converting your sparse matrices to dense format.

For example, you can do the permutation and slice easily with CSR or CSC sparse formats.
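A minimal sketch of the idea (with made-up shapes; `load_svmlight_files` returns CSR matrices by default, so your real data should behave the same way):

import numpy as np
from scipy import sparse

# Stand-in for the loaded test data: a 25000-row CSR matrix of hypothetical width
X_te = sparse.rand(25000, 50000, density=0.01, format="csr")
perm2 = np.random.permutation(25000)

# Row permutation and column slicing both work directly on the sparse matrix,
# so no dense array is ever allocated
X_test = X_te[perm2, :][:, 0:2000]
print(type(X_test), X_test.shape)  # still sparse: (25000, 2000)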

You haven't posted the code that follows, but I suspect it can be made to handle sparse input as well. If that's true, your memory issues will no longer be a problem.
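For instance (a hedged sketch with hypothetical data, not code from the question), scikit-learn's KNeighborsClassifier accepts sparse CSR input directly, so the dense conversion can be skipped entirely:

import numpy as np
from scipy import sparse
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical sparse training data and binary labels
X_train = sparse.rand(1000, 2000, density=0.01, format="csr")
y_train = np.random.rand(1000) > 0.5

# fit() accepts the sparse matrix as-is; no .toarray() needed
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.predict(X_train[:5]))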

perimosocordiae
  • Your suggestion works as long as I don't really need the dense format. But I need to [scale](http://scikit-learn.org/stable/modules/preprocessing.html) the data, and some machine learning algorithms require a dense format. I'm afraid your suggestion is the only way to avoid the memory error, but then I won't be able to use the algorithms I wanted. – Asqan May 27 '14 at 01:33
  • 2
    @Asqan Whether you need to scale depends on the nature of the data. Sparse data are often histograms, and those should be L2-normalized rather than scaled. L2 normalization preserves sparsity. – Fred Foo May 27 '14 at 09:42
  • I know this is an old question, but my first reaction was to look for a way to assign a dtype other than float64; toarray() fills with 0.0 in float64 format. Is there a way to do that? – MehmedB Jan 21 '20 at 11:01
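A small sketch of that normalization suggestion (hypothetical data; sklearn.preprocessing.normalize keeps sparse input sparse with the default L2 norm):

from scipy import sparse
from sklearn.preprocessing import normalize

# Hypothetical sparse feature matrix
X = sparse.rand(1000, 2000, density=0.01, format="csr")

# L2-normalize each row; the result is still a sparse CSR matrix,
# unlike mean-centering scalers, which would destroy sparsity
X_norm = normalize(X, norm="l2", axis=1)
print(type(X_norm))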

Use numpy.asarray() for in-place conversion instead of toarray(), which requires new memory.

Rohit Kumar