
Error:

numpy.core._exceptions.MemoryError: Unable to allocate 362. GiB for an array with shape (2700000, 18000) and data type float64

https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data

I'm working on the Netflix Prize dataset, which has a lot of movies and user IDs. My task is to apply matrix factorization, so I need to create a 2,700,000 × 18,000 matrix that stores integers in the range 1 to 5. I have tried many ways but am still unable to create a matrix of that size; I tried forcing it to uint8, but then the shape of the matrix I get is wrong. Please help me solve this.

Education 4Fun

1 Answer


Your 2,700,000 × 18,000 matrix had better be sparse, or you will need a computer with a very large amount of memory. One copy of a full dense real matrix of that size requires a few hundred GiB of contiguous space (the 362 GiB in your error message, for float64), and with the working copies most algorithms make, that can grow to a few TB.
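As a quick sanity check (a minimal sketch; the byte counts follow directly from the shape in the error message):

```python
# Dense-storage cost for a 2,700,000 x 18,000 ratings matrix.
cells = 2_700_000 * 18_000          # ~4.86e10 elements
print(cells * 8 / 2**30)            # float64: ~362 GiB, matching the error
print(cells * 1 / 2**30)            # uint8:   ~45 GiB, still too big for most machines
```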

  1. Use a more memory-efficient representation, such as a sparse matrix (scipy.sparse.csc_matrix). This helps only if most of the entries are 0, i.e. most (user, movie) pairs are unrated, which is the case for this dataset (roughly 100M ratings out of ~48.6B cells); see the construction sketch after this list.
  2. Modify your algorithm to work on submatrices, processing the data in smaller batches.
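A minimal sketch of option 1, assuming you have already parsed the rating files into (user, movie, rating) triplets (the tiny arrays below are made-up placeholders, not real Netflix data):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Placeholder triplets; in practice these come from parsing the
# Netflix Prize rating files.
user_ids  = np.array([0, 0, 1, 2], dtype=np.int32)   # row indices
movie_ids = np.array([3, 7, 3, 1], dtype=np.int32)   # column indices
ratings   = np.array([5, 3, 4, 1], dtype=np.uint8)   # 1..5 fits in uint8

# Only the observed ratings are stored: ~100M uint8 values plus index
# arrays (roughly half a GB in total), instead of 362 GiB dense.
R = csr_matrix((ratings, (user_ids, movie_ids)),
               shape=(2_700_000, 18_000))

print(R.shape, R.nnz)   # (2700000, 18000) and the number of stored ratings
```

CSR (row-major) is usually the more convenient layout here, since each row is one user's ratings; .tocsc() converts if you need column-wise access instead.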
polkas
  • And how do I access a scipy.sparse.csc_matrix line by line, or as (index, value) pairs? I tried searching, but everywhere it says to convert to a dense array or matrix, which I can't do (see the row-access sketch below). – Education 4Fun Nov 26 '22 at 10:58
  • Best would be to have the data already processed by a big server/cluster into, e.g., an npz file, which can be loaded with scipy.sparse.load_npz. I checked your reference: the training data comes in 17,000+ files, which are aggregated into 4 files. There is also a version of SVD that works iteratively and can be fed smaller data batches: https://spark.apache.org/docs/2.2.0/mllib-dimensionality-reduction.html – polkas Nov 26 '22 at 11:22
  • My goal is to apply matrix factorization to the rating matrix (see the factorization sketch below). – Education 4Fun Nov 26 '22 at 11:32
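Following up on the row-access question in the comments, a minimal sketch (the tiny matrix and the file name are placeholders): a CSR matrix can be read row by row, or element by element, without any dense conversion.

```python
import numpy as np
from scipy.sparse import csr_matrix, save_npz, load_npz

# Tiny stand-in matrix; shape and values are placeholders.
R = csr_matrix((np.array([5, 3, 4], dtype=np.uint8),
                (np.array([0, 0, 1]), np.array([3, 7, 3]))),
               shape=(3, 10))

# Persist and reload without going dense (load_npz, as in the comment above).
save_npz("ratings.npz", R)
R = load_npz("ratings.npz")

# One user's row as a sparse 1 x n_movies slice:
row = R.getrow(0)
for movie, rating in zip(row.indices, row.data):
    print(movie, rating)        # (column index, stored rating) pairs

# Single-element access also works, e.g. one (user, movie) pair:
print(R[0, 3])                  # -> 5
```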
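And a minimal matrix-factorization sketch for the stated goal, using plain SGD over the observed triplets only (toy sizes and assumed hyperparameters; this is a generic baseline, not the Spark routine linked above). The factor matrices P and Q are the only dense arrays, and at the real sizes (2,700,000 × k and 18,000 × k) they fit comfortably in memory for small k.

```python
import numpy as np

# Toy sizes for a runnable demo; the real run would use
# n_users=2_700_000, n_movies=18_000, and the ~100M parsed triplets.
n_users, n_movies, k = 5, 4, 2
triplets = [(0, 1, 5), (0, 3, 3), (2, 1, 4), (4, 0, 1)]  # (user, movie, rating)

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))    # user latent factors
Q = rng.normal(scale=0.1, size=(n_movies, k))   # movie latent factors
lr, reg = 0.05, 0.02                            # assumed hyperparameters

for epoch in range(200):
    for u, m, r in triplets:
        pu, qm = P[u].copy(), Q[m].copy()
        err = r - pu @ qm                       # error on one observed rating
        P[u] += lr * (err * qm - reg * pu)      # gradient step, user side
        Q[m] += lr * (err * pu - reg * qm)      # gradient step, movie side

print(P[0] @ Q[1])   # prediction for (user 0, movie 1), trained toward 5
```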