
I need to create about 2 million vectors with 1,000 slots each (each slot just holds an integer).

What would be the best data structure for working with this amount of data? It could be that I'm over-estimating the amount of processing/memory involved.

I need to iterate over a collection of files (about 34.5 GB in total) and update the vectors each time one of the 2 million items (each corresponding to a vector) is encountered on a line.

I could easily write code for this, but I doubt it would be efficient enough to handle this volume of data, which is why I'm asking you experts. :)
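
Roughly, the straightforward version I have in mind is something like the following (the parsing and the choice of slot are placeholders, not my real format):

    from collections import defaultdict

    N_SLOTS = 1000

    # one 1000-slot counter list per item, created lazily the first time
    # an item turns up on a line
    vectors = defaultdict(lambda: [0] * N_SLOTS)

    def update(item_id, slot):
        vectors[item_id][slot] += 1

    update('item-123456', 7)   # stand-in for the real per-line update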

Best, Georgina

3 Answers


You might be memory-bound on your machine. Without cleaning up running programs, something like:

    import numpy

    # 10**6 x 10**3 64-bit integers is already on the order of 8 GB
    a = numpy.zeros((1000000, 1000), dtype=int)

wouldn't fit into memory. But in general, if you can break the problem up so that you don't need the entire array in memory at once, or if you can use a sparse representation, I would go with numpy (and scipy for the sparse representation).
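
For example, a rough sketch of the chunked approach (the block size and how you fill each block are assumptions, not something from the question):

    import numpy

    n_items, n_slots = 2000000, 1000
    chunk_rows = 100000  # rows held in RAM at any one time

    # process the big array one block of rows at a time, so only
    # chunk_rows x n_slots integers are ever in memory
    for start in range(0, n_items, chunk_rows):
        stop = min(start + chunk_rows, n_items)
        block = numpy.zeros((stop - start, n_slots), dtype=numpy.int32)
        # ... fill `block` from the files for items in [start, stop) ...
        numpy.save('block_%d.npy' % start, block)  # persist before moving on

This assumes each pass over the files can be restricted to the items in the current block.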

Also, you could think about storing the data on disk in HDF5 with h5py or PyTables, or in netCDF4 with netcdf4-python, and then accessing only the portions you need.
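
A minimal h5py sketch, assuming a file name, dataset layout, and row-buffered updates that are not from the answer itself:

    import numpy
    import h5py

    with h5py.File('vectors.h5', 'w') as f:
        # 2,000,000 x 1,000 int32 dataset that lives on disk, chunked by row
        counts = f.create_dataset('counts', shape=(2000000, 1000),
                                  dtype='int32', chunks=(1, 1000))

        # accumulate updates for one item in memory, then write the whole row;
        # touching single HDF5 elements one at a time is much slower
        row = numpy.zeros(1000, dtype='int32')
        row[42] += 1                 # hypothetical slot update
        counts[123, :] = row         # write the finished row back to disk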

JoshAdel
  • scipy has specific structures for sparse matrices; try http://docs.scipy.org/doc/scipy/reference/sparse.html – renatopp Mar 22 '11 at 21:26

Use a sparse matrix assuming most entries are 0.
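
For instance, a sketch with scipy.sparse (dok_matrix is just one choice; the item id and slot below are made up):

    from scipy import sparse

    # dok_matrix stores only the nonzero entries, so memory scales with the
    # number of (item, slot) pairs actually seen, not with 2e6 x 1e3 cells
    counts = sparse.dok_matrix((2000000, 1000), dtype=int)

    # hypothetical update while scanning a line of one of the files
    item_id, slot = 123456, 7
    counts[item_id, slot] += 1

    # convert once construction is finished, for fast row access and arithmetic
    counts = counts.tocsr()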

Jeroen Dirks

If you need to work in RAM, try the scipy.sparse matrix variants. The module includes algorithms to efficiently manipulate sparse matrices.
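
A small illustration of picking a variant per task, with made-up values: lil_matrix for incremental construction, then CSR for the later arithmetic:

    from scipy import sparse

    # lil_matrix is the variant meant for building a matrix up entry by entry
    m = sparse.lil_matrix((2000000, 1000), dtype=int)
    m[0, 10] = 3
    m[0, 999] = 1

    # CSR is the variant to use for arithmetic and row slicing
    csr = m.tocsr()
    totals = csr.sum(axis=1)   # per-item totals across all 1000 slots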

samplebias