
I want to create a NumPy kernel matrix of dimensions 25000 x 25000. I want to know the most efficient way to handle such a large matrix in terms of saving it to disk and loading it back. I tried dumping it with pickle, but it threw an error saying it cannot serialize objects larger than 4 GiB.

  • try `np.save` or `np.savez` (see the sketch after these comments) – Imtinan Azhar Mar 10 '19 at 09:20
  • Not from a lot of experience, but you might want to look at [pyarrow](https://arrow.apache.org/docs/python/numpy.html) and also at [parquet](http://parquet.apache.org/). `pyarrow` is supposed to already include Parquet support. – amitr Mar 10 '19 at 09:24
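A minimal sketch of the `np.save` / `np.load` route suggested in the first comment, using a placeholder filename `kernel.npy` and a smaller demo matrix. The binary `.npy` format has no 4 GiB pickle limit, and `mmap_mode='r'` memory-maps the file so the whole matrix never has to sit in RAM at once:

import numpy as np

# smaller matrix for the demo; the real one is 25000 x 25000
kernel = np.random.rand(1000, 1000).astype('float16')

# binary .npy round-trips dtype and shape, with no 4 GiB limit
np.save('kernel.npy', kernel)

# mmap_mode='r' memory-maps the file; slices are read lazily from disk
loaded = np.load('kernel.npy', mmap_mode='r')
print(loaded[:5, :5])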

2 Answers


You could try saving it to an HDF5 file with pandas.HDFStore():

import numpy as np
import pandas as pd

# float16 halves the footprint compared to the default float64
df = pd.DataFrame(np.random.rand(25000, 25000).astype('float16'))

# memory usage in GiB (1024**3 bytes)
memory_use = round(df.memory_usage(deep=True).sum() / 1024**3, 2)
print('uses {} GiB'.format(memory_use))

store = pd.HDFStore('test.h5', 'w')
store['data'] = df
store.close()
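To read it back, `pd.read_hdf` with the same key works; a quick sketch, assuming the `test.h5` file written above:

import pandas as pd

# 'data' is the key the DataFrame was stored under
df2 = pd.read_hdf('test.h5', 'data')
print(df2.shape)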
Zihan Yang

Why not save the array to a plain file instead of using pickle?

np.savetxt("filename", array)

It can then be read back with

np.genfromtxt("filename")
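Note that `savetxt`/`genfromtxt` write and parse plain text, which is far larger and slower than binary formats at 25000 x 25000; at minimum, a `fmt` specifier keeps the file smaller than the default `'%.18e'`. A minimal round-trip sketch with a smaller demo array and a placeholder filename:

import numpy as np

arr = np.random.rand(1000, 1000)

# limit each value to 8 significant digits to shrink the text file
np.savetxt('kernel.txt', arr, fmt='%.8g')

# np.loadtxt is usually faster than genfromtxt on well-formed files
loaded = np.loadtxt('kernel.txt')
print(np.allclose(arr, loaded, atol=1e-7))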

GILO