
I'd like to convert CSV files into HDF5 format for Caffe training. Because the CSV file is 80 GB, the script raises a memory error even though the machine has 128 GB of RAM. Is it possible to improve my code so that it handles the data one chunk at a time? Below is my code; it reports the memory error at the np.array call.

import sys
import os
import math
import pandas as pd
import numpy as np
import h5py

if '__main__' == __name__:
    print 'Loading...'
    day = sys.argv[1]
    file = day + ".xls"            # CSV data for the given day (named with an .xls extension)
    data = pd.read_csv(file, header=None)
    print data.iloc[0, 1:5]

    # first column is the label, the remaining columns are the features
    y = np.array(data.iloc[:, 0], np.float32)
    x = np.array(data.iloc[:, 1:], np.float32)

    patch = 100000

    dirname = "hdf5_" + day
    os.mkdir(dirname)
    filename = dirname + "/hdf5.txt"
    modelname = dirname + "/data"
    file_w = open(filename, 'w')
    # write each patch of 100000 rows to its own HDF5 file
    for idx in range(int(math.ceil(y.shape[0] * 1.0 / patch))):
        with h5py.File(modelname + str(idx) + '.h5', 'w') as f:
            d_begin = idx * patch
            d_end = min(y.shape[0], (idx + 1) * patch)
            f['data'] = x[d_begin:d_end, :]
            f['label'] = y[d_begin:d_end]
        # list every HDF5 file so Caffe's HDF5Data layer can read them
        file_w.write(modelname + str(idx) + '.h5\n')
    file_w.close()
刘米兰

2 Answers


The best approach would be to read n lines and then write these to the HDF5 file, extending it by n elements each time. This way the amount of memory needed does not depend on the size of the CSV file. You could read a line at a time as well, but that would be slightly less efficient.

Here's code that applies this process for reading weather station data: https://github.com/HDFGroup/datacontainer/blob/master/util/ghcn/convert_ghcn.py.
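
For illustration, here is a minimal sketch of that approach (it is not the linked script; the file names, the 100000-row chunk size, and the label-in-first-column layout are assumptions carried over from the question). It reads the CSV in chunks with pandas and appends each chunk to resizable datasets in a single HDF5 file:

import pandas as pd
import h5py

csv_file = "day.xls"      # assumed input file, named as in the question
h5_file = "day.h5"        # single output file with extendable datasets
chunk_rows = 100000

with h5py.File(h5_file, 'w') as f:
    data_dset = None
    label_dset = None
    for chunk in pd.read_csv(csv_file, header=None, chunksize=chunk_rows):
        # first column is the label, the remaining columns are the features
        y = chunk.iloc[:, 0].values.astype('float32')
        x = chunk.iloc[:, 1:].values.astype('float32')
        if data_dset is None:
            # create datasets whose first dimension can grow (maxshape uses None)
            data_dset = f.create_dataset('data', shape=x.shape, dtype='float32',
                                         maxshape=(None, x.shape[1]))
            label_dset = f.create_dataset('label', shape=y.shape, dtype='float32',
                                          maxshape=(None,))
            data_dset[:] = x
            label_dset[:] = y
        else:
            # extend both datasets by the number of rows in this chunk
            old = data_dset.shape[0]
            data_dset.resize(old + x.shape[0], axis=0)
            label_dset.resize(old + y.shape[0], axis=0)
            data_dset[old:] = x
            label_dset[old:] = y

Only one chunk is held in memory at a time, so peak memory stays around the size of a single chunk rather than the whole 80 GB file.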

John Readey

Actually, since you treat chunks of size 100000 separately, there is no need to load the whole CSV at once. The chunksize option of read_csv is exactly for this case.

When you specify chunksize, read_csv becomes an iterator that returns DataFrames of chunksize rows each. You can iterate over it instead of slicing the arrays each time.

Leaving aside the lines that set up the various variables, your code should look more like this:

chunks = pd.read_csv(file, header=None, chunksize=100000)

file_w = open(filename, 'w')
for chunk_number, data in enumerate(chunks):
    # each chunk is a DataFrame of at most 100000 rows
    y = np.array(data.iloc[:, 0], np.float32)
    x = np.array(data.iloc[:, 1:], np.float32)

    with h5py.File(modelname + str(chunk_number) + '.h5', 'w') as f:
        f['data'] = x
        f['label'] = y
    file_w.write(modelname + str(chunk_number) + '.h5\n')
file_w.close()
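
Note that hdf5.txt is opened once before the loop and closed after it, so the name of every chunk file gets appended to the list; that list is what you point Caffe's HDF5Data layer at as its source.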
tmrlvi