11

I am looking for an efficient way to feed a 20 GB raster data file (GeoTIFF) into PyTables for further out-of-core computation.

Currently I am reading it as a NumPy array using GDAL and writing the NumPy array into PyTables using the code below:

import gdal, numpy as np, tables as tb

inraster = gdal.Open('infile.tif').ReadAsArray().astype(np.float32)
f = tb.openFile('myhdf.h5','w')
dataset = f.createCArray(f.root, 'mydata', atom=tb.Float32Atom(), shape=np.shape(inraster))
dataset[:] = inraster
dataset.flush()
dataset.close()
f.close()
inraster = None

Unfortunately, since my input file is extremely large, my PC raises a MemoryError while reading it as a NumPy array. Is there an alternative way to feed the data into PyTables, or any suggestions to improve my code?

  • 2
    I don't have time for a complete answer right now, but basically, you just need to read the geotiff into memory in chunks. `ReadAsArray` takes optional xoffset, yoffset, xwindow, and ywindow parameters (in "pixel" units) that allow you to read in only a subset of the array. You can create the "full" PyTables array, and then read each row/whatever of the geotiff into memory and assign the corresponding row/whatever of the PyTables array. Hopefully that helps, at any rate! – Joe Kington Mar 07 '14 at 18:41
  • I really did not know how to combine the multiple arrays that NumPy reads from parts of the whole image into one single array in PyTables. Your complete answer would be highly appreciated. –  Apr 26 '14 at 05:31
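For reference, here is a minimal sketch of the row-by-row approach from the comment above. It assumes a single-band GeoTIFF (a multi-band file would add a leading band dimension, as in the answer below), and the one-row read is only for illustration; larger blocks will be faster:

import gdal, numpy as np, tables as tb

ds = gdal.Open('infile.tif')
f = tb.openFile('myhdf.h5', 'w')
# create the full-size array on disk first, then fill it one row at a time
dataset = f.createCArray(f.root, 'mydata', atom=tb.Float32Atom(),
                         shape=(ds.RasterYSize, ds.RasterXSize))
for y in range(ds.RasterYSize):
    # read a single row of the raster into memory and assign it to the matching row on disk
    row = ds.ReadAsArray(xoff=0, yoff=y, xsize=ds.RasterXSize, ysize=1)
    dataset[y, :] = row[0].astype(np.float32)
f.close()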

1 Answer

9

I do not have a GeoTIFF file, so I fiddled around with a normal three-band TIFF file. You may have to omit the 3 in the shape and the band slice when writing the data to the PyTables file. Essentially, I loop over the array without reading everything into memory in one go. You have to adjust n_chunks so that the chunk that gets read in one go does not exceed your system memory.

import gdal, numpy as np, tables as tb

ds = gdal.Open('infile.tif')

x_total, y_total = ds.RasterXSize, ds.RasterYSize

n_chunks = 100

f = tb.openFile('myhdf.h5', 'w')
dataset = f.createCArray(f.root, 'mydata', atom=tb.Float32Atom(),
                         shape=(3, y_total, x_total))

# prepare the chunk indices; list() so the inner loop can reuse y_offsets
x_offsets = np.linspace(0, x_total, n_chunks).astype(int)
x_offsets = list(zip(x_offsets[:-1], x_offsets[1:]))
y_offsets = np.linspace(0, y_total, n_chunks).astype(int)
y_offsets = list(zip(y_offsets[:-1], y_offsets[1:]))

for x1, x2 in x_offsets:
    for y1, y2 in y_offsets:
        # read only this window of the geotiff and write it to the matching window on disk
        dataset[:, y1:y2, x1:x2] = ds.ReadAsArray(xoff=x1, yoff=y1, xsize=x2 - x1, ysize=y2 - y1)

f.close()
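Once the data is in the HDF5 file, the same slicing pattern supports the out-of-core computation itself. A minimal sketch, assuming the three-band layout from the code above and an arbitrary block height of 1000 rows, that computes the global mean without ever loading the full array:

import numpy as np, tables as tb

f = tb.openFile('myhdf.h5', 'r')
dataset = f.root.mydata
total, count = 0.0, 0
for y1 in range(0, dataset.shape[1], 1000):
    # each slice pulls only one block of rows from disk into memory
    block = dataset[:, y1:y1 + 1000, :]
    total += block.sum(dtype=np.float64)
    count += block.size
print(total / count)
f.close()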
Ben K.