I'm trying to plot an empirical cumulative distribution function (CDF) of data from a 380Gb binary raster. Using just a small mask of the data, the following code works perfectly.
import numpy as np
import matplotlib.pyplot as plt
dem_name = open('./raster.dem','rb')
vals = np.fromfile(dem_name,dtype='float32')
vals = np.negative(vals[vals!=-9999])
vals = np.sort(vals)
y = np.arange(1.,len(vals)+1.)/len(vals)
plt.plot(vals,y)
However, when I try to load the whole raster using this code, it obviously gives a memory error. My computer has 9Tb of disk space but is limited to 16Gb of RAM, so I have used numpy.memmap to get the raster values into an array.
dem_name = open('./raster.dem','rb')
vals = np.memmap(dem_name,dtype='float32','r')
This works, but I need to trim the nodata values (-9999) from the raster, switch the sign of the values (negative values becomes positive) and sort the values from lowest to highest.
vals_real = np.memmap(np.sort(np.negative(vals[vals!=-9999])))
This runs for a few hours and then gives a memory error.
The y array,
y = np.arange(1.,len(vals)+1.)/len(vals)
is also too big to be stored in RAM (gives a memory error), but I can't figure out how to store the array as a memmap object.
Is it correct that in order to plotting also takes memory, such that I will need enough disk space for 2X the size of the raster file ( 2x 380Gb)?
So to summarize, I need to read the huge raster into python and plot a CDF. It's very simple with a small raster, but I've been unsuccessful making this plot with the full raster.
I hope this question is clear. Thanks in advance.