0

I need to compute TheilSen regression slope and intercept between values of two rasters in a GRASS GIS python script. The two rasters in this example (xtile and ytile) are both of the same dimensions 250x250 pixels and contain nodata (null) values. I have so far used grass.script alone, so I am new to scipy. I tried to read some tutorials and based on that came up with following code I tried on command-line:

>>> from scipy import stats
>>> import grass.script as grass
>>> import grass.script.array as garray
>>> x = garray.array(mapname="xtile")
>>> y = garray.array(mapname="ytile")
>>> res_list = stats.theilslopes(y, x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/scipy/stats/_stats_mstats_common.py", line 214, in theilslopes
    deltax = x[:, np.newaxis] - x
MemoryError

Evidently, it would not be so simple. Edit: I removed my thoughts about the problem being dimensionality of the array, I was wrong. Now it seems the 250x250 array size is simply too much. Is it so? Any idea how to evade this?

Then there is another problem it seems. When I try to print the array x,

>>> print x
[[  0.   0.   0. ...   0.   0.   0.]
 [  0.   0.   0. ...   0.   0.   0.]
 [  0. 402.   0. ...   0.   0.   0.]
 ...
 [  0.   0.   0. ...   0.   0.   0.]
 [  0.   0.   0. ...   0.   0.   0.]
 [  0.   0.   0. ...   0.   0.   0.]]

it appears all the nodata values from the raster got read as zeroes into the array. In the raster in question there is majority of nodata (or null as named in GRASS) pixels, which should be ignored in the regression, ie. if any value within the rasters x or y is nodata, corresponding pair of x,y data should not be used in the regression computation. Is it possible to define nodata values in the arrays so that these would be ignored directly in the manner described, or is it needded to first filter out the nodata pairs from the pair of arrays?

Thank you.

Tomas_IV
  • 101
  • 2

1 Answers1

0

I am not sure it is appropriate to answer my own question, but maybe it would be useful for others with similar problem.

I was able to finally make it work. The working (i.e. it computes something) modified example solves both problems mentioned in original question:

>>> from scipy import stats
>>> import grass.script as grass
>>> import grass.script.array as garray
>>> x = garray.array(mapname="xtile").reshape(-1)
>>> y = garray.array(mapname="ytile").reshape(-1)
>>> # Now the null raster values are changed to zeroes. 
>>> # Let us filter them out by pairs of x, y values for any pair which contains a zero value.
>>> xfiltered = np.array([])
>>> yfiltered = np.array([])
>>> i = 0
>>> for xi in np.nditer(x):
...    if x[i] > 0:
...       if y[i] > 0:
...          xfiltered = np.append(xfiltered, [x[i]])
...          yfiltered = np.append(yfiltered, [y[i]])
...    i += 1
... 
>>> # Compute the regression.
>>> res_list = stats.theilslopes(yfiltered, xfiltered)
>>> res_list
(0.8738738738738738, -26.207207207207148, 0.8327338129496403, 0.9155844155844156)

Explanation: I filtered out all zero values which were nodata in the original rasters (and possibly negative values as well, there should not be any and if they were, it would mean defective data - physical meaning of the data is quantized reflectance) before regression computation. The use of reshape was not necessary for the stats.theilslopes, as I experimentally verified with a test data, but it makes the filtering of the two arrays much easier.

Now, I am still not sure why the filtering was needed for the stats.theilslopes to finish without errors (although with all the zeroes in the data the results would be wrong anyway). It could be, that filtering out the zeroes just made the the set small enough to fit in memory, but i do not believe so. I think that the majority zero values make it impossible to compute median slope of the pairs of x,y points, since in case majority of points have identical x,y values, then also majority of their pairs have undefined slope and median is about the majority. But it could be something entirely else.

Also, as I am not very skilled Python programmer, maybe the way I did this is not the most efficient one. That could be corrected by others.

Last remark, I probably have the x and y data reversed, if y are the dependent and x the independent variables. Intuitively I felt they should be in y,x order, but I see in all the tutorials and docs it is always x,y. I left it as it was in the original question, since you can search the inverse formula x=f(y) regression line in some cases and this is not part of the problem I was trying to solve.

Tomas_IV
  • 101
  • 2