I'm putting this as an answer since there's more than will fit in a comment, although it may not be complete. There are two suspicious things. First, percentile should run fine on a 20 GB matrix if your machine really has 200 GB of RAM available. That's a lot of memory, so start looking into what else might be using it. Start with top: is there another process eating it, or is your Python program itself using all of it?
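For instance, here's a quick way to watch it from inside Python (a minimal sketch; psutil is a third-party package, and top or htop will show you the same numbers):

import psutil

## Resident memory of the current Python process
print(psutil.Process().memory_info().rss / 1e9, "GB resident")
## System-wide memory
vm = psutil.virtual_memory()
print(vm.available / 1e9, "GB available of", vm.total / 1e9, "GB total")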
The second suspicious thing is that the documentation for utils.percentile doesn't match its actual behavior. Here are the relevant bits from the code you've linked to:
def percentile(matrix, p):
    """
    Estimation of percentile without zeros.
    ....
    Returns
    -------
    float
        Calculated percentile.
    """
    return np.percentile(matrix[np.any(matrix > 0, axis=1)], p, axis=0)
What it actually does is return the (column-wise) percentile calculated over the rows that contain at least one positive element. If all values are non-negative, that's the same as dropping the all-zero rows, but in general it can give a very different result.
np.any(matrix > 0, axis=1)
returns a boolean array that selects the rows containing at least one positive element. For example:
>>> np.any(np.array([[3, 4], [0, 0]]) > 0, axis=1)
array([ True, False])
>>> np.any(np.array([[3, 4], [1, 0]]) > 0, axis=1)
array([ True,  True])
>>> np.any(np.array([[3, 0], [1, 0]]) > 0, axis=1)
array([ True,  True])
That array is used to index matrix, which selects only the rows containing at least one positive element and returns those.
If you aren't familiar with that style of indexing, read over the numpy docs on boolean indexing.
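Putting the two steps together on a toy matrix (just an illustration, including a row with no positive values to show why "at least one positive element" matters):

import numpy as np

m = np.array([[ 3, 4],
              [ 0, 0],
              [-1, 0]])
mask = np.any(m > 0, axis=1)   ## one boolean per row
print(mask)                    ## [ True False False]
print(m[mask])                 ## [[3 4]] - the other two rows are dropped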
Calculating that takes a lot of memory: matrix > 0 creates a boolean array of the same shape as matrix (one byte per element), then the indexing creates a copy of matrix which probably contains most of the rows. So, probably 2-4 GB for the boolean array and close to 20 GB for the copy.
That can be reduced by working one row or column at a time:

## Build the row mask one row at a time to limit peak memory
mask = np.array([np.any(r > 0) for r in matrix])
## Percentile of each column over the rows with at least one positive element
perc = [np.percentile(c[mask], p) for c in matrix.T]
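Each comprehension only touches one row or one column at a time, so the extra memory peaks at roughly one column copy plus the mask, instead of a multi-gigabyte boolean array and a near-full copy of matrix. It's slower than the fully vectorized version, but it shouldn't exhaust RAM.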
However, as noted above, that doesn't match the function's documentation. There may be a reason for this logic, but it is odd. If you don't know the reason for it, you might be fine calling np.percentile directly; just check that it returns a close value on a smaller subset of your data.
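That check could look something like this (a sketch; matrix and p are your existing variables, and the sample size of 10000 rows is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
## Compare the two calls on a random subset of rows before committing
rows = rng.choice(matrix.shape[0], size=10000, replace=False)
sample = matrix[rows]
masked = np.percentile(sample[np.any(sample > 0, axis=1)], p, axis=0)
plain = np.percentile(sample, p, axis=0)
print(np.max(np.abs(masked - plain)))   ## small means the filter barely matters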
There's also np.nanpercentile, which can be used the same way but ignores nan values. You can use boolean indexing to replace the values you don't want included with nan (e.g. matrix[matrix < 0] = np.nan; note that the array needs a float dtype to hold nan) and then call that.
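Applied to the row filter above, that could look like this (a sketch; it modifies matrix in place, so work on a copy if you still need the original values):

import numpy as np

## Blank out entire rows that have no positive element, mirroring the
## original function's filter; assumes matrix already has a float dtype
matrix[~np.any(matrix > 0, axis=1)] = np.nan
perc = np.nanpercentile(matrix, p, axis=0)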