1

I am using the Python library:

https://github.com/ficusss/PyGMNormalize

to normalize my datasets (scRNA-seq), and the last line in the library's file utils.py:

https://github.com/ficusss/PyGMNormalize/blob/master/pygmnormalize/utils.py

uses too much memory:

np.percentile(matrix[np.any(matrix > 0, axis=1)], p, axis=0)

Is there a good way of rewriting this line of code to improve the memory usage? I have 200 GB of RAM accessible on the cluster, and with a matrix of roughly 20 GB this line fails to work, but I believe there should be a way to make it work.

Nikita Vlasenko
  • Here's a question I recently answered that had similar problems (it used `np.median`, which calls percentile internally). Basically you need to take a per-column percentile to keep it from hogging RAM, something like `[np.percentile(c, p) for c in cmatrix[np.any(matrix > 0, axis=1)]].T` (https://stackoverflow.com/questions/52102181/optimizing-my-large-data-code-with-little-ram) – user2699 Sep 10 '18 at 22:37
  • Ok, then what is `cmatrix`? Just `matrix.T`? Or just a typo? – Nikita Vlasenko Sep 10 '18 at 23:01
  • I actually changed the line to `[np.percentile(c, p) for c in matrix[np.any(matrix > 0, axis=1)]].T` but it did not fix the memory issue. Probably it should be improved further. – Nikita Vlasenko Sep 10 '18 at 23:50
  • Yup, that's a typo. How are you measuring memory usage? – user2699 Sep 11 '18 at 00:28
  • I needed to also change it to `np.array([np.percentile(c, p) for c in matrix[np.any(matrix > 0, axis=1)]]).T` since a `list` does not have a `.T` attribute. But it is still not working: I am getting the memory error. – Nikita Vlasenko Sep 11 '18 at 00:30
  • I am working on the cluster. Already asked the question, so trying to follow what is in the answer, but so far no success: https://stackoverflow.com/questions/52229942/how-to-determine-at-which-point-in-python-script-step-memory-exceeded-in-slurm – Nikita Vlasenko Sep 11 '18 at 00:31
  • Pretty much I can not use `memory profiler` because there is a `step memory exceeded` error, so it can not profile past the exception. But I know for sure which line of code is doing that; I just need to change it so it does not use so much memory. – Nikita Vlasenko Sep 11 '18 at 00:32
  • are all elements of `matrix` >= 0? – Daniel F Sep 11 '18 at 06:06
  • Ultimately I gave up on using `PyGMNormalize` after reading another thread on stackoverflow: https://stackoverflow.com/questions/52242004/package-to-perform-tmm-normalization-in-python which says that the matrices it generates do not even match those from the widely used `R` packages. I guess the package needs further work/debugging. – Nikita Vlasenko Sep 12 '18 at 19:56

2 Answers

2

If all elements of `matrix` are >= 0, then you can do:

np.percentile(matrix[np.any(matrix, axis=1)], p, axis=0)

This uses the fact that any float or integer other than 0 is interpreted as `True` when viewed as a boolean (which `np.any` does internally), and saves you from building that big boolean matrix separately.
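A quick check of that equivalence on a toy array (illustrative only; it assumes non-negative entries, as above):

m = np.array([[0., 0.], [0., 2.], [3., 0.]])
np.array_equal(np.any(m, axis=1), np.any(m > 0, axis=1))  # same mask, no temporary `m > 0` array

True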

Since you're boolean indexing in `matrix[...]`, you're creating a temporary copy, so you don't really care if it gets overwritten during the percentile computation. That means you can use `overwrite_input=True` to save even more memory.

mat = matrix.copy()
perc = np.percentile(matrix[np.any(matrix, axis=1)], p, axis=0, overwrite_input=True)
np.array_equal(mat, matrix)  # is `matrix` still the same?

True

Finally, depending on the rest of your architecture, I'd recommend looking into making `matrix` some flavor of `scipy.sparse`, which should significantly reduce your memory usage again (although with some drawbacks depending on the type you use).
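For instance, here's a rough sketch of the sparse route (an illustration only, assuming `matrix` is non-negative and `p` is the percentile as above):

import numpy as np
import scipy.sparse as sp

sm = sp.csr_matrix(matrix)         # sparse copy; zeros take no storage
keep = np.diff(sm.indptr) > 0      # rows with at least one nonzero entry
sm = sm[keep].tocsc()              # drop all-zero rows; CSC makes column slicing cheap
# Densify one column at a time so only a single column is ever dense in memory:
perc = np.array([np.percentile(sm.getcol(j).toarray().ravel(), p)
                 for j in range(sm.shape[1])])

The per-column loop is slower than one vectorized `np.percentile` call, but peak memory drops from a full dense copy to a single dense column.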

Daniel F
  • Good answer, but I'm not sure why you're comparing `mat` and `matrix`. Even if they compare as true for some inputs it's not guaranteed when using `overwrite_input=True`. – user2699 Sep 11 '18 at 13:40
  • @user2699 can you show an example where it doesn't work? I can't find any, and the boolean-indexed slice of `matrix` should be a copy, not a view into the original `matrix`. So if it is changed, it shouldn't affect `matrix` itself. – Daniel F Sep 11 '18 at 16:45
  • You're right. Since it's operating on a copy the original won't change, I missed that. – user2699 Sep 11 '18 at 16:50
1

I'm putting this as an answer since there's more than will fit in a comment, although it may not be complete. There are two suspicious things. First, percentile should run fine on a 20 GB matrix if your machine has 200 GB of RAM available. That's a lot of memory, so start looking into what else might be using it. Start with `top`: is there another process, or is your Python program using all of that?

The second suspicious thing is that the documentation for `utils.percentile` doesn't match its actual behavior. Here are the relevant bits from the code you've linked to:

def percentile(matrix, p):
    """
    Estimation of percentile without zeros.
    ....
    Returns
    -------
    float
        Calculated percentile.
    """
    return np.percentile(matrix[np.any(matrix > 0, axis=1)], p, axis=0)

What it actually does is return the (column-wise) percentile calculated over the rows which are not all zeros. edit: More precisely, the rows which contain at least one positive element. If all values are non-negative that's the same thing, but in general the result will be very different.

`np.any(matrix > 0, axis=1)` returns a boolean array indexing the rows that contain at least one positive element. For example

>>> np.any(np.array([[3, 4], [0, 0]]) > 0, axis=1)
array([ True, False])

>>> np.any(np.array([[3, 4], [1, 0]]) > 0, axis=1)
array([ True,  True])

>>> np.any(np.array([[3, 0], [1, 0]]) > 0, axis=1)
array([ True,  True])

That array is used to index `matrix`, selecting only the rows that pass the mask and returning those. You should read over the numpy docs on boolean indexing if you aren't familiar with that way of indexing.
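For instance, with a small made-up array:

>>> m = np.array([[3, 4], [0, 0], [1, 0]])
>>> m[np.any(m > 0, axis=1)]
array([[3, 4],
       [1, 0]])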

Calculating that takes a lot of memory: `matrix > 0` creates a boolean array of the same shape as `matrix`, then the indexing creates a copy of `matrix` which probably contains most of the rows.
So, probably 2-4 GB for the boolean array and close to 20 GB for the copy.
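As a rough sanity check on those numbers (assuming float64 data, so 8 bytes per element versus 1 byte per element for a boolean array):

>>> matrix.nbytes / 1e9   # ~20 GB of float64, i.e. ~2.5e9 elements
20.0
>>> matrix.size / 1e9     # a same-shape boolean array needs 1 byte per element
2.5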

That can be reduced:

## Build the row mask one row at a time to reduce peak memory
mask = np.array([np.any(r > 0) for r in matrix])
## Take the percentile of each column, using only the masked rows
perc = [np.percentile(c[mask], p) for c in matrix.T]

However, as stated earlier that doesn't match the function documentation.

There may be a reason for this logic, but it is odd. If you don't know the reason for it, you might be fine calling `np.percentile` directly - just check that it returns a close value on a smaller subset of your data. There's also `np.nanpercentile`, which can be used the same way but ignores `nan` values.
You can use boolean indexing to replace the values you don't want included with `nan` (i.e. `matrix[matrix < 0] = np.nan`) and then call that, as sketched below.
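A minimal sketch of that last idea (note the copy, which briefly doubles memory; skip it if clobbering `matrix` is acceptable):

work = matrix.astype(np.float64, copy=True)  # nan requires a float dtype
work[work < 0] = np.nan                      # mask out the values to exclude
perc = np.nanpercentile(work, p, axis=0)     # per-column percentile, ignoring nan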

user2699
  • Still can't fix it. The error persists. I tried using `sacct --format="JobID,Elapsed,MaxVMSize"`, and it reports `205Gb` of memory for the job. – Nikita Vlasenko Sep 11 '18 at 03:05
  • Right, so this line probably isn't the one causing memory problems. It's just the line where things finally fail because some part of the process is hogging memory. Did you run `top` and what did it show? – user2699 Sep 11 '18 at 13:37
  • I did, but it is a SLURM cluster, and I don't think it showed anything relevant – Nikita Vlasenko Sep 11 '18 at 16:36