
I have a data set with 60 rows and 3036 columns. I have already calculated the row quantiles with the rowQuantiles function from the matrixStats package, which gave me a [60,1] column vector. Now I want to select from each row only the data that is higher than that row's quantile. If I use the which function as follows:

dataset_qu95 = which(dataset > rowQuantiles(dataset, probs=c(0.95)))

then I lose the data dimensions and get only a one-dimensional array instead of a matrix with dimensions [60,152].
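
To illustrate (a minimal sketch, reusing `dataset` from above - the comparison itself keeps the matrix shape, it is which() that flattens it):

library(matrixStats)

# comparing the matrix against the row quantiles gives a logical 60 x 3036 matrix;
# c() drops the [60,1] dimensions so the length-60 vector recycles row by row
keep <- dataset > c(rowQuantiles(dataset, probs=0.95))

which(keep)                  # flat vector of linear indices - the dimensions are lost
which(keep, arr.ind=TRUE)    # two-column matrix of row/column indices instead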

Can somebody help me?

Thank you!

user2882752
  • Your suggestion only shows me another way to calculate the 95% quantile. My aim is to afterwards select only the data in each row that lies above the row quantile. – user2882752 Nov 19 '13 at 10:30

2 Answers


I don't think a rowQuantile function is needed. Just pick out the highest values up to a probability threshold (edit note: the first version had an incorrect index expression):

> apply( dat, 1, function(x) x[order(x)][1:( (1-0.95)*ncol(dat))])
    obs1     obs2     obs3 
 11.5379 856.3470 136.8860 

And as always, because R builds matrices column-wise, apply() over rows returns its results as columns, so you will probably want to use t() on the result to get it back into the row orientation you expect.

To your comment: Fixed it so it picks up the highest values rather than the lowest values:

 apply( dat, 1, function(x)
                  x[order(x, decreasing=TRUE)][1:( (1-0.95)*ncol(dat))])
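
For instance, a minimal sketch of that last step combined with t(), assuming dat is the full 60 x 3036 matrix:

# number of values to keep per row; ceiling() guards against results like 151.8
k <- ceiling((1 - 0.95) * ncol(dat))

# top k values of each row; apply() returns them as columns, so transpose back
top95 <- t(apply(dat, 1, function(x) x[order(x, decreasing=TRUE)][1:k]))
dim(top95)   # 60 x k
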
IRTFM

0.05 * 3036 = 151.8, but selecting in each row the values greater than the 95% quantile does not mean you will systematically get 152 values. If you want to keep your object's dimensions, you can replace the undesired values with NA's (see the sketch after the code below).
As your object is not huge, you could also work with a data frame and keep your observations along the row dimension.

library(matrixStats)

# Simulated data with the same dimensions as yours
x <- matrix(sample(1:100, 60*3036, replace=TRUE), ncol=3036)

# To extract your values: for row k, keep the entries above that row's quantile q
myfun <- function(k, q){ x[k, x[k,] > q] }
xx <- mapply(myfun, seq_len(nrow(x)), rowQuantiles(x, probs=.95), SIMPLIFY=FALSE)
# xx is a list; xx[[1]] contains the values of x[1,] above quantile(x[1,], .95)

# The number of selected values depends on the data's distribution - with normally distributed data it should be fairly stable
x11() ; par(mfrow=c(2,1))
hist(sample(1:100, 60*3036, replace=TRUE)) # UNIF DISTRIB
n.val <- sapply(xx, length)
hist(n.val, xlab="n.val > q_95%")
abline(v=152, col="red", lwd=5)

# Assuming you want the same number of values for each row
n <- min(n.val)
myfun <- function(x){sample(x, n)} # Representative sample - ordering is possible but introduces bias; depends on your goals
xx <- t(sapply(xx, myfun))
dim(xx) # 60 n
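
And a minimal sketch of the NA-replacement route mentioned at the top, reusing x from above (q95 and x_na are just illustrative names):

q95 <- c(rowQuantiles(x, probs=.95))   # one quantile per row, as a plain length-60 vector

x_na <- x
x_na[x <= q95] <- NA   # the length-60 vector recycles along columns, i.e. row by row
dim(x_na)              # still 60 x 3036; values at or below the row quantile are NA
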
SESman
  • A row-oriented function that picks based on an ordering could have exactly 152 items. If the values are selected from a continuous distribution there would be a very small chance of ties, anyway. If dealing with a distribution having ties in the upper tail, using the `order` function allows you to break the ties in a sensible manner. – IRTFM Nov 19 '13 at 16:36