Split data by row in R in quantiles

Question

I have a data.frame called fd with 406 rows and 48 columns. For each row in fd I want to compute ntiles (sixtiles). I do this the following way:

quant <- apply(fd, 1,  function(x) quantile(t(x), probs = c(1/6, 2/6, 0.5, 4/6, 5/6), na.rm = TRUE ))

What I now want to do, is split my original data into 6 new dataframes, i.e. fd1 to fd6, where in fd1 I have all the observations of the first sixtile, in fd2 I have all the observations of the second sixtile and so on. Again, I want to do this rowwise. Meaning, I want my algorithm/function to look at each row of fd and do the following:

Take all the observations of the first sixtile, in the first row of fd, and store them into the first row of fd1, then take the first sixtile of the second row of fd and store them in the second row of fd1.

Important to note: I do not have observations for each row and column, so in some I have missing data (NA)

Could anybody give hints on how I can achieve this? Thanks in advance.

I think this depends a bit on how you want your quantile bins to work and how you handle ties. As a workaround to using `quantile`, consider using `sort` and `cut` to get 8 observations in each sixtile. — Michael Roswell, Nov 12 '19 at 21:31
I do have missing data (NA), so simply cutting the data unfortunately doesn't work, because I do not have 48 observations in each row. What to you mean by "How I want to handly my ties?" — Friederike, Nov 12 '19 at 21:38
It seems to me that your `probs` are inaccurate. Shouldn't it be `probs = c(1/6, 2/6, 3/6, 4/6, 5/6)` instead? With `2/3` you are repeating `4/6` twice! — Vitali Avagyan, Nov 12 '19 at 22:56
What would you want your output to look like for a row that was e.g. `c(rep(1,9), rep(4,23), rep(12, 16))`? — Michael Roswell, Nov 13 '19 at 03:07

score 1 · Answer 1 · answered Nov 12 '19 at 22:46

Old school solution using matrix, list and nested loops.

# some artifical data with missings
set.seed(123)
fd <- data.frame(matrix(rnorm(406*48), nrow = 406, ncol = 48))
diag(fd) <- NA

# quant
quant <- apply(fd, 1,  function(x)
  quantile(t(x), probs = (0:6)/6, na.rm = TRUE, type = 6)
)

#matrix with selection
res <- list()
for (i in 1:6) {
  mm <- matrix(NA, nrow = nrow(fd), ncol = ncol(fd)/6)
  for (j in 1:nrow(fd)) {
    lwr <- (quant[(i),j] < fd[j,])
    upr <- (fd[j,] <= quant[(i+1),j])
    if (i == 1)
      z_j <- fd[j,][ upr ]
    else
      z_j <- fd[j,][ lwr & upr ]

    z_j <- z_j[!is.na(z_j)]
    mm[j,1:length(z_j)] <- sort(z_j)
  }
  res[[i]] <- mm
}
rm(i, mm, j, lwr, upr)

fd1 <- res[[1]]

Another speedy approach might be ```ii <- t(sapply(seq(length=nrow(fd)), function(x) findInterval(fd[x,], quant[,x], left.open = TRUE))) ``` — Jan Marvin, Nov 13 '19 at 08:00

score 0 · Answer 2 · answered Nov 13 '19 at 00:40

Here is a relatively shorter way of achieving this using purrr and dplyr packages:

library(dplyr)
library(purrr)

# some random example
df <- data.frame(matrix(runif(48),405,48))
df[3,5] <- NA
df[10,25:26] <- NA

quant <- apply(df, 1,  function(x) aa <- quantile(t(x), probs = c(1/6, 2/6, 3/6, 4/6, 5/6), na.rm = TRUE ))

aa <- as.data.frame(t(df))
fd1 <- map2(quant[1,],aa,function(x,y) y[y <= x] %>% .[!is.na(.)]) %>% 
  do.call(rbind,.)%>% as.data.frame(.)
fd2 <- pmap(list(quant[1,],quant[2,],aa),function(x,y,z) z[z > x & z <= y] %>% .[!is.na(.)]) %>% 
  do.call(rbind,.) %>% as.data.frame(.)
fd3 <- pmap(list(quant[2,],quant[3,],aa),function(x,y,z) z[z > x & z <= y] %>% .[!is.na(.)]) %>% 
  do.call(rbind,.) %>% as.data.frame(.)
fd4 <- pmap(list(quant[3,],quant[4,],aa),function(x,y,z) z[z > x & z <= y] %>% .[!is.na(.)]) %>% 
  do.call(rbind,.) %>% as.data.frame(.)
fd5 <- pmap(list(quant[4,],quant[5,],aa),function(x,y,z) z[z > x & z <= y] %>% .[!is.na(.)]) %>% 
  do.call(rbind,.) %>% as.data.frame(.)
fd6 <- map2(quant[5,],aa,function(x,y) y[y > x & y <= max(y)] %>% .[!is.na(.)]) %>% 
  do.call(rbind,.) %>% as.data.frame(.)

NB: There are some duplicate values in the final fd1 - fd6 data frames (which is not, by the way, the best format to store values for this type of problem) but you can always filter them out by using for example unique.

Hope this helps. Any modification to the answer is welcomed.

Thanks! How would you recommend I store the values? I am only familiar with data.frames. — Friederike, Nov 13 '19 at 01:35
Well, it depends on what you are going to do with the final output. Perhaps you can leave them in a list. — Vitali Avagyan, Nov 13 '19 at 01:44
See `purrr::map_dfr` for a way to store stuff in dataframes, and consider a more general approach that can be applied (perhaps using `map_dfr`) over the elements of `quantile` instead of re-writing the basically same function each time — Michael Roswell, Nov 13 '19 at 03:12
This answer might also be helpful: https://stackoverflow.com/a/35454978/8400969 — Michael Roswell, Nov 13 '19 at 03:13

Split data by row in R in quantiles

2 Answers2