
I have data whose mean and variance change as a function of the independent variable. How do I convert the dependent variable into (estimated) conditional percentage ranks?

For example, say the data looks like Z below:

library(dplyr)
library(ggplot2)

data.frame(x = runif(1000, 0, 5)) %>%
  mutate(y = sin(x) + rnorm(n())*cos(x)/3) ->
  Z

We can plot it with Z %>% ggplot(aes(x,y)) + geom_point(): it looks like a noisy sine curve, where the spread around the curve varies with x. My goal is to convert each y value into a number between 0 and 1 that represents its percentage rank among values with similar x. So values very close to the sine curve should be converted to about 0.5, while values below it should be converted to values closer to 0 (how close depends on the variance around that x).
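
Because this example is simulated, the exact conditional rank is known and can serve as a reference when judging any estimate. The sketch below assumes only the generating formula given above: conditional on x, y is normal with mean sin(x) and standard deviation |cos(x)|/3. The seed and the name `true_rank` are additions for illustration and are not part of the original setup.

```r
set.seed(1)  # seed added only so the sketch is reproducible

n <- 1000
x <- runif(n, 0, 5)
y <- sin(x) + rnorm(n) * cos(x) / 3

# Conditional on x, y ~ Normal(mean = sin(x), sd = |cos(x)| / 3),
# so the exact conditional rank is the normal CDF evaluated at y.
true_rank <- pnorm(y, mean = sin(x), sd = abs(cos(x)) / 3)

summary(true_rank)  # should look roughly uniform on [0, 1]
```

With real data this reference is of course unavailable, which is why the rank has to be estimated.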

One quick way to do this is to bucket the data and then simply compute the rank of each observation in each bucket.
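
A minimal sketch of that bucketing approach, for concreteness. The bucket width of 0.25, the seed, and the column names `bucket` and `bucket_rank` are arbitrary choices made here, not anything prescribed by the problem.

```r
library(dplyr)

set.seed(1)  # seed added only so the sketch is reproducible
Z <- data.frame(x = runif(1000, 0, 5)) %>%
  mutate(y = sin(x) + rnorm(n()) * cos(x) / 3)

Z_bucketed <- Z %>%
  # Assign each observation to a fixed-width bucket of x (width is an assumption)
  mutate(bucket = cut(x, breaks = seq(0, 5, by = 0.25), include.lowest = TRUE)) %>%
  group_by(bucket) %>%
  # Rank y within its bucket and rescale to the open interval (0, 1)
  mutate(bucket_rank = (rank(y) - 0.5) / n()) %>%
  ungroup()

head(Z_bucketed)
```

The result depends on where the bucket edges fall, and within a bucket the changing mean of y is ignored, which is the main drawback of this approach.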

Another way (which I think is preferable) to do what I ask is to perform a quantile regression for a number of different quantiles (tau):

library(quantreg)
library(splines)

model.fit <- rq(y ~ bs(x, df = 5), tau = (1:9)/10, data = Z)

which can be plotted as follows:

library(tidyr)

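# Predict every fitted quantile curve on an evenly spaced grid of x values
data.frame(x = seq(0, 5, length.out = 100)) %>%
  data.frame(., predict(model.fit, newdata = .), check.names = FALSE) %>%
  # Reshape to long format: one row per (x, Tau) pair
  gather(Tau, y, -x) %>%
  ggplot(aes(x, y)) +
  geom_point(data = Z, size = 0.1) +
  geom_line(aes(color = Tau), size = 1)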

Given model.fit I could now use the estimated quantiles for each x value to convert each y value into a percentage rank (with the help of approx(...)) but I suspect that package quantreg may do this more easily and better. Is there, in fact, some function in quantreg which automates this?
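
For reference, the manual approx(...) route could be sketched roughly as below. This is only one way to do it, not a built-in quantreg feature: the finer tau grid (1:99)/100, the seed, the helper name `rank_one`, and the choice to sort each row of predicted quantiles (to undo any crossing of the fitted curves) are all assumptions made for illustration.

```r
library(quantreg)
library(splines)

set.seed(1)  # seed added only so the sketch is reproducible
Z <- data.frame(x = runif(1000, 0, 5))
Z$y <- sin(Z$x) + rnorm(nrow(Z)) * cos(Z$x) / 3

# Fit many quantile curves; a finer tau grid gives a finer rank resolution
taus <- (1:99) / 100
fit <- rq(y ~ bs(x, df = 5), tau = taus, data = Z)

# Matrix of predicted quantiles: one row per observation, one column per tau
Q <- predict(fit, newdata = Z)

# Hypothetical helper: invert the estimated quantile function at one x
rank_one <- function(q, y, taus) {
  q <- sort(q)  # enforce monotone quantiles in case fitted curves cross
  # rule = 2 clamps y values outside the fitted range to min(taus) or max(taus)
  approx(x = q, y = taus, xout = y, rule = 2, ties = "ordered")$y
}

Z$cond_rank <- vapply(
  seq_len(nrow(Z)),
  function(i) rank_one(Q[i, ], Z$y[i], taus),
  numeric(1)
)

summary(Z$cond_rank)
```

Ranks produced this way are limited to the range of the tau grid, and they become unreliable wherever the conditional variance is close to zero (here, near x = pi/2), because the fitted quantile curves nearly coincide there.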

banbh
  • for computing **percentile** ranks, see [this question](https://stackoverflow.com/questions/21219447/calculating-percentile-of-dataset-column); the base `quantile()` function; and [this dplyr vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html) on window functions. – lefft Nov 30 '17 at 17:28
  • @lefft: I agree. In fact, what you mention is exactly what I was suggesting when I said "One quick way to do this is to bucket the data and then simply compute the rank of each observation in each bucket". The problem with this approach is that it forces one to bucket the data. The `quantreg` approach is (IMO) preferable since it allows one to use, say, splines rather than discrete buckets. – banbh Nov 30 '17 at 17:52

0 Answers