2

I'd like to generate a distribution in R given the following score and percentile ranks.

x <- 1:10
PercRank <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)

PercRank = 1 for example tells that 1% of the data has a value/score <= 1 (the first value of x). Similarly, PercRank = 7 tells that 7% of the data has a value/score <= 2 etc..

I am not aware of how one could find the underlying distribution. I'd be glad if I could get some guidance on how to go about obtaining the pdf of the underlying distribution from just this much information.

Rudi Visser
  • 21,350
  • 5
  • 71
  • 97
maycobra
  • 417
  • 7
  • 15
  • 7
    [What have you tried?](http://whathaveyoutried.com) –  Jan 27 '13 at 12:43
  • @Arun: The answer you provided is clearly directed to a different question than this. The values you provide do not have a domain of support in the range of 1:10. – IRTFM Jan 27 '13 at 16:44
  • 1
    @Arun: Looks more on point for the posted question. – IRTFM Jan 27 '13 at 21:12

3 Answers3

9

From Wikipedia:

The percentile rank of a score is the percentage of scores in its frequency distribution that are the same or lower than it.

In order to illustrate this, let's create a distribution, say, a normal distribution, with mean=2 and sd=2, so that we can test (our code) later.

# 1000 samples from normal(2,2)
x1 <- rnorm(1000, mean=2, sd=2)

Now, let's take the same percentile rank you've mentioned in your post. Let's divide it by 100 so that they represent cumulative probabilities.

cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)/100

And what are the values (scores) corresponding to these percentiles?

# generating values similar to your x.
x <- c(t(quantile(x1, cum.p)))
> x
 [1] -2.1870396 -1.4707273 -1.1535935 -0.8265444 -0.2888791  
         0.2781699  0.5893503  0.8396868  1.4222489  2.1519328

This means that 1% of the data is lesser than -2.18. 7% of the data is lesser than -1.47 etc... Now, we have the x and cum.p (equivalent to your PercRank). Let's forget x1 and the fact that this should be a normal distribution. To find out what distribution it could be, let's get actual probabilities from the cumulative probabilities by using diff that takes the difference between nth and (n-1)th element.

prob <- c( cum.p[1], diff(cum.p), .01)
> prob
# [1] 0.01 0.06 0.05 0.11 0.18 0.21 0.11 0.07 0.12 0.07 0.01

Now, all we have to do is is to generate samples of size, say, 100 (could be any number), for each interval of x (x[1]:x[2], x[2]:x[3] ...) and then finally sample from this huge data as many number of points as you need (say, 10000), with probabilities mentioned above.

This can be done by:

freq <- 10000 # final output size that we want

# Extreme values beyond x (to sample)
init <- -(abs(min(x)) + 5) 
fin  <- abs(max(x)) + 5

ival <- c(init, x, fin) # generate the sequence to take pairs from
len <- 100 # sequence of each pair

s <- sapply(2:length(ival), function(i) {
    seq(ival[i-1], ival[i], length.out=len)
})
# sample from s, total of 10000 values with probabilities calculated above
out <- sample(s, freq, prob=rep(prob, each=len), replace = T)

Now, we have 10000 samples from the distribution. Let's look at how it is. It should resemble a normal distribution with mean = 2 and sd = 2.

> hist(out)

normal_dist

> c(mean(out), sd(out))
# [1] 1.954834 2.170683

It is a normal distribution (from the histogram) with mean = 1.95 and sd = 2.17 (~ 2).

Note: Some things what I've explained may have been roundabout and/or the code "may/may not" work with some other distributions. The point of this post was just to explain the concept with a simple example.

Edit: In an attempt to clarify @Dwin's point, I tried the same code with x = 1:10 corresponding to OP's question, with the same code by replacing the value of x.

cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)/100
prob <- c( cum.p[1], diff(cum.p), .01)
x <- 1:10

freq <- 10000 # final output size that we want

# Extreme values beyond x (to sample)
init <- -(abs(min(x)) + 1) 
fin  <- abs(max(x)) + 1

ival <- c(init, x, fin) # generate the sequence to take pairs from
len <- 100 # sequence of each pair

s <- sapply(2:length(ival), function(i) {
    seq(ival[i-1], ival[i], length.out=len)
})
# sample from s, total of 10000 values with probabilities calculated above
out <- sample(s, freq, prob=rep(prob, each=len), replace = T)

> quantile(out, cum.p) # ~ => x = 1:10
# 1%     7%    12%    23%    41%    62%    73%    80%    92%    99% 
# 0.878  1.989  2.989  4.020  5.010  6.030  7.030  8.020  9.050 10.010 

> hist(out)

hist_OPs_data

Arun
  • 116,683
  • 26
  • 284
  • 387
  • @Arun, is it possible that the minus is wrong in `init <- -(abs(min(x)) + 5)`. For a sample with all positive value, it doesn't seem to work. Couldn't it just be `init <- min(x) - 5`? – Stefan Oct 16 '18 at 08:24
1

i think you want the ecdf function, which is mentioned as the inverse of the quantile function on the ?quantile help page..

# construct your vector containing the data
PercRank <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)

# construct an empirical cumulative distribution function
# which is really just the `inverse` of `quantile
Fn <- ( ecdf( PercRank ) )
# note that the `ecdf` function returns a function itself.

# calculate what percent of `PercRank` is below these integers..
Fn( 0 )
Fn( 1 )
Fn( 2 )
Fn( 3 )
Fn( 6 )
Fn( 7 )
Fn( 8 )


# re-construct your `x` vector using PercRank
Fn( PercRank ) * 10
Anthony Damico
  • 5,779
  • 7
  • 46
  • 77
-1

This give produces a dataset that would have the features that you specify. If you wanted more "randomness" you could subtract some random number within the range of the percentile span to rep result inside the anonymous function:

   > mapply( function(x,y) rep(x, each=y), (x),  diff(c(PercRank, 100) ) )
[[1]]
[1] 1 1 1 1 1 1

[[2]]
[1] 2 2 2 2 2

[[3]]
 [1] 3 3 3 3 3 3 3 3 3 3 3

[[4]]
 [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[[5]]
 [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

[[6]]
 [1] 6 6 6 6 6 6 6 6 6 6 6

[[7]]
[1] 7 7 7 7 7 7 7

[[8]]
 [1] 8 8 8 8 8 8 8 8 8 8 8 8

[[9]]
[1] 9 9 9 9 9 9 9

[[10]]
[1] 10
IRTFM
  • 258,963
  • 21
  • 364
  • 487