
I'm developing an R package which requires me to report percentile ranks for each of the returned values. However, the distribution I have is huge (~10 million values).

I currently generate an ecdf function, save it to a file, and read it back in the package when needed. This is problematic because the saved file ends up being huge (~120 MB) and takes too long to load:

f = ecdf(rnorm(10000000))
save(f, file='tmp.Rsav')

Is there any way to make this more efficient, perhaps by approximating the percentile rank in R?

Thanks

Omar Wagih

1 Answer


Just do an ecdf on a downsampled distribution:

> items <- 100000
> downsample <- 100 # downsample by a factor of 100
> data <- rnorm(items)
> data.down <- sort(data)[(1:(items / downsample)) * downsample] # pick every 100th
> round(ecdf(data.down)(-5:5), 2)
 [1] 0.00 0.00 0.00 0.02 0.16 0.50 0.84 0.98 1.00 1.00 1.00
> round(ecdf(data)(-5:5), 2)
 [1] 0.00 0.00 0.00 0.02 0.16 0.50 0.84 0.98 1.00 1.00 1.00

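Note that you probably want to think about the downsampling a little, as the example here returns slightly biased answers: it always keeps the last value of each block of 100, so the approximation can never exceed the full ecdf. The general strategy should still work.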
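To tie this back to the file-size problem in the question, here is a minimal sketch of the full round trip: build the ecdf on the downsampled values, write it to disk, read it back, and compare it against the full ecdf. The seed, the query points in `x`, the temporary file, and the use of `saveRDS()`/`readRDS()` instead of the `save()` call from the question are all assumptions made for illustration.

```r
set.seed(42)                       # arbitrary seed, only for reproducibility
items <- 100000
downsample <- 100                  # downsample by a factor of 100
data <- rnorm(items)

# Same selection as in the transcript above: every 100th sorted value
data.down <- sort(data)[(1:(items / downsample)) * downsample]

# Store only the small ecdf (saveRDS/readRDS is an assumed substitute for save/load)
f.down <- ecdf(data.down)
path <- tempfile(fileext = ".rds") # hypothetical location
saveRDS(f.down, path)
file.size(path)                    # size in bytes of the stored approximation

# Later, e.g. inside the package: read it back and query it
f.loaded <- readRDS(path)
x <- c(-2, -1, 0, 1, 2)            # hypothetical query points
approx.rank <- f.loaded(x)
exact.rank <- ecdf(data)(x)

# With this selection the gap stays below downsample / items (0.001 here)
max(abs(approx.rank - exact.rank))
```

The stored object holds 1,000 values instead of 100,000, so the file and the load time shrink roughly in proportion to `downsample`, at the cost of a resolution of `downsample / items` in the returned ranks.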

BrodieG
  • 1
    This is a great solution. I'm thinking maybe generate 1000 samples and pick the sample which gives the least sum squared difference between the approximated and the actual? – Omar Wagih Dec 31 '13 at 21:13
  • 1
    Assuming your primary distribution is fixed, you don't really need to pick multiple samples. The less you downsample, the closer you will be to your original distribution, but the key here is that you are sampling from the sorted distribution, so every time you sample you should get the same thing. The key thing you need to decide is how much precision you need. That will be a function of your downsample size. – BrodieG Jan 01 '14 at 01:11
  • 1
    Also, by thinking about the sampling, I really meant you should make sure it's unbiased. For example, `data.down <- sort(data)[(downsample / 2) + (0:(items / downsample - 1L)) * downsample]` would work better (assuming `downsample` is even). – BrodieG Jan 01 '14 at 01:21
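The two selection schemes discussed in this thread can be compared directly. The following is a rough sketch, not a definitive benchmark: the seed and the evaluation grid `grid` are assumptions, and `downsample` is assumed to be even.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
items <- 100000
downsample <- 100                  # assumed even
sorted <- sort(rnorm(items))

# Scheme from the answer: last value of each block of 100
idx.end <- (1:(items / downsample)) * downsample
# Scheme from the comment: middle value of each block of 100
idx.mid <- (downsample / 2) + (0:(items / downsample - 1L)) * downsample

full <- ecdf(sorted)
grid <- seq(-4, 4, by = 0.01)      # hypothetical evaluation grid

# Signed error of each approximation relative to the full ecdf
err.end <- ecdf(sorted[idx.end])(grid) - full(grid)
err.mid <- ecdf(sorted[idx.mid])(grid) - full(grid)

range(err.end)                     # never above zero: one-sided error
range(err.mid)                     # within half a step of zero on either side
max(abs(err.end))                  # below downsample / items
max(abs(err.mid))                  # at most half of downsample / items
```

Picking the middle of each block turns the approximation from "always round down" into "round to nearest", which halves the worst-case error for the same number of stored values.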