4

I wrote C++ code to calculate 119 quantiles (from 10^-7 to 1 - 10^-7) of 100 million double-precision numbers. My current implementation stores the numbers in a vector and then sorts the vector. Is there any way to calculate the quantiles without storing the numbers?

Thank you

ADDENDUM (sorry for my English): Here is what I'm doing:

1) generate 20 uniformly distributed random numbers in [0, 1)

2) feed those numbers into an algorithm that outputs a random number with unknown mean and unknown variance

3) store the number from step 2

Repeat steps 1, 2 and 3 100 million times (at this point I have collected 10^8 random numbers with unknown mean and unknown variance).

Now I sort those numbers to calculate 119 quantiles from 10^-7 to 1 - 10^-7 using the formula "R-2, SAS-5": https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
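For reference, here is a minimal sketch of that estimator, assuming the Wikipedia definition of "R-2, SAS-5" (h = N*p + 1/2, result = average of the values at ranks ceil(h - 1/2) and floor(h + 1/2)). The function name `quantile_r2` is only illustrative, and the input vector is assumed to be already sorted:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Quantile estimate of type "R-2, SAS-5" (averaging at discontinuities).
// `sorted` must be non-empty and in ascending order; p must lie in [0, 1].
double quantile_r2(const std::vector<double>& sorted, double p) {
    assert(!sorted.empty() && p >= 0.0 && p <= 1.0);
    const double n  = static_cast<double>(sorted.size());
    const double np = n * p;
    // 1-based ranks from h = N*p + 1/2:
    // lower = ceil(h - 1/2) = ceil(N*p), upper = floor(h + 1/2) = floor(N*p + 1),
    // both clamped to the valid range [1, N].
    const double lower = std::max(1.0, std::min(n, std::ceil(np)));
    const double upper = std::max(1.0, std::min(n, std::floor(np + 1.0)));
    const double a = sorted[static_cast<std::size_t>(lower) - 1];
    const double b = sorted[static_cast<std::size_t>(upper) - 1];
    return 0.5 * (a + b);
}
```

When N*p is an integer the two ranks differ and the result is the mean of two neighbouring values; otherwise both ranks coincide and a single sample value is returned.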

Since the program is multi-threaded, the memory allocation is too big and I can only use 5 threads instead of 8.

Cristiano
  • 3
    But if you don't store the numbers how will you retrieve them later? What exactly are you trying to do? – RedX Dec 26 '15 at 13:40
  • 2
    There's a well known way to find the median of a distribution by using heaps. See if your specific problem is adaptable to something similar? – Carlos Dec 26 '15 at 13:44
  • @Carlos But you'd need to store the numbers in the heap then, no? – interjay Dec 26 '15 at 13:47
  • 1
    Do you mean "without storing" or "without sorting"? – arekolek Dec 26 '15 at 13:49
  • Yeah, you would. It's hard to see how this would be solvable without storing the numbers, as you're essentially looking at a histogram and deciding where to cut it. One thing I hadn't considered is whether this is an online algorithm or a one-off? – Carlos Dec 26 '15 at 13:49
  • 1
    @RedX: computing the min/max of a set can be done without storing the numbers. This question is about a generalization. –  Dec 26 '15 at 13:49
  • @YvesDaoust I'm aware of that. But the question still remains: if he does not store the calculated quantiles, how will he use them later? Plus, the second phrase is ambiguous. By "numbers", does he mean the quantiles or the input data set? – RedX Dec 26 '15 at 14:39
  • @Cristiano: what can you tell us about the statistical distribution of the numbers? –  Dec 26 '15 at 14:57
  • @RedX: little doubt that the issue is storing the 100 000 000 numbers, not the 119 quantiles. By the way, the OP uses "numbers" to denote the data set (vector) and "quantiles" for the desired results. There is no ambiguity. –  Dec 26 '15 at 15:00
  • @RedX: I don't need to retrieve those numbers later; I only need to calculate and store the 119 quantiles. – Cristiano Dec 26 '15 at 15:19
  • @arekolek: I mean "without storing". The problem is the big allocation of memory for 10^8 double precision numbers. – Cristiano Dec 26 '15 at 15:23

2 Answers

4

This is a problem from the field of streaming algorithms (where you need to operate on a stream of data without storing each element).

There are well-known streaming algorithms for computing quantiles (e.g., here), but if you are willing to accept approximate quantiles, it's a fairly easy problem. Simply use reservoir sampling to uniformly sample m out of n elements, and calculate the quantiles on the sample (with the method you already use: store the m samples in a vector and sort it). The size m determines the precision of the approximation (see, e.g., here).
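A minimal sketch of the reservoir sampling step, assuming the classic "Algorithm R" variant. The class name `ReservoirSampler`, its interface, and the choice of `std::mt19937_64` are illustrative assumptions, not anything prescribed by the references:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Keeps a uniform random sample of at most `capacity` values from a stream
// (Algorithm R). Name and interface are illustrative only.
class ReservoirSampler {
public:
    ReservoirSampler(std::size_t capacity, std::uint64_t seed)
        : capacity_(capacity), rng_(seed) {
        assert(capacity > 0);
        sample_.reserve(capacity);
    }

    // Offer the next stream value; memory use never exceeds `capacity` values.
    void add(double x) {
        ++seen_;
        if (sample_.size() < capacity_) {
            sample_.push_back(x);  // fill phase: keep everything
            return;
        }
        // Draw a slot in [0, seen) and replace only if it falls inside the
        // reservoir, so x is kept with probability capacity / seen.
        std::uniform_int_distribution<std::uint64_t> pick(0, seen_ - 1);
        const std::uint64_t j = pick(rng_);
        if (j < capacity_) sample_[static_cast<std::size_t>(j)] = x;
    }

    const std::vector<double>& sample() const { return sample_; }
    std::uint64_t seen() const { return seen_; }

private:
    std::size_t capacity_;
    std::uint64_t seen_ = 0;
    std::mt19937_64 rng_;
    std::vector<double> sample_;
};
```

After the stream ends, copy `sample()`, sort the copy, and compute the quantiles on it as before. Keep in mind that a sample of size m cannot resolve a tail quantile such as 10^-7 unless m is well above 10^7, so the extreme quantiles are where the approximation suffers most.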

Ami Tavory
  • I'm not sure I understand "ency.pdf": it seems to suggest storing a subsample of size m, but I generate 10^8 random numbers because I need the best possible estimate of the quantiles. I also tried the q-digest algorithm, but it also "compresses" the sample. Is there any simple procedure that uses *all* the 10^8 numbers? – Cristiano Dec 27 '15 at 12:12
  • @Cristiano I'll have a look at it a bit later. – Ami Tavory Dec 27 '15 at 13:08
2

You need to know the set of numbers before you can calculate the quantiles.

One option is to store the numbers; another is to write or use a multi-pass algorithm that narrows down the answer a little further on each pass over the data.
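As a rough sketch of what a multi-pass approach could look like (one possible scheme, not a reference implementation): pass 1 finds the range, pass 2 builds a histogram, and pass 3 collects only the values of the one bin that contains the wanted rank. It assumes the data can be replayed identically on each pass (for example by re-seeding the generator) and that there are no NaNs. The names `Replay`, `Sink` and `kth_smallest_multipass` are made up for this example:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// A "stream" here is anything that can replay the same values on demand.
using Sink   = std::function<void(double)>;
using Replay = std::function<void(const Sink&)>;

// Returns the k-th smallest value (0-based) using three passes and
// O(bins + size of one bin) memory instead of O(n).
double kth_smallest_multipass(const Replay& replay, std::size_t k,
                              std::size_t bins = 1024) {
    // Pass 1: range and count.
    double lo = std::numeric_limits<double>::infinity();
    double hi = -std::numeric_limits<double>::infinity();
    std::size_t n = 0;
    replay([&](double x) { lo = std::min(lo, x); hi = std::max(hi, x); ++n; });
    assert(k < n);
    if (lo == hi) return lo;  // all values are equal

    // Maps a value to its bin; the maximum is clamped into the last bin.
    auto bin_of = [&](double x) -> std::size_t {
        const std::size_t b =
            static_cast<std::size_t>((x - lo) / (hi - lo) * bins);
        return std::min(b, bins - 1);
    };

    // Pass 2: histogram of bin occupancies.
    std::vector<std::size_t> count(bins, 0);
    replay([&](double x) { ++count[bin_of(x)]; });

    // Locate the bin holding rank k and the number of values below that bin.
    std::size_t target = 0, below = 0;
    while (below + count[target] <= k) below += count[target++];

    // Pass 3: keep only the values of that bin, then select within it.
    std::vector<double> bucket;
    bucket.reserve(count[target]);
    replay([&](double x) { if (bin_of(x) == target) bucket.push_back(x); });
    const std::ptrdiff_t r = static_cast<std::ptrdiff_t>(k - below);
    std::nth_element(bucket.begin(), bucket.begin() + r, bucket.end());
    return bucket[static_cast<std::size_t>(r)];
}
```

The result is exact, but the cost is generating the data three times, and if most values fall into a single bin the bucket in pass 3 can still be large (the same idea could then be applied again inside that bin).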

There are also approximate one-pass algorithms for this problem, if some inaccuracy on the quantiles is acceptable. Here is an example: http://www.cs.umd.edu/~samir/498/manku.pdf

EDIT: I forgot to mention that if your numbers contain many duplicates, you only need to store each distinct value together with the number of times it appears, not every duplicate. Depending on the input data, this can make a significant difference.
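A minimal sketch of that idea, using a `std::map` from value to count. The class name `CountedValues` and its methods are hypothetical, and the values are assumed to contain no NaNs:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Stores each distinct value once, with its multiplicity, in sorted order.
class CountedValues {
public:
    void add(double x) { ++counts_[x]; ++total_; }

    // k-th smallest value (0-based) among all values added, duplicates included.
    double kth_smallest(std::uint64_t k) const {
        assert(k < total_);
        std::uint64_t seen = 0;
        for (const auto& entry : counts_) {
            seen += entry.second;           // values up to and including this one
            if (k < seen) return entry.first;
        }
        return counts_.rbegin()->first;     // not reached when k < total_
    }

    std::size_t distinct() const { return counts_.size(); }
    std::uint64_t total() const { return total_; }

private:
    std::map<double, std::uint64_t> counts_;
    std::uint64_t total_ = 0;
};
```

This only pays off when the number of distinct values is much smaller than the number of samples: each map node costs noticeably more memory than a bare double, so for continuous data with almost no repeats it would use more memory than the plain vector.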

Koebmand STO