Given are an iterator it over data points, the number of data points n, and the maximum number of samples we want to use for some calculations (maxSamples).

Imagine a function calculateStatistics(Iterator it, int n, int maxSamples). This function should use the iterator to retrieve the data and do some (heavy) calculations on each element retrieved.

  • if n <= maxSamples we will of course use each element we get from the iterator
  • if n > maxSamples we will have to choose which elements to look at and which to skip

I've been spending quite some time on this. The problem is of course how to choose when to skip an element and when to keep it. My approaches so far:

  • I don't want to take the first maxSamples coming from the iterator, because the values might not be evenly distributed.
  • Another idea was to use a random number generator to create maxSamples distinct random numbers between 0 and n and take the elements at these positions. But if, e.g., n = 101 and maxSamples = 100, it gets more and more difficult to find a new distinct number not yet in the list, losing a lot of time just on random number generation
  • My last idea was to do the contrary: generate n - maxSamples random numbers and exclude the data elements at these positions. But this doesn't seem to be a very good solution either.

Do you have a good idea for this problem? Are there maybe standard known algorithms for this?

navige
  • No, maxSamples is just the limit on how many samples we want to look at to do the calculations – navige May 15 '13 at 09:09
  • Hmm, maybe I was not clear. You said you don't want to take the first maxSamples. So my question is: do you need to take the samples randomly (as you tried), or can you just skip some samples on a regular basis (for example `n=13`, `max=9`, so you skip the 3rd, 6th, 9th and 12th sample)? – Tony Morris May 15 '13 at 09:23
  • @TonyMorris sorry, I got you wrong. Skipping just some samples would also be possible, yes. But how to choose the skipping number? – navige May 15 '13 at 14:55
  • @ChrisCM I very much think this is a question for Stack Overflow, though of course not an easy one! None of the three solutions I proposed is acceptable (otherwise I would not ask), and I think I stated with each solution what its problem is. Yes, as you are stating, "any collection you can come up with is acceptable", but the question is how to come to that collection! – navige May 15 '13 at 15:03
  • If "any collection you can come up with" is the answer, why not the first few? Hence simple, otherwise you want true pseudo randomness, which is why I posted my answer. Despite of your lack of random access, this is the only way to do it. When you implement a "skipping" type of scenario, you are destined to end up with just "taking the rest" or "ignoring the rest" a lot of times which is no better than just taking the first bunch. Hence, iterating through, creating a temporary random access vector, and picking randomly from that is the only answer that remains, and hence, my posted answer. – MobA11y May 15 '13 at 15:23
  • Well "any" more in the sense of "random any". Which is why solution 1 is not acceptable. – navige May 15 '13 at 15:37

4 Answers

To provide some answer: a good way to collect a set of random elements, when the collection size is greater than the number of elements needed, is the following (in C++-ish pseudo code).

EDIT: you may need to iterate over the data and create the "someElements" vector first. If your elements are large, the vector can hold pointers to these elements to save space.

#include <cstdlib>
#include <vector>

template <typename T>
std::vector<T> randomCollectionFromVector(std::vector<T> someElements, int numElementsToGrab) {
    std::vector<T> resultVector;
    while (numElementsToGrab--) {
        int randPosition = rand() % someElements.size();
        resultVector.push_back(someElements[randPosition]);
        someElements.erase(someElements.begin() + randPosition); // removed, so it cannot be picked twice
    }
    return resultVector;
}

If you don't care about changing your vector of elements, you could also remove random elements from someElements directly, as you mentioned. The algorithm would look very similar, and it is conceptually the same idea; you just pass someElements by reference and manipulate it in place.
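A minimal sketch of that in-place variant (the helper name removeRandomElements is made up here for illustration): delete random positions until only the elements to keep remain.

#include <cstdlib>
#include <vector>

// In-place variant: delete random elements until numToKeep remain;
// whatever survives is the sample.
template <typename T>
void removeRandomElements(std::vector<T>& someElements, std::size_t numToKeep) {
    while (someElements.size() > numToKeep) {
        std::size_t randPosition = rand() % someElements.size();
        someElements.erase(someElements.begin() + randPosition);
    }
}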

Something worth noting: the quality of pseudo-random distributions, as far as how random they appear, grows with the number of random draws you use. So you may tend to get better results if you pick which method you use based on which method results in the use of more random numbers. Example: if you have 100 values and need 99, you should probably pick 99 values, as this will result in you using 99 pseudo-random numbers instead of just 1. Conversely, if you have 1000 values and need 99, you should probably prefer the version where you remove 901 values, because you use more numbers from the pseudo-random distribution. If what you want is a solid random distribution, this is a very simple optimization that will greatly increase the quality of the "fake randomness" you see. Alternatively, if performance matters more than distribution quality, you would take the other version, or even just grab the first 99 values.
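As a tiny illustration of that rule of thumb (a sketch only; it assumes the two helpers above, and the name sampleValues is made up): prefer whichever branch consumes more pseudo-random draws.

#include <vector>

// Dispatch on the rule of thumb: use the method that consumes more
// pseudo-random draws (assumes numNeeded <= allValues.size() and the
// helpers sketched above).
template <typename T>
std::vector<T> sampleValues(std::vector<T> allValues, std::size_t numNeeded) {
    if (numNeeded > allValues.size() - numNeeded) {
        // e.g. 100 values, 99 needed: 99 draws beat 1 draw
        return randomCollectionFromVector(allValues, (int)numNeeded);
    }
    // e.g. 1000 values, 99 needed: 901 draws beat 99 draws
    removeRandomElements(allValues, numNeeded);  // leaves numNeeded elements
    return allValues;
}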

MobA11y
  • Random access is unfortunately not possible. We just have an iterator. – navige May 15 '13 at 15:06
  • Yes, hence pseudo code. You don't have to actually access these values; you can just as easily have some temporary object, and simply treat this algorithm as a way to generate the table of values you want access to, which you then iterate over and grab. – MobA11y May 15 '13 at 15:16
  • Ok! But then I don't see how your answer differs from my suggestions above, to be honest (suggestion 2), except my answer is less pseudo-code and more descriptive? – navige May 15 '13 at 15:31
  • Btw, your discussion in the last paragraph is what I also thought of (see again suggestions 2 and 3). Contrary to what you write, however - and that's the whole point of my question - I don't think it is a good idea, e.g. in the case of 100 values needing 99, to pick 99 values separately (which you would store and then just pick that element if it comes with the iterator, right?); I'd rather go with the solution of picking 1 element to throw away. Reason: having chosen the 98th element, the probability for the 99th is 1% - how many times do you have to try to get an element not already chosen? – navige May 15 '13 at 15:32
  • Two answers to the random number question: A: a random number that avoids already chosen values isn't random... B: that's why you take the temporary vector of pointers, and remove them as you go. It just depends on which behavior you want. Note that as you remove values from someElements, you mod by its size, which is shrinking. This takes care of your "grabbing" duplicates and "difficult random number" generation. – MobA11y May 15 '13 at 15:34
  • Answer A really isn't valid, because it misses the point of the whole discussion. Answer B, however, is fairly valid! Good idea there! – navige May 15 '13 at 15:36
  • Yeah, just remember to make that temporary structure an array of pointers/references, because if your array is large, that's double the memory you need, unless you use pointers. – MobA11y May 15 '13 at 15:38

interval = n/(n-maxSamples) // integer (Euclidean) division, of course
offset = random(0..(n-1))   // a random number between 0 and n-1
totalSkip = 0
indexSample = 0
FOR it IN samples DO
    indexSample++ // goes from 1 to n
    IF totalSkip < (n-maxSamples) AND (indexSample+offset) % interval == 0 THEN
        // do nothing with this sample
        totalSkip++
    ELSE
        // work with this sample
    ENDIF
ENDFOR
ASSERT(totalSkip == n-maxSamples) // to be sure

interval represents the distance between two samples to skip. offset is not mandatory, but it adds a little variety from one run to the next.
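For concreteness, here is a compilable C++ sketch of the same idea (the function name sampleByInterval and its signature are mine, not the answer's); it assumes n > maxSamples, as in the question's second case.

#include <cstdlib>

// One pass over [begin, end): skip positions where (index + offset) is a
// multiple of interval, until n - maxSamples elements have been dropped.
template <typename Iter, typename Fn>
void sampleByInterval(Iter begin, Iter end, std::size_t n,
                      std::size_t maxSamples, Fn process) {
    std::size_t toSkip = n - maxSamples;   // requires n > maxSamples
    std::size_t interval = n / toSkip;     // Euclidean division
    std::size_t offset = rand() % n;       // run-to-run variety
    std::size_t indexSample = 0, totalSkip = 0;
    for (Iter it = begin; it != end; ++it) {
        ++indexSample;                     // goes from 1 to n
        if (totalSkip < toSkip && (indexSample + offset) % interval == 0) {
            ++totalSkip;                   // do nothing with this sample
        } else {
            process(*it);                  // work with this sample
        }
    }
}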

Tony Morris
  • Collecting values at the same pre-determined interval is no better than taking the first bunch. – MobA11y May 15 '13 at 15:29
  • Well, I'd say it's fine to go with a pre-determined interval. OK, it is not truly randomly distributed, but I'd say you'd be better off than with taking just the first bunch (knowing the application we are implementing ;-)). – navige May 15 '13 at 15:39
  • @ChrisCM I know that collecting values at the same pre-determined interval seems similar to taking the first bunch, but as the comments under the question (and also the comment of the OP on this answer) suggest, it is more acceptable regarding the wishes of the OP. – Tony Morris May 15 '13 at 15:57

Based on the discussion, and a greater understanding of your problem, I suggest the following. You can take advantage of a property of prime numbers that I think will net you a very good solution, one that appears to grab pseudo-random numbers. It is illustrated in the following code.

#include <iostream>
using namespace std;

int main() {
    const int SOME_LARGE_PRIME = 577;  // must be larger than the size of your data set
    const int NUM_ELEMENTS = 100;
    int lastValue = 0;
    for (int i = 0; i < NUM_ELEMENTS; i++) {
        lastValue += SOME_LARGE_PRIME;
        // prints each value in 0..NUM_ELEMENTS-1 exactly once, in scrambled order
        cout << lastValue % NUM_ELEMENTS << endl;
    }
    return 0;
}

Using the logic presented here, you can create a table of all values from 0 to NUM_ELEMENTS - 1. Because the prime is larger than the size of your data set (and therefore shares no factor with it), you will not get any duplicates until you rotate all the way around back to the size of your data set. If you then take the first NUM_SAMPLES of these and sort them, you can iterate through your data structure and grab a pseudo-random distribution of numbers (not very good randomness, but more random than a pre-determined interval), without extra space and with only one pass over your data. Better yet, you can change the layout of the distribution by grabbing a random prime number each time; again, it must be larger than your data set, or the following example breaks.

PRIME = 3, data set size = 99: won't work, because 3 divides 99 and the sequence repeats after only 33 values.

Of course, ultimately this is very similar to the pre-determined interval, but it inserts a level of randomness that you do not get by simply grabbing every (size/num_samples)-th element.
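A minimal sketch of the pipeline described above (the helper name primeStepIndices is made up here): generate the first numSamples indices by prime stepping, sort them, and a single forward pass over the iterator can then pick out exactly those positions.

#include <algorithm>
#include <vector>

// Build numSamples distinct pseudo-random indices in [0, n) by prime
// stepping; prime must be larger than n (hence coprime with it).
std::vector<std::size_t> primeStepIndices(std::size_t n, std::size_t numSamples,
                                          std::size_t prime) {
    std::vector<std::size_t> indices;
    std::size_t value = 0;
    for (std::size_t i = 0; i < numSamples; ++i) {
        value = (value + prime) % n;  // no repeats until all n residues are visited
        indices.push_back(value);
    }
    std::sort(indices.begin(), indices.end());  // sorted, so one forward pass suffices
    return indices;
}

While walking the iterator you keep a position counter and process an element exactly when the counter matches the next entry in the returned list.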

MobA11y

This is called reservoir sampling.
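This is indeed the standard technique for the problem: one pass over the iterator, O(maxSamples) memory, and every element ends up in the sample with equal probability. A minimal C++ sketch of the classic Algorithm R (the function name reservoirSample is chosen here for illustration):

#include <iterator>
#include <random>
#include <vector>

// Algorithm R: uniform random sample of k items from a single pass
// over an iterator range, using O(k) memory.
template <typename Iter>
std::vector<typename std::iterator_traits<Iter>::value_type>
reservoirSample(Iter it, Iter end, std::size_t k, std::mt19937& rng) {
    std::vector<typename std::iterator_traits<Iter>::value_type> reservoir;
    std::size_t seen = 0;  // number of items consumed so far
    for (; it != end; ++it, ++seen) {
        if (reservoir.size() < k) {
            reservoir.push_back(*it);  // fill the reservoir first
        } else {
            // keep the new item with probability k / (seen + 1)
            std::uniform_int_distribution<std::size_t> pick(0, seen);
            std::size_t j = pick(rng);
            if (j < k) reservoir[j] = *it;
        }
    }
    return reservoir;
}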

Denis Korzhenkov