Given are an iterator it over data points, the number of data points n, and the maximum number of samples we want to use for some calculations (maxSamples).
Imagine a function calculateStatistics(Iterator it, int n, int maxSamples). This function should use the iterator to retrieve the data and do some (heavy) calculations on each data element retrieved.
- if n <= maxSamples, we will of course use every element we get from the iterator
- if n > maxSamples, we will have to choose which elements to look at and which to skip
I've been spending quite some time on this. The problem is of course how to choose when to skip an element and when to keep it. My approaches so far:
- I don't want to take the first maxSamples elements coming from the iterator, because the values might not be evenly distributed.
- Another idea was to use a random number generator to create maxSamples distinct random numbers between 0 and n and take the elements at these positions. But if e.g. n = 101 and maxSamples = 100, it gets more and more difficult to find a new distinct number not yet in the list, losing a lot of time just in the random number generation.
- My last idea was to do the opposite: generate n - maxSamples random numbers and exclude the data elements at these positions. But this also doesn't seem to be a very good solution.
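To make the second idea concrete, here is a minimal sketch of what I mean. It assumes the positions are 0-based indices in [0, n). The class and method names (SamplingSketch, pickPositions, selectSamples) are made up for illustration, and the heavy calculation itself is left out: the sketch only collects the elements that would be passed to it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical helper class, only meant to illustrate the "distinct random positions" idea.
class SamplingSketch {

    // Draws maxSamples distinct positions in [0, n) by retrying whenever a duplicate comes up.
    // Assumes maxSamples <= n, otherwise the loop would never terminate.
    static Set<Integer> pickPositions(int n, int maxSamples, Random rnd) {
        Set<Integer> positions = new HashSet<>();
        while (positions.size() < maxSamples) {
            // add() leaves the set unchanged for a duplicate, so the loop simply tries again
            positions.add(rnd.nextInt(n));
        }
        return positions;
    }

    // Walks the iterator once and keeps only the elements at the chosen positions.
    static <T> List<T> selectSamples(Iterator<T> it, int n, int maxSamples, Random rnd) {
        List<T> kept = new ArrayList<>();
        if (n <= maxSamples) {
            // Few enough elements: use every one of them
            while (it.hasNext()) {
                kept.add(it.next());
            }
            return kept;
        }
        Set<Integer> positions = pickPositions(n, maxSamples, rnd);
        int index = 0;
        while (it.hasNext()) {
            T element = it.next();
            if (positions.contains(index)) {
                kept.add(element); // this is where the heavy calculation would run
            }
            index++;
        }
        return kept;
    }
}
```

With n = 101 and maxSamples = 100, the loop in pickPositions keeps drawing until it happens to hit the last few free positions, which is the slowdown I mean.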
Do you have a good idea for this problem? Are there standard, well-known algorithms for this?