I have a table with roughly 4*10^8 records, and I want to draw a sample of exactly 4*10^6 records from it.
However, the way I want to draw the sample is somewhat unusual:
- Select 1 record from the 4*10^8 records uniformly at random (every record has the same probability of being selected).
- Repeat step 1 4*10^6 times (it does not matter if a record is selected multiple times).
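To make the two steps above concrete, here is a minimal in-memory sketch of sampling with replacement. It assumes records can be addressed by an ordinal from 1 to n; the function name `sample_with_replacement` and the tiny values of `n` and `k` are only illustrative, not part of my actual setup.

```python
import random

def sample_with_replacement(n, k, seed=None):
    """Draw k ordinals uniformly from 1..n; the same ordinal may appear more than once."""
    rng = random.Random(seed)
    # randint is inclusive on both ends, so every ordinal in 1..n is equally likely.
    return [rng.randint(1, n) for _ in range(k)]

# Tiny stand-in for n = 4*10^8 and k = 4*10^6.
draws = sample_with_replacement(n=400, k=4, seed=0)
print(draws)
```

The result always has exactly k entries, but possibly fewer than k distinct ordinals.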
I came up with a method to solve this:
- Generate a table A(num int) in which every record holds a single number, a random integer from 1 to n (n is the size of my original table, roughly 4*10^8 as mentioned above).
- Load table A as a resource file into every mapper, and if the ordinal number of the record currently being processed is in table A, output this record; otherwise discard it.
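The method above could be sketched roughly as follows, in plain Python rather than an actual MapReduce job. The names `build_table_a` and `map_records` are hypothetical, and the in-memory list stands in for the original table. One assumption goes beyond the description: a record is emitted once per occurrence of its ordinal in table A, so that the output has exactly as many rows as were drawn.

```python
import random
from collections import Counter

def build_table_a(n, k, seed=None):
    """Table A as a multiset: ordinal -> number of times it was drawn."""
    rng = random.Random(seed)
    return Counter(rng.randint(1, n) for _ in range(k))

def map_records(records, table_a, first_ordinal=1):
    """Mapper-like pass: emit each record as many times as its ordinal occurs in table A."""
    for ordinal, record in enumerate(records, start=first_ordinal):
        # Assumption: repeat the record per occurrence; ordinals not in table A are discarded.
        for _ in range(table_a.get(ordinal, 0)):
            yield record

# Tiny stand-in for the original table.
records = [f"row-{i}" for i in range(1, 21)]
table_a = build_table_a(n=len(records), k=5, seed=42)
sample = list(map_records(records, table_a))
print(sample)
```

In a real job each mapper only sees a slice of the table, so it would also need to know the ordinal of its first record (the hypothetical `first_ordinal` parameter here).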
I don't think my method is very good, because if I want to sample more records from the original table, table A will become very large and can no longer be loaded as a resource file.
So, could anyone please suggest an elegant algorithm?