I am looking for an algorithm that fairly samples p percent of users from an infinite list of users.
A naive algorithm looks something like this:
//This is naive.. what is a better way??
def userIdToRandomNumber(userId: Int): Float = userId.toString.hashCode % 1000)/1000.0
//An event listener will call this every time a new event is received
def sampleEventByUserId(event: Event) = {
//Process all events for 3% percent of users
if (userIdToRandomNumber(event.user.userId) <= 0.03) {
processEvent(event)
}
}
There are issues with this code though (hashCode may favor shorter strings, modulo arithmetic is discretizing value so its not exactly p, etc.).
Was is the "more correct" way of finding a deterministic mapping of userId
s to a random number for the function userIdToRandomNumber
above?