I am looking for an algorithm that fairly samples p percent of users from an infinite list of users.

A naive algorithm looks something like this:

//This is naive.. what is a better way??
def userIdToRandomNumber(userId: Int): Double = (userId.toString.hashCode % 1000) / 1000.0

//An event listener will call this every time a new event is received
def sampleEventByUserId(event: Event) = {
    //Process all events for 3 percent of users
    if (userIdToRandomNumber(event.user.userId) <= 0.03) {
        processEvent(event)
    }
}

There are issues with this code though (hashCode may favor shorter strings, the modulo arithmetic discretizes the value so it's not exactly p, etc.).

What is the "more correct" way of finding a deterministic mapping from userIds to a random number for the function userIdToRandomNumber above?

anthonybell

3 Answers

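Here is a very simple mapping, assuming your dataset is large enough: for each user, generate a random number x uniformly in [0, 1], and sample the user if x falls below p.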

This method is used in practice on large datasets, and gives you entirely random results!

I am hoping you can easily code this in Scala.


EDIT: In the comments, you mention deterministic. I am interpreting that to mean if you sample again, it gives you the same results. For that, simply store x for each user.

Also, this will work for any number of users (even infinite). You just need to generate x for each user once. The mapping is simply userId -> x.
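A minimal sketch of the stored-draw idea described above, under some assumptions: the class name `StoredDrawSampler` and its methods are hypothetical, and the in-memory map is only a stand-in for whatever store (for example a database column) would hold x in practice. The map grows with the number of distinct users seen.

```scala
import scala.collection.mutable
import scala.util.Random

// Hypothetical helper: remembers one uniform draw x per userId so that
// repeated lookups for the same user always return the same value.
final class StoredDrawSampler(p: Double, rng: Random = new Random()) {
  // Stand-in for a persistent userId -> x table (an assumption of this sketch).
  private val draws = mutable.Map.empty[Int, Double]

  // x is generated the first time a user is seen and reused afterwards.
  def drawFor(userId: Int): Double =
    draws.getOrElseUpdate(userId, rng.nextDouble())

  // A user is in the sample when their stored draw falls below p.
  def isSampled(userId: Int): Boolean = drawFor(userId) < p
}
```

With `val sampler = new StoredDrawSampler(0.03)`, calling `sampler.isSampled(event.user.userId)` gives the same answer every time for a given user within that process.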

EDIT2: The algorithm in your question is biased. Suppose p = 10%, and there are 1100 users (userIds 1-1100). The first 1000 userIds have a 10% chance of getting picked, the next 100 have a 100% chance. Also, the hashing will map user ids to new values, but there is still no guarantee that modulo 1000 would give you a uniform sample!

xyz
  • thanks for the quick reply, but my question is specifically how do I map `userId -> [0, 1]` in a completely random way (albeit, the same user should always map to the same value). I do not know in advance what the userIds are so I need a deterministic way to do this mapping. – anthonybell Nov 16 '16 at 21:09
  • @anthonybell You said randomly sample? By deterministic, do you mean same samples if you rerun? – xyz Nov 16 '16 at 21:11
  • The number of users is potentially infinite since it is an endless stream of users. – anthonybell Nov 16 '16 at 21:14
  • @anthonybell It works for that as well. This algo just requires you to process every user once! Also, I updated the answer to clarify the mapping criteria. – xyz Nov 16 '16 at 21:17
  • The question is "how to generate the random number?" that samples fairly. I already have an algorithm in my question that does what you are describing, I am looking for one that does not have the biases that I mentioned in the question. – anthonybell Nov 16 '16 at 21:18
  • @anthonybell How is this biased? Every user has a fair chance *p* of getting sampled. The algorithm in your question is however biased. I added why in my answer as well. – xyz Nov 16 '16 at 21:20
  • I have already explained why my algorithm is biased "hashCode may favor shorter strings", but your logic in EDIT2 is incorrect, notice the `1001.toString.hashCode = 1507424`, not `= 1001`. – anthonybell Nov 16 '16 at 21:39
  • @anthonybell You missed my point. You don't need to generate a random number from a user id. A pure random number works. – xyz Nov 16 '16 at 21:44
  • I think your algorithm is assuming your list of users is small enough that you can keep a table of mappings of the form `userId -> x`. This is wasteful of memory and not feasible if your user list is infinite. Also this is a problem in a distributed environment such as Kafka because now all your nodes will store different numbers for the same user, it would have to somehow sync all the tables across nodes. – anthonybell Nov 16 '16 at 21:54
  • @anthonybell It works very well for any distributed environment. Just store the random number with the user id in the database. And since your list is too large, you would be storing a bunch of doubles **on disk**. That's actually very little memory. Here's another way to look at it: for every user prepend the random number (say 0-99 instead of a float) to the user id. like `newid = [0-99][origUserId]`. – xyz Nov 16 '16 at 21:57
Try the methods below instead of hashCode. Even for short strings, the values of the characters as integers ensure that the sum goes over 100. They also avoid the division, so you avoid rounding errors.

  def inScope(s: String, p: Double) = modN(s, 100) < p * 100

  def modN(s: String, n: Int): Int = {
    var sum = 0
    for (c <- s) { sum += c }
    sum % n
  }
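
A short usage sketch for the two methods above. The wrapper object `CharSumSampling` and the helper `sampledFraction` are hypothetical names added here; the helper is just one way to measure which fraction of ids in a range ends up in scope, so the result can be compared against p.

```scala
object CharSumSampling {
  // Same logic as modN above: sum of the character codes, reduced modulo n.
  def modN(s: String, n: Int): Int = s.map(_.toInt).sum % n

  def inScope(s: String, p: Double): Boolean = modN(s, 100) < p * 100

  // Hypothetical helper: fraction of ids in [1, maxId] that are in scope for p.
  def sampledFraction(maxId: Int, p: Double): Double =
    (1 to maxId).count(i => inScope(i.toString, p)).toDouble / maxId
}

// "12345" has character codes 49 + 50 + 51 + 52 + 53 = 255, so modN gives 55
// and the id is not in scope for p = 0.03.
val notPicked = CharSumSampling.inScope("12345", 0.03)
```

Calling `CharSumSampling.sampledFraction(100000, 0.03)` on a range that resembles your real ids shows how close the sampled share comes to the target.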
radumanolescu
I have come up with a deterministic solution to randomly sample users from a stream that is completely random (assuming the random number generator is completely random):

import scala.util.Random

// x is typed as Any so that an Int userId can be passed directly
def sample(x: Any, percent: Double): Boolean = {
    new Random(seed = x.hashCode).nextFloat() <= percent
}

//sample 3 percent of users
if (sample(event.user.userId, 0.03)) {
    processEvent(event)
}
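
A possible variant, sketched under assumptions and not the code above: the first value drawn from a freshly seeded `Random` can be quite similar for nearby seeds, so this version swaps the generator for a mixing hash, `scala.util.hashing.MurmurHash3.stringHash`, and scales the 32-bit result into [0, 1). The object name `HashSampling` and the method `unitInterval` are hypothetical. It keeps no state, so every node computes the same value for the same id.

```scala
import scala.util.hashing.MurmurHash3

object HashSampling {
  // Maps an id to a value in [0, 1) using only the id itself.
  def unitInterval(id: String): Double = {
    val h = MurmurHash3.stringHash(id)        // 32-bit signed hash of the id
    (h & 0xffffffffL).toDouble / 4294967296.0 // read as unsigned, divide by 2^32
  }

  // percent is a fraction, e.g. 0.03 for 3 percent, as in sample above.
  def sample(id: String, percent: Double): Boolean =
    unitInterval(id) < percent
}
```

Usage would mirror the snippet above, for example `HashSampling.sample(event.user.userId.toString, 0.03)`.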
anthonybell