I'm running Python code that's similar to:

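import numpy

def get_user_group(user, groups):
    # Assign a group only once; later calls reuse the stored assignment.
    if not user.group_id:
        user.group_id = assign(groups)
    return user.group_id

def assign(groups):
    ids = []
    percentages = []
    for group in groups:
        ids.append(group.id)
        percentages.append(group.percentage) # e.g. .33

    # Draw a single group id, weighted by the group percentages.
    assignment = numpy.random.choice(ids, p=percentages)
    return assignment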

We are running this in the wild against tens of thousands of users. I've noticed that the assignments do not respect the actual group percentages. For example, if our percentages are [.9, .1], we see a consistent hour-over-hour split of 80% and 20%. We've confirmed that the inputs to the choice function are correct, so the observed split does not match what we pass in.

Does anyone have a clue why this could be happening? Is it because we are using the global numpy? Some groups will be split between [.9, .1] while others are [.33,.34,.33] etc. Is it possible that different sets of groups are interfering with each other?

We are running this code in a Python Flask web application on a number of nodes.

Any recommendations on how to get reliable "random" weighted choice?

Layla
  • With just a couple of functions like this, the problem isn't reproducible. But generating one random value at a time is, even if correct, inefficient. `np.random` functions work best when you ask for a large `size`, many values at a time. And with `choice` there's the option of `replace` or not. I'm not following your problem well enough to say whether these factors affect your values, but I think you should reevaluate your approach to make it more `numpy` optimal. – hpaulj Sep 20 '21 at 23:33
  • @hpaulj my team has load tested this code in a few different environments and cannot reproduce it either. My intention was less to ask for debugging via reproduction and more to ask why this might be happening in a live Python web application, given the properties of numpy. Alternatively, tips on what to use to generate single random numbers out of a weighted selection of options in such an environment would be helpful as well. We're focused on correct functionality first and efficiency later. We have considered using `random.choices` instead, which we are testing out now. – Layla Sep 21 '21 at 19:40
  • Just an update here that python's `random.choices` method worked as expected and that is what we are using for this task. – Layla Nov 23 '21 at 19:27
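
For readers who want to see the `random.choices` approach mentioned in the comment above, here is a minimal sketch. The `Group` class and the sample groups are hypothetical stand-ins for the real objects, which are only assumed to expose `id` and `percentage`; this is not the poster's actual code.

```python
import random
from collections import Counter
from dataclasses import dataclass

@dataclass
class Group:
    # Hypothetical stand-in for the real group objects.
    id: str
    percentage: float

def assign(groups):
    """Pick one group id, weighted by each group's percentage."""
    ids = [group.id for group in groups]
    weights = [group.percentage for group in groups]
    # random.choices returns a list of k items; take the single draw.
    return random.choices(ids, weights=weights, k=1)[0]

groups = [Group("control", 0.9), Group("variant", 0.1)]

# Sanity check: the observed share should be close to the requested 90%.
n_draws = 20_000
counts = Counter(assign(groups) for _ in range(n_draws))
control_share = counts["control"] / n_draws
print(f"control share: {control_share:.3f}")
```

A check like this, run inside the deployed application rather than in isolation, may help narrow down whether the skew comes from the sampling call or from something around it.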

1 Answer

This is too long for a comment, so I am posting it as an answer.

The fact that your team could not reproduce the problem, and got the expected results in testing, suggests that NumPy itself can most likely meet your needs. You can benefit from NumPy later, when you need efficiency; it is clear that efficiency is not your concern right now.

A more complete description of your code and of the infrastructure setup on your nodes would be helpful, though. How often do you restart your Flask server? Where do you initialize the NumPy random generator? Consider the following code, which creates a page /random that can be customized with a size parameter, e.g. localhost:5000/random?size=20:

from flask import Flask, request
import numpy
import pandas

app = Flask(__name__)

... # your webapp

# Seed the global generator once, when the app starts.
numpy.random.seed(0)

@app.route('/random', methods=['GET'])
def random():
    """Gives the desired number of random numbers
    with the state of the random number generator.
    """
    # DON'T PUT numpy.random.seed(0) HERE
    # Read ?size=N from the query string; default to one value.
    size = request.args.get('size')

    if size is not None:
        size = int(size)
    else:
        size = 1

    # Capture the state before drawing, so it can be shown with the numbers.
    state = numpy.random.get_state()
    data = numpy.random.random(size=size)

    table = pandas.DataFrame(data=data)

    return table.to_html() + repr(state)

In this example, the state is initialized once, when the Flask app starts. Whenever the /random page is requested, fresh random numbers are drawn from the continuing sequence.

If you put the state initialization inside the function, it would surely cause unexpected distributions, because every request would get the same random numbers (and the same choices).
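
The effect of re-seeding on every call can be illustrated with a small sketch. The group ids and percentages here are made up for the demonstration, and the two helper functions are hypothetical names, not part of the question's code.

```python
import numpy

ids = ["a", "b"]          # illustrative group ids
percentages = [0.9, 0.1]  # illustrative weights

def assign_reseeding():
    # Anti-pattern: resetting the global state on every call
    # makes every call return the same draw.
    numpy.random.seed(0)
    return numpy.random.choice(ids, p=percentages)

def assign_seeded_once():
    # Relies on a global state that was seeded once, elsewhere.
    return numpy.random.choice(ids, p=percentages)

# Re-seeding each time: 1000 calls, but only one distinct result.
reseeded_results = {assign_reseeding() for _ in range(1000)}
print(len(reseeded_results))

# Seeding once: the observed share tracks the requested 90%.
numpy.random.seed(0)
draws = [assign_seeded_once() for _ in range(10_000)]
share_a = draws.count("a") / len(draws)
print(f"share of 'a': {share_a:.3f}")
```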

If you use multiple nodes and initialize them with the same seed, the different nodes will produce the same sequence of choices. In this case, use the unique node IDs as seed values. If you restart the servers often, concatenate the restart ID or timestamp to the unique node ID. It is also a good idea to log the timestamp, so that a run can be reproduced later.
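
A minimal sketch of the per-node seeding described above, assuming integer node IDs and a Unix timestamp taken at restart (both hypothetical here, as is the `make_node_rng` helper). Note that it swaps the legacy `numpy.random.seed` call for NumPy's newer `SeedSequence` and `default_rng` interface, which accepts a list of integers as entropy, so no manual concatenation is needed.

```python
import time
import numpy

def make_node_rng(node_id, restart_timestamp):
    """Build a generator whose seed combines node ID and restart time."""
    seed_seq = numpy.random.SeedSequence([node_id, restart_timestamp])
    return numpy.random.default_rng(seed_seq)

# Hypothetical values: log these so that a run can be replayed later.
restart_timestamp = int(time.time())
rng_node_1 = make_node_rng(1, restart_timestamp)
rng_node_2 = make_node_rng(2, restart_timestamp)

ids = ["a", "b"]          # illustrative group ids
percentages = [0.9, 0.1]  # illustrative weights

# Each node draws from its own independent stream.
draws_1 = rng_node_1.choice(ids, size=1000, p=percentages)
draws_2 = rng_node_2.choice(ids, size=1000, p=percentages)

# Rebuilding the generator from the logged values replays node 1's draws.
replay_1 = make_node_rng(1, restart_timestamp).choice(ids, size=1000, p=percentages)
print((replay_1 == draws_1).all())
```

Keeping one generator object per process, created at startup, avoids relying on the shared global state at all.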

DanielTuzes
  • Thanks so much for this detailed breakdown. We were actually not seeding numpy.random at all, which is possibly the root cause of the issue. What we did end up doing that worked, however, was using Python's random library. That worked beautifully for our use case. – Layla Nov 16 '21 at 16:49
  • for future readers, answers to your questions are as follows: How often do you restart your Flask server? - They restart upon redeploy, likely weekly. Where do you initialize the NumPy random generator? We didn't! We didn't call seed to avoid reproducibility (though I realize we could have been more creative there) – Layla Nov 16 '21 at 16:50
  • Seeding will occur regardless of what you do. The only question is whether you let the program use its own source of randomness or provide one yourself. – DanielTuzes Nov 16 '21 at 19:44
  • Got it. Then it was seeding on its own. My theory was that, because we were calling choice for various different percentage distributions (e.g. one would be [10,90] and another [20,20,40]), they were somehow interfering with each other. It was bizarre that we would consistently get an 80/20 split in practice despite verifiably using 90/10 in the function call. – Layla Nov 23 '21 at 19:25