Re-assign unique values - pandas DataFrame

Question

I am trying to assign unique values in pandas df to specific individuals.

For the df below, [Area] and [Place] will together make up unique values that are various jobs. These values will be assigned to individuals with the overall aim to use the least amount of individuals possible.

The trick is these values are constantly starting and finishing and go for different lengths of time. The most unique values assigned to an individual any one time is 3. [On] displays how many current unique values for [Place] and [Area] are occurring.

So this provides a concrete guide on how many individuals I need. e.g. 3 unique values one = 1 person, 6 unique values on = 2 persons

I can't do a groupby statement where I assign the first 3 unique values to individual 1 and the next 3 unique values to individual 2 etc.

What I envisage is, when unique values are greater than 3 I want to group values in [Area] first and then combine the leftovers. So look to assign same values in [Area] to an individual (up to 3). Then, if there are _leftover_ values (<3), they should be combined to make a group of 3, where possible.

The way I envisage this working is: see into the future by an hour. For each new row of values the script should see how many values will be [On](this provides an indication of how many total individuals are required). Where unique values are >3, they should be assigned by grouping the same value in [Area]. If there are leftover values they should be combined anyhow to make up to a group of 3.

For the df below, the number of unique values occurring for [Place] and [Area] varies between 1-6. So we should never have more than 2 individuals assigned. When unique values are >3 it should be assigned by [Area] first. The leftover values should be combined with other individuals that have less than 3 unique values.

Apologies for the large df. It's the only way i can replicate the problem!

import pandas as pd
import numpy as np
from collections import Counter

d = ({   
    'Time' : ['8:03:00','8:17:00','8:20:00','8:33:00','8:47:00','8:48:00','9:03:00','9:15:00','9:18:00','9:33:00','9:45:00','9:48:00','10:03:00','10:15:00','10:15:00','10:15:00','10:18:00','10:32:00','10:33:00','10:39:00','10:43:00','10:48:00','10:50:00','11:03:00','11:03:00','11:07:00','11:25:00','11:27:00','11:42:00','11:48:00','11:51:00','11:57:00','12:00:00','12:08:00','12:15:00','12:17:00','12:25:00','12:30:00','12:35:00','12:39:00','12:47:00','12:52:00','12:55:00','13:00:00','13:03:00','13:07:00','13:12:00','13:15:00','13:22:00','13:27:00','13:27:00'],
    'Area' : ['A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','B','A','B','A','A','A','A','B','A','A','B','B','A','B','C','A','B','C','C','A','B','C','C','B','A','C','B','C','C','A','C','B','C','C','A','C'],
    'Place' : ['House 1','House 2','House 3','House 1','House 3','House 2','House 1','House 3','House 2','House 1','House 3','House 2','House 1','House 3','House 4','House 1','House 2','House 1','House 1','House 4','House 3','House 2','House 1','House 1','House 4','House 1','House 1','House 4','House 1','House 1','House 4','House 1','House 2','House 1','House 4','House 1','House 1','House 2','House 1','House 4','House 1','House 1','House 3','House 2','House 4','House 1','House 2','House 4','House 1','House 4','House 2'],
    'On' : ['1','2','3','3','3','3','3','3','3','3','3','3','3','3','4','5','5','5','5','5','5','4','3','3','3','2','2','2','2','3','3','3','4','4','4','4','4','4','4','4','4','4','4','4','4','4','5','6','6','6','6'],
    'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 2','Person 3','Person 1','Person 3','Person 1','Person 2','Person 1','Person 1','Person 3','Person 1','Person 2','Person 3','Person 3','Person 2','Person 3','Person 4','Person 2','Person 3','Person 4','Person 4','Person 2','Person 3','Person 4','Person 4','Person 3','Person 2','Person 4','Person 3','Person 4','Person 4','Person 2','Person 4','Person 3','Person 5','Person 4','Person 2','Person 4'],
    })

df = pd.DataFrame(data=d)

def getAssignedPeople(df, areasPerPerson):
    areas = df['Area'].values
    places = df['Place'].values
    times = pd.to_datetime(df['Time']).values
    maxPerson = np.ceil(areas.size / float(areasPerPerson)) - 1
    assignmentCount = Counter()
    assignedPeople = []
    assignedPlaces = {}
    heldPeople = {}
    heldAreas = {}
    holdAvailable = True
    person = 0

    # search for repeated areas. Mark them if the next repeat occurs within an hour
    ixrep = np.argmax(np.triu(areas.reshape(-1, 1)==areas, k=1), axis=1)
    holds = np.zeros(areas.size, dtype=bool)
    holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')

    for area,place,hold in zip(areas, places, holds):
        if (area, place) in assignedPlaces:
            # this unique (area, place) has already been assigned to someone
            assignedPeople.append(assignedPlaces[(area, place)])
            continue

        if assignmentCount[person] >= areasPerPerson:
            # the current person is already assigned to enough areas, move on to the next
            a = heldPeople.pop(person, None)
            heldAreas.pop(a, None)
            person += 1

        if area in heldAreas:
            # assign to the person held in this area
            p = heldAreas.pop(area)
            heldPeople.pop(p)
        else:
            # get the first non-held person. If we need to hold in this area, 
            # also make sure the person has at least 2 free assignment slots,
            # though if it's the last person assign to them anyway 
            p = person
            while p in heldPeople or (hold and holdAvailable and (areasPerPerson - assignmentCount[p] < 2)) and not p==maxPerson:
                p += 1

        assignmentCount.update([p])
        assignedPlaces[(area, place)] = p
        assignedPeople.append(p)

        if hold:
            if p==maxPerson:
                # mark that there are no more people available to perform holds
                holdAvailable = False

            # this area recurrs in an hour, mark that the person should be held here
            heldPeople[p] = area
            heldAreas[area] = p

    return assignedPeople

def allocatePeople(df, areasPerPerson=3):
    assignedPeople = getAssignedPeople(df, areasPerPerson=areasPerPerson)
    df = df.copy()
    df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]
    return df

print(allocatePeople(df))

Output:

        Time Area    Place On    Person
0    8:03:00    A  House 1  1  Person 1
1    8:17:00    A  House 2  2  Person 1
2    8:20:00    A  House 3  3  Person 1
3    8:33:00    A  House 1  3  Person 1
4    8:47:00    A  House 3  3  Person 1
5    8:48:00    A  House 2  3  Person 1
6    9:03:00    A  House 1  3  Person 1
7    9:15:00    A  House 3  3  Person 1
8    9:18:00    A  House 2  3  Person 1
9    9:33:00    A  House 1  3  Person 1
10   9:45:00    A  House 3  3  Person 1
11   9:48:00    A  House 2  3  Person 1
12  10:03:00    A  House 1  3  Person 1
13  10:15:00    A  House 3  3  Person 1
14  10:15:00    A  House 4  4  Person 2
15  10:15:00    B  House 1  5  Person 2
16  10:18:00    A  House 2  5  Person 1
17  10:32:00    B  House 1  5  Person 2
18  10:33:00    A  House 1  5  Person 1
19  10:39:00    A  House 4  5  Person 2
20  10:43:00    A  House 3  5  Person 1
21  10:48:00    A  House 2  4  Person 1
22  10:50:00    B  House 1  3  Person 2
23  11:03:00    A  House 1  3  Person 1
24  11:03:00    A  House 4  3  Person 2
25  11:07:00    B  House 1  2  Person 2
26  11:25:00    B  House 1  2  Person 2
27  11:27:00    A  House 4  2  Person 2
28  11:42:00    B  House 1  2  Person 2
29  11:48:00    C  House 1  3  Person 2
30  11:51:00    A  House 4  3  Person 2
31  11:57:00    B  House 1  3  Person 2
32  12:00:00    C  House 2  4  Person 3
33  12:08:00    C  House 1  4  Person 2
34  12:15:00    A  House 4  4  Person 2
35  12:17:00    B  House 1  4  Person 2
36  12:25:00    C  House 1  4  Person 2
37  12:30:00    C  House 2  4  Person 3
38  12:35:00    B  House 1  4  Person 2
39  12:39:00    A  House 4  4  Person 2
40  12:47:00    C  House 1  4  Person 2
41  12:52:00    B  House 1  4  Person 2
42  12:55:00    C  House 3  4  Person 3
43  13:00:00    C  House 2  4  Person 3
44  13:03:00    A  House 4  4  Person 2
45  13:07:00    C  House 1  4  Person 2
46  13:12:00    B  House 2  5  Person 3
47  13:15:00    C  House 4  6  Person 4
48  13:22:00    C  House 1  6  Person 2
49  13:27:00    A  House 4  6  Person 2
50  13:27:00    C  House 2  6  Person 3

Intended Output and Comments on why I think it should be assigned:

There is just too much code and text here. Please reduce this to a [mcve] with some simpler data that can be easily digested. — cs95, Jan 11 '19 at 15:54
You didn't post the dataframe `df` that's giving you the output that you posted. Please add that, since the question is unanswerable without it. — tel, Jan 12 '19 at 01:38
Also, while it was a fun challenge to figure out the word soup in your [last question](https://stackoverflow.com/q/52730864/425458), it was kind of like trying to decipher the Dead Sea Scrolls. I'm going to echo coldspeed and ask you to read the [MCV example](https://stackoverflow.com/help/mcve) documentation and take it to heart. Follow it like a bible the next time you write an SO question. This will help people to write better, more general answers that can help other answer seekers in the future. — tel, Jan 12 '19 at 01:41
Currently your question is unanswerable, we will give answers only if we know how `df` looks like, please please post `df`... — U13-Forward, Jan 13 '19 at 02:26
My apologies on this oversight everyone. I have added the example `df` — , Jan 13 '19 at 22:36
@tel, dead sea scrolls is a bit rough but I'll take that on board. It definitely wasn't without trying. I wrote and re-wrote the question numerous times to ensure adequate info was included — , Jan 13 '19 at 22:41
@PeterJames123 You've run into an edge case for how holds are determined that the current code isn't handling correctly. `person 2` is getting held in area `A` for the sake of a job that's already been assigned to `person 1`. I posted an answer with a fix. — tel, Jan 15 '19 at 21:24
@PeterJames123 Sorry if my previous comment was a bit too salty (I'm dealing with my own problems right now). More constructively, I can say that your previous question would have been better posed as a set of 10 desired inputs/outputs. Your plain text descriptions simply had too many unspecified edge cases. For something this complicated, every time you list a requirement/constraint, you should accompany it with some kind of example (code or desired input/output) that unambiguously clarifies what you want. — tel, Jan 15 '19 at 21:52
@tel, all good, I had a bit of a chuckle at the dead sea scrolls. I initially did post a few various input/outputs but got some feedback saying to strip it back to one sample input/output. — , Jan 16 '19 at 00:01
@PeterJames123 Sadly, learning proper etiquette on SO is like threading a needle. I've figured out how to write "acceptable" answers, but people still often downvote my questions. Some will complain about length, but you have a complex problem, so it does require some quantity of explanation. More important than overall length is removing redundant stuff. For example, the `on` column of your dataframes is not necessary (it can be calculated if/when needed), nor is the explanation relating to it. It suffices to say "at most 3 unique (area, place) pairs per person". — tel, Jan 16 '19 at 01:41
@tel. Thanks for your help with this. It's incredibly complex to me. Your code thus far is brilliant. — , Jan 16 '19 at 06:08
@tel, is it possible to move this conversion to a chat? I don't want to fill up the comments. I think the `[On]` `Column` is necessary. Is there a way to display a larger input `df` — , Jan 16 '19 at 23:34
@tel, I'm still having some issues. I've tried to replicate the all the cases in smaller `df's` but I just wanted to display an actual dataset that I'm dealing with. I'm not sure if this is helpful but it represents that variable change of meetings and how the people assigned are related to how `[On]` — , Jan 18 '19 at 00:58
@PeterJames123 I've thought about the problem with `On` some more. When `On` goes down, I know that **a** job has ended, but I don't know **which** job has ended. When one of the jobs belonging to person `i` ends, I need to be able to mark that person `i` is available for more work (ie by decrementing `assignmentCount[i]`). There's waaaay too many ambiguous cases as `On` currently stands. — tel, Jan 20 '19 at 17:17
@PeterJames123 I made a [chat room for this question](https://chat.stackoverflow.com/rooms/187022/schedule-all-the-things). Come join me. Let us discuss the finer points of scheduling, and see if we can't hash something out. — tel, Jan 21 '19 at 02:27

tel · Answer 1 · 2019-01-16T06:56:31.240

There's a live version of this answer online that you can try for yourself.

The problem

The bug you're seeing is due to (yet another) interesting edge case of your problem. During the 6th job, the code assigns person 2 to (A, House 4). It then then sees that area A repeats within an hour, so it holds person 2 in that area. This makes person 2 unavailable for the next job, which is in area B.

However, there's no reason to hold person 2 in area A for the sake of a job that occurs in (A, House 1), since the unique combination of area and place (A, House 1) has already been assigned to person 1.

The solution

The problem can be fixed by considering only unique combinations of area and place when deciding when to hold a person in an area. Only a couple of lines of code have to change.

First, we construct a list of areas that correspond to the unique (area, place) pairs:

unqareas = df[['Area', 'Place']].drop_duplicates()['Area'].values

Then we just substitute unqareas for areas in the first line of the code that identifies holds:

ixrep = np.argmax(np.triu(unqareas.reshape(-1, 1)==unqareas, k=1), axis=1)

Complete listing/test

import pandas as pd
import numpy as np
from collections import Counter

d = ({
     'Time' : ['8:03:00','8:07:00','8:10:00','8:23:00','8:27:00','8:30:00','8:37:00','8:40:00','8:48:00'],
     'Place' : ['House 1','House 2','House 3','House 1','House 2','House 3','House 4','House 1','House 1'],
     'Area' : ['A','A','A','A','A','A','A','B','A'],
     'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 2','Person 3','Person 1'],
     'On' : ['1','2','3','3','3','3','4','5','5']
     })

df = pd.DataFrame(data=d)

def getAssignedPeople(df, areasPerPerson):
    areas = df['Area'].values
    unqareas = df[['Area', 'Place']].drop_duplicates()['Area'].values
    places = df['Place'].values
    times = pd.to_datetime(df['Time']).values

    maxPerson = np.ceil(areas.size / float(areasPerPerson)) - 1
    assignmentCount = Counter()
    assignedPeople = []
    assignedPlaces = {}
    heldPeople = {}
    heldAreas = {}
    holdAvailable = True
    person = 0

    # search for repeated areas. Mark them if the next repeat occurs within an hour
    ixrep = np.argmax(np.triu(unqareas.reshape(-1, 1)==unqareas, k=1), axis=1)
    holds = np.zeros(areas.size, dtype=bool)
    holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')

    for area,place,hold in zip(areas, places, holds):
        if (area, place) in assignedPlaces:
            # this unique (area, place) has already been assigned to someone
            assignedPeople.append(assignedPlaces[(area, place)])
            continue

        if assignmentCount[person] >= areasPerPerson:
            # the current person is already assigned to enough areas, move on to the next
            a = heldPeople.pop(person, None)
            heldAreas.pop(a, None)
            person += 1

        if area in heldAreas:
            # assign to the person held in this area
            p = heldAreas.pop(area)
            heldPeople.pop(p)
        else:
            # get the first non-held person. If we need to hold in this area, 
            # also make sure the person has at least 2 free assignment slots,
            # though if it's the last person assign to them anyway 
            p = person
            while p in heldPeople or (hold and holdAvailable and (areasPerPerson - assignmentCount[p] < 2)) and not p==maxPerson:
                p += 1

        assignmentCount.update([p])
        assignedPlaces[(area, place)] = p
        assignedPeople.append(p)

        if hold:
            if p==maxPerson:
                # mark that there are no more people available to perform holds
                holdAvailable = False

            # this area recurrs in an hour, mark that the person should be held here
            heldPeople[p] = area
            heldAreas[area] = p

    return assignedPeople

def allocatePeople(df, areasPerPerson=3):
    assignedPeople = getAssignedPeople(df, areasPerPerson=areasPerPerson)
    df = df.copy()
    df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]
    return df

print(allocatePeople(df))

Output:

      Time    Place Area    Person On
0  8:03:00  House 1    A  Person 1  1
1  8:07:00  House 2    A  Person 1  2
2  8:10:00  House 3    A  Person 1  3
3  8:23:00  House 1    A  Person 1  3
4  8:27:00  House 2    A  Person 1  3
5  8:30:00  House 3    A  Person 1  3
6  8:37:00  House 4    A  Person 2  4
7  8:40:00  House 1    B  Person 2  5
8  8:48:00  House 1    A  Person 1  5

Re-assign unique values - pandas DataFrame

1 Answers1

The problem

The solution

Complete listing/test

Linked