
I have a script that assigns a value based on two columns in a pandas df. The code below implements the first step, but I'm struggling with the second.

So the script should initially:

1) For each individual string in [Area], assign the first 3 unique values in [Place] to a Person

2) Look to reassign People with fewer than 3 unique values. Example: the df below has 6 unique values across [Area] and [Place], but 3 People are assigned. Ideally, 2 People would have 3 unique values each.

import pandas as pd
import numpy as np

d = ({'Time': ['8:03:00', '8:17:00', '8:20:00', '10:15:00', '10:15:00', '11:48:00', '12:00:00', '12:10:00'],
      'Place': ['House 1', 'House 2', 'House 1', 'House 3', 'House 4', 'House 5', 'House 1', 'House 1'],
      'Area': ['X', 'X', 'Y', 'X', 'X', 'X', 'X', 'X']})

df = pd.DataFrame(data=d)

def g(gps):
    # within one Area, number each unique Place and bucket them in threes
    s = gps['Place'].unique()
    d = dict(zip(s, np.arange(len(s)) // 3 + 1))
    gps['Person'] = gps['Place'].map(d)
    return gps

df = df.groupby('Area', sort=False).apply(g)
# make the per-Area person numbers globally unique, then relabel as Person 1..n
s = df['Person'].astype(str) + df['Area']
df['Person'] = pd.Series(pd.factorize(s)[0] + 1).map(str).radd('Person ')

Output:

       Time    Place Area    Person
0   8:03:00  House 1    X  Person 1
1   8:17:00  House 2    X  Person 1
2   8:20:00  House 1    Y  Person 2
3  10:15:00  House 3    X  Person 1
4  10:15:00  House 4    X  Person 3
5  11:48:00  House 5    X  Person 3
6  12:00:00  House 1    X  Person 1
7  12:10:00  House 1    X  Person 1

As you can see, the first step works fine: for each individual string in [Area], the first 3 unique values in [Place] are assigned to a Person. This leaves Person 1 with 3 values, Person 2 with 1 value and Person 3 with 2 values.

The second step is where I'm struggling.

If a Person has fewer than 3 unique values assigned to them, alter this so each Person has up to 3 unique values (a counting sketch follows the intended output below).

Intended Output:

       Time    Place Area    Person
0   8:03:00  House 1    X  Person 1
1   8:17:00  House 2    X  Person 1
2   8:20:00  House 1    Y  Person 2
3  10:15:00  House 3    X  Person 1
4  10:15:00  House 4    X  Person 2
5  11:48:00  House 5    X  Person 2
6  12:00:00  House 1    X  Person 1
7  12:10:00  House 1    X  Person 1
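
For reference, a minimal sketch (not part of the script above) that counts unique (Place, Area) values per assigned Person, which is the quantity step 2 needs to inspect:

# count unique (Place, Area) combinations per Person; any count below 3
# marks a Person whose values are candidates for reassignment
combo = df['Place'] + df['Area']
counts = combo.groupby(df['Person']).nunique()
print(counts[counts < 3])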
Jeremyp123
  • I've read the question 10 times and I still don't understand the logic you are trying to explain. What are the input columns (Place & Area?)? I am especially confused by the rule with the unique values within 1 hour. – kevh Sep 24 '18 at 20:37
  • @kevh I've expanded the description. I didn't want to confuse readers with too much info. Is the question clearer or still confusing? –  Sep 24 '18 at 23:36
  • Still confusing. Maybe you want to group persons instead of combining them? – 576i Sep 25 '18 at 07:18
  • Does the description after the intended output make sense? Can you see how I get the output at all? Can you be a bit more specific about what you don't understand? –  Sep 25 '18 at 09:06
  • @PeterJames123 could another valid solution be `Index 2` = `Person 1` (with subsequent rows then different)? If not, what is the constraint? – jamesj629 Sep 25 '18 at 21:09
  • @jamesj629 no, sorry. The 1st aim is to assign in groups of 3. For the 'offcuts/leftovers' I want to combine or reassign. –  Sep 25 '18 at 23:29
  • Still confused. – Alex Sep 26 '18 at 02:40
  • I've stripped the content back significantly. Can you be more specific about what you're confused about? Which index in the intended output is confusing you? –  Sep 26 '18 at 03:40
  • In your first step you make sure that persons get assignments within one area. In the next step you are OK to ignore this condition to make sure every person gets 3 assignments. However, it looks like you put an additional condition that it would take at least 1 hr to get from one area to another? Correct? Otherwise, why don't you just ignore your area condition in the first step? – Sergey Sep 26 '18 at 03:56
  • I've removed the 3rd condition. So we just have the first 2 –  Sep 26 '18 at 23:13
  • Now the `Area` column of output and intended output do not match. And that’s an input column. – pietroppeter Sep 27 '18 at 06:05
  • I would assume the correct column for Area is X X Y X Z Z. Is that right? – pietroppeter Sep 27 '18 at 06:07
  • @Pietro P. Your second attempt grouped people in 3s regardless of the first rule. I need something that groups people _after_ people have been assigned. Outputs have been amended. Sorry for the confusion. –  Sep 27 '18 at 07:03
  • Yeah, I think I will start from scratch trying to answer. Still think the two outputs do not match (Y in wrong position in one of the two). Let me ask you this: if I have previously assigned house 1 and area X to Person 1, can it happen that I need to reassign a subsequent occurrence of these two values to a new Person? – pietroppeter Sep 27 '18 at 07:11
  • What would happen if you had only 3 rows (3 areas with 1 place in each area)? How many people would that be - 1 or 3? – Borislav Aymaliev Sep 27 '18 at 12:59
  • My datasets are between 300-500 rows so that wouldn't ever happen. But for argument's sake it would be 1. –  Sep 28 '18 at 00:19

4 Answers


Current attempt

In the following I have added a few lines before the last line of your code:

d = ({'Time': ['8:03:00', '8:17:00', '8:20:00', '10:15:00', '10:15:00', '11:48:00', '12:00:00', '12:10:00'],
      'Place': ['House 1', 'House 2', 'House 1', 'House 3', 'House 4', 'House 5', 'House 1', 'House 1'],
      'Area': ['X', 'X', 'Y', 'X', 'X', 'X', 'X', 'X']})

df = pd.DataFrame(data=d)


def g(gps):
    s = gps['Place'].unique()
    d = dict(zip(s, np.arange(len(s)) // 3 + 1))
    gps['Person'] = gps['Place'].map(d)
    return gps


df = df.groupby('Area', sort=False).apply(g)
s = df['Person'].astype(str) + df['Area']

# added lines
t = s.value_counts()
# rows whose (person, area) group holds fewer than 3 entries are the "leftovers"
df_sub = df.loc[s[s.isin(t[t < 3].index)].index].copy()
df_sub["tag"] = df_sub["Place"] + df_sub["Area"]
tags = list(df_sub.tag.unique())
# rebucket the leftover (Place, Area) tags in threes under temporary labels
f = lambda x: f'R{tags.index(x) // 3 + 1}'
df_sub['reassign'] = df_sub.tag.apply(f)
s[s.isin(t[t < 3].index)] = df_sub['reassign']

df['Person'] = pd.Series(pd.factorize(s)[0] + 1).map(str).radd('Person ')

To be honest I am not so sure it works in all cases, but it gives your intended output in the test case.
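
Beyond the single test case, one cheap sanity check (a minimal sketch, assuming the code above has already run) is to count the unique (Place, Area) values behind each final Person label; none should exceed 3:

# every Person should end up with at most 3 unique (Place, Area) values
check = df.assign(tag=df['Place'] + df['Area']).groupby('Person')['tag'].nunique()
print(check)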

Previous attempts

Let's see if I am able to help with a limited understanding of what you are trying to do.

You have sequential data (I'll call them events) and you want to assign a "person" identifier to each event. The identifier you assign to each successive event depends on previous assignments, and it seems to me it needs to be governed by the following rules, applied sequentially:

  1. I know you: I can reuse a previous identifier if the same values for "Place" and "Area" have already appeared under that identifier (does time have something to do with it?).

  2. I do NOT know you: I will create a new identifier if a new value of Area appears (so Place and Area play different roles?).

  3. do I know you?: I might reuse a previously used identifier if that identifier has been assigned fewer than three events (what if this happens for multiple identifiers? I will assume I use the oldest...).

  4. nah, I don't: if none of the preceding rules apply, I will create a new identifier.

Having assumed the above the following is an implementation of a solution:

# dict of lists of past events assigned to each person; key is the person identifier
people = dict()
# new Person column for df (built as a list); appended to the dataframe at the end
persons = list()


# first we define the rules
def i_know_you(people, now):
    def conditions(now, past):
        return [e for e in past if (now.Place == e.Place) and (now.Area == e.Area)]
    i_do = [person for person, past in people.items() if conditions(now, past)]
    if i_do:
        return i_do[0]
    return False


def i_do_not_know_you(people, now):
    conditions = not bool([e for past in people.values() for e in past if e.Area == now.Area])
    if conditions:
        return f'Person {len(people) + 1}'
    return False


def do_i_know_you(people, now):
    i_do = [person for person, past in people.items() if len(past) < 3]
    if i_do:
        return i_do[0]
    return False


# then we process the sequential data
for event in df.itertuples():
    print('event:', event)
    for rule in [i_know_you, i_do_not_know_you, do_i_know_you]:
        person = rule(people, event)
        print('\t', rule.__name__, person)
        if person:
            break
    if not person:
        person = f'Person {len(people) + 1}'
        print('\t', "nah, I don't", person)
    if person in people:
        people[person].append(event)
    else:
        people[person] = [event]
    persons.append(person)

df['Person'] = persons

Output:

event: Pandas(Index=0, Time='8:00:00', Place='House 1', Area='X', Person='Person 1')
     i_know_you False
     i_do_not_know_you Person 1
event: Pandas(Index=1, Time='8:30:00', Place='House 2', Area='X', Person='Person 1')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 1
event: Pandas(Index=2, Time='9:00:00', Place='House 1', Area='Y', Person='Person 2')
     i_know_you False
     i_do_not_know_you Person 2
event: Pandas(Index=3, Time='9:30:00', Place='House 3', Area='X', Person='Person 1')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 1
event: Pandas(Index=4, Time='10:00:00', Place='House 4', Area='X', Person='Person 2')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 2
event: Pandas(Index=5, Time='10:30:00', Place='House 5', Area='X', Person='Person 2')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 2
event: Pandas(Index=6, Time='11:00:00', Place='House 1', Area='X', Person='Person 1')
     i_know_you Person 1
event: Pandas(Index=7, Time='11:30:00', Place='House 6', Area='X', Person='Person 3')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you False
     nah, I don't Person 3
event: Pandas(Index=8, Time='12:00:00', Place='House 7', Area='X', Person='Person 3')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 3
event: Pandas(Index=9, Time='12:30:00', Place='House 8', Area='X', Person='Person 3')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 3

and the final dataframe is, as you want:

       Time    Place Area    Person
0   8:00:00  House 1    X  Person 1
1   8:30:00  House 2    X  Person 1
2   9:00:00  House 1    Y  Person 2
3   9:30:00  House 3    X  Person 1
4  10:00:00  House 4    X  Person 2
5  10:30:00  House 5    X  Person 2
6  11:00:00  House 1    X  Person 1
7  11:30:00  House 6    X  Person 3
8  12:00:00  House 7    X  Person 3
9  12:30:00  House 8    X  Person 3

Remark: Note that I intentionally avoided using grouped-by operations and processed the data sequentially. I think this kind of complexity (and not really understanding what you want to do...) calls for that approach. Also, you can adapt the rules to be more complicated (is time really playing a role or not?) using the same structure above.

Updated answer for new data

Looking at the new data, it is evident I did not understand what you are trying to do (in particular, the assignment does not seem to follow sequential rules). I have a solution that works on your second dataset, but it gives a different result for the first dataset.

The solution is much simpler and will add a column (that you can drop later if you want):

df["tag"] = df["Place"] + df["Area"]
tags = list(df.tag.unique())
f = lambda x: f'Person {tags.index(x) // 3 + 1}'
df['Person'] = df.tag.apply(f)

On the second dataset, it would give:

       Time    Place Area       tag    Person
0   8:00:00  House 1    X  House 1X  Person 1
1   8:30:00  House 2    X  House 2X  Person 1
2   9:00:00  House 3    X  House 3X  Person 1
3   9:30:00  House 1    Y  House 1Y  Person 2
4  10:00:00  House 1    Z  House 1Z  Person 2
5  10:30:00  House 1    V  House 1V  Person 2

On the first dataset it gives:

       Time    Place Area       tag    Person
0   8:00:00  House 1    X  House 1X  Person 1
1   8:30:00  House 2    X  House 2X  Person 1
2   9:00:00  House 1    Y  House 1Y  Person 1
3   9:30:00  House 3    X  House 3X  Person 2
4  10:00:00  House 4    X  House 4X  Person 2
5  10:30:00  House 5    X  House 5X  Person 2
6  11:00:00  House 1    X  House 1X  Person 1
7  11:30:00  House 6    X  House 6X  Person 3
8  12:00:00  House 7    X  House 7X  Person 3
9  12:30:00  House 8    X  House 8X  Person 3

This is different from your intended output at indexes 2 and 3. Does this output meet your requirement? If not, why not?

pietroppeter
  • Thanks Pietro. This is pretty close. I think there needs to be a 5th rule: something that determines how many values are associated with each identifier, and if fewer than 3, looks to combine. I'll add a df at the bottom of the question using your code and what it should be. –  Sep 26 '18 at 09:49
  • You are welcome. I will look into the edited question and try to provide an updated answer. – pietroppeter Sep 26 '18 at 09:55
  • Thanks Pietro. I need something that reassigns the 'leftovers' after the initial step. Your process just groups in 3 now regardless of the first step –  Sep 28 '18 at 00:22
  • Code adapted to reassign only leftovers from first step. – pietroppeter Sep 30 '18 at 23:05

As far as I understand, you're happy with everything before the Person allocation. So here's a plug-and-play solution that merges Persons with fewer than 3 unique values so each Person ends up with 3 unique values (except, obviously, the last one). It is based on the second-to-last df you posted ("Output:"); Persons that already have 3 unique values are left untouched and only the others are merged.

EDIT: Greatly simplified code. Again, taking your df as input:

n = 3  # max unique values per person

df['complete'] = df.Person.apply(lambda x: 1 if df.Person.tolist().count(x) == n else 0)
df['num'] = df.Person.str.replace('Person ', '')
df.sort_values(by=['num', 'complete'], ascending=True, inplace=True)  # get all persons that are complete to the top

c = 0
person_numbers = []
for x in range(0, 999):  # create the numbering [1,1,1,2,2,2,3,3,3,...]; n defines how often a person is 'repeated'
    if x % n == 0:
        c += 1
    person_numbers.append(c)

df['Person_new'] = person_numbers[0:len(df)]  # add the numbering to the df
df.Person = 'Person ' + df.Person_new.astype(str)  # fill the Person column with the new numbering
df.drop(['complete', 'Person_new', 'num'], axis=1, inplace=True)
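
As a side note, the person_numbers loop above just produces the sequence 1,1,1,2,2,2,...; an equivalent one-liner (same result, sized exactly to the frame) would be:

# same [1,1,1,2,2,2,...] numbering, built directly with integer division
person_numbers = [i // n + 1 for i in range(len(df))]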
David
  • Thanks @David. I'm getting an error though `df.Person = df.Person.str.replace(p_in_df[c],p) NameError: name 'c' is not defined` –  Sep 26 '18 at 23:28
  • Thanks for the feedback @PeterJames123. I rethought the code, see my edit. As before, this works if your previous allocation is correct. The n variable gives you the flexibility to change the maximum number of values for one person. – David Sep 27 '18 at 07:57

Firstly, this answer does not keep to your requirement to only reassign leftovers (so I don't expect you to accept it). That said, I'm posting it anyway because your time-window constraint was tricky to solve within a pandas world. Perhaps my solution will not be useful to you right now, but maybe later ;) At the very least it was a learning experience for me, so perhaps others can gain from it.

import pandas as pd
from datetime import datetime, time, timedelta
import random

# --- helper functions for demo

random.seed( 0 )

def makeRandomTimes( nHours = None, mMinutes = None ):
    nHours = 10 if nHours is None else nHours
    mMinutes = 3 if mMinutes is None else mMinutes
    times = []
    for _ in range(nHours):
        hour = random.randint(8,18)
        for _ in range(mMinutes):
            minute = random.randint(0,59)
            times.append( datetime.combine( datetime.today(), time( hour, minute ) ) )
    return times

def makeDf():
    times   = makeRandomTimes()
    houses  = [ str(random.randint(1,10)) for _ in range(30) ]
    areas   = [ ['X','Y'][random.randint(0,1)] for _ in range(30) ]
    df      = pd.DataFrame( {'Time' : times, 'House' : houses, 'Area' : areas } )
    return df.set_index( 'Time' ).sort_index()

# --- real code begins

def evaluateLookback( df, idx, dfg ):

    mask = df.index >= dfg.Lookback.iat[-1]
    personTotals = df[ mask ].set_index('Loc')['Person'].value_counts()
    currentPeople = set(df.Person[ df.Person > -1 ]) 
    noAllocations = currentPeople - set(personTotals.index)
    available = personTotals < 3
    if noAllocations or available.sum():
        # allocate to first available person
        person = min( noAllocations.union(personTotals[ available ].index) )
    else:
        # allocate new person
        person = len( currentPeople )
    df.Person.at[ idx ] = person
    # debug
    df.Verbose.at[ idx ] = ( noAllocations, available.sum() )


def lambdaProxy( df, colName ):
    [ dff[1][colName].apply( lambda f: f(df,*dff) ) for dff in df.groupby(df.index) ]


lookback = timedelta( minutes = 120 )

df1 = makeDf()
df1[ 'Loc' ] = df1[ 'House' ] + df1[ 'Area' ]
df1[ 'Person' ] = None
df1[ 'Lambda' ] = evaluateLookback
df1[ 'Lookback' ] = df1.index - lookback
df1[ 'Verbose' ] = None
lambdaProxy( df1, 'Lambda' )

print( df1[ [ col for col in df1.columns if col != 'Lambda' ] ] )

And the sample output on my machine looks like this:

                    House Area  Loc Person            Lookback         Verbose
Time
2018-09-30 08:16:00     6    Y   6Y      0 2018-09-30 06:16:00         ({}, 0)
2018-09-30 08:31:00     4    Y   4Y      0 2018-09-30 06:31:00         ({}, 1)
2018-09-30 08:32:00    10    X  10X      0 2018-09-30 06:32:00         ({}, 1)
2018-09-30 09:04:00     4    X   4X      1 2018-09-30 07:04:00         ({}, 0)
2018-09-30 09:46:00    10    X  10X      1 2018-09-30 07:46:00         ({}, 1)
2018-09-30 09:57:00     4    X   4X      1 2018-09-30 07:57:00         ({}, 1)
2018-09-30 10:06:00     1    Y   1Y      2 2018-09-30 08:06:00         ({}, 0)
2018-09-30 10:39:00    10    X  10X      0 2018-09-30 08:39:00        ({0}, 1)
2018-09-30 10:48:00     7    X   7X      0 2018-09-30 08:48:00         ({}, 2)
2018-09-30 11:08:00     1    Y   1Y      0 2018-09-30 09:08:00         ({}, 3)
2018-09-30 11:18:00     2    Y   2Y      1 2018-09-30 09:18:00         ({}, 2)
2018-09-30 11:32:00     9    X   9X      2 2018-09-30 09:32:00         ({}, 1)
2018-09-30 12:22:00     5    Y   5Y      1 2018-09-30 10:22:00         ({}, 2)
2018-09-30 12:30:00     9    X   9X      1 2018-09-30 10:30:00         ({}, 2)
2018-09-30 12:34:00     6    X   6X      2 2018-09-30 10:34:00         ({}, 1)
2018-09-30 12:37:00     1    Y   1Y      2 2018-09-30 10:37:00         ({}, 1)
2018-09-30 12:45:00     4    X   4X      0 2018-09-30 10:45:00         ({}, 1)
2018-09-30 12:58:00     8    X   8X      0 2018-09-30 10:58:00         ({}, 1)
2018-09-30 14:26:00     7    Y   7Y      0 2018-09-30 12:26:00         ({}, 3)
2018-09-30 14:48:00     2    X   2X      0 2018-09-30 12:48:00     ({1, 2}, 1)
2018-09-30 14:50:00     8    X   8X      1 2018-09-30 12:50:00     ({1, 2}, 0)
2018-09-30 14:53:00     8    Y   8Y      1 2018-09-30 12:53:00        ({2}, 1)
2018-09-30 14:56:00     6    X   6X      1 2018-09-30 12:56:00        ({2}, 1)
2018-09-30 14:58:00     9    Y   9Y      2 2018-09-30 12:58:00        ({2}, 0)
2018-09-30 17:09:00     2    Y   2Y      0 2018-09-30 15:09:00  ({0, 1, 2}, 0)
2018-09-30 17:19:00     4    X   4X      0 2018-09-30 15:19:00     ({1, 2}, 1)
2018-09-30 17:57:00     6    Y   6Y      0 2018-09-30 15:57:00     ({1, 2}, 1)
2018-09-30 18:21:00     3    X   3X      1 2018-09-30 16:21:00     ({1, 2}, 0)
2018-09-30 18:30:00     9    X   9X      1 2018-09-30 16:30:00        ({2}, 1)
2018-09-30 18:35:00     8    Y   8Y      1 2018-09-30 16:35:00        ({2}, 1)
>>>

Notes:

  • the lookback variable controls the length of time looking backwards to consider locations as allocated to a person
  • the Lookback column shows the cutoff time
  • evaluateLookback is called repeatedly for each row in the table, with df being the whole DataFrame, idx the current index/label, and dfg the current row.
  • lambdaProxy controls the calling of evaluateLookback.
  • the number of locations per person is set to 3 but that could be adjusted as needed
  • arbitrarily complex requirements for the lookback period can be managed by having another func column which is first evaluated by lambdaProxy and then that result stored and used within evaluateLookback

There are some interesting edge cases in the demo output: 10:39:00, 14:48:00, 17:09:00.


Aside: it would be interesting to see a "functions column" in pandas, perhaps with memoize-like capability? Ideally, the 'Person' column would take a function and calculate on request, either with its own row or with some variable window view. Has anyone seen something like that?
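
One rough way to approximate that idea (a hypothetical sketch, not an existing pandas feature) is to keep the per-row computation behind a memoized function and fill the column on request with functools.lru_cache:

import pandas as pd
from functools import lru_cache

df = pd.DataFrame({'x': [1, 2, 3]})

@lru_cache(maxsize=None)
def compute(idx):
    # placeholder per-row computation; a real version could read a window of rows
    return df.at[idx, 'x'] * 2

# evaluate on request; repeated lookups hit the cache instead of recomputing
df['Result'] = [compute(i) for i in df.index]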

jamesj629

How about this for step 2:

import numpy as np

def reduce_df(df):
    values = df['Area'] + df['Place']
    df1 = df.loc[~values.duplicated(),:] # ignore duplicate values for this part..
    person_count = df1.groupby('Person')['Person'].agg('count')
    leftover_count = person_count[person_count < 3] # the 'leftovers'

    # try merging pairs together
    nleft = leftover_count.shape[0]
    to_try = np.arange(nleft - 1)
    to_merge = (leftover_count.values[to_try] + 
                leftover_count.values[to_try + 1]) <= 3
    to_merge[1:] = to_merge[1:] & ~to_merge[:-1]
    to_merge = to_try[to_merge]
    merge_dict = dict(zip(leftover_count.index.values[to_merge+1], 
                    leftover_count.index.values[to_merge]))
    def change_person(p):
        if p in merge_dict.keys():
            return merge_dict[p]
        return p
    reduced_df = df.copy()
    # update df with the merges you found
    reduced_df['Person'] = reduced_df['Person'].apply(change_person)
    return reduced_df

print(
    reduce_df(reduce_df(df)) # call twice in case 1,1,1 -> 2,1 -> 3
)

The output:

Area    Place      Time    Person
0    X  House 1   8:03:00  Person 1
1    X  House 2   8:17:00  Person 1
2    Y  House 1   8:20:00  Person 2
3    X  House 3  10:15:00  Person 1
4    X  House 4  10:15:00  Person 2
5    X  House 5  11:48:00  Person 2
6    X  House 1  12:00:00  Person 1
7    X  House 1  12:10:00  Person 1
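
If longer merge chains than 1,1,1 -> 2,1 -> 3 were possible, one generalization (a sketch of mine, not part of the answer above) is to repeat reduce_df until the Person column stops changing:

# iterate reduce_df to a fixed point instead of hard-coding two passes
result = df
while True:
    merged = reduce_df(result)
    if merged['Person'].equals(result['Person']):
        break
    result = merged
print(result)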
Yosi Hammer