
Starting with this sample data...

import pandas as pd

start_data = {"person_id": [1, 1, 1, 1, 2], "nid": [1, 2, 3, 4, 1],
              "beg": ["Jan 1 2018", "Jan 5 2018", "Jan 10 2018", "Feb 5 2018", "Jan 25 2018"],
              "end": ["Feb 1 2018", "Mar 4 2018", "", "Oct 18 2018", "Nov 10 2018"]}
df = pd.DataFrame(start_data)
df["beg"] = pd.to_datetime(df["beg"])
df["end"] = pd.to_datetime(df["end"])

Starting point:

   person_id  nid        beg        end
0          1    1 2018-01-01 2018-02-01
1          1    2 2018-01-05 2018-03-04
2          1    3 2018-01-10        NaT
3          1    4 2018-02-05 2018-10-18
4          2    1 2018-01-25 2018-11-10

Goal output:

person_id date       1 2 3 4
        1 2018-01-01 1 0 0 0
        1 2018-01-05 1 1 0 0
        1 2018-01-10 1 1 1 0
        1 2018-02-01 0 1 1 0
        1 2018-02-05 0 1 1 1
        1 2018-03-04 0 0 1 1
        1 2018-10-18 0 0 1 0 
        2 2018-01-25 1 0 0 0
        2 2018-11-10 0 0 0 0

I am trying to tie all active nids to the associated person_id. This will then be joined to another dataframe based on the latest date less than a dated activity column, and finally it will be part of the input into a predictive model.
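Purely as an illustration of that downstream step (the activity table, snapshot_df, and these names are made up, not part of the question), the "latest date less than an activity date" join could be a backward pd.merge_asof:

# hypothetical activity data and a tiny stand-in for the goal output above
activity_df = pd.DataFrame({"person_id": [1, 2],
                            "activity_date": pd.to_datetime(["Feb 10 2018", "Dec 1 2018"])})
snapshot_df = pd.DataFrame({"person_id": [1, 1, 2],
                            "date": pd.to_datetime(["Jan 1 2018", "Feb 5 2018", "Jan 25 2018"]),
                            1: [1, 0, 1], 3: [0, 1, 0]})

# for each activity, keep the latest snapshot whose date is <= the activity date
joined = pd.merge_asof(activity_df.sort_values("activity_date"),
                       snapshot_df.sort_values("date"),
                       left_on="activity_date", right_on="date",
                       by="person_id", direction="backward")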

Doing something like pd.get_dummies(df["nid"]) gets this output:

   1  2  3  4
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0

So this needs to be moved to a different index representing effective date, grouped by person_id, and then aggregated to match the goal output.

Special bonus to anyone who can come up with an approach that properly leverages Dask; that is what we are using for other parts of the pipeline because of its scalability. This may be a pipe dream, but I thought I would send it out to see what comes back.

Chris Farr

4 Answers


The question is hard; I can only think of NumPy broadcasting to speed up the for loop:

s = df.set_index('person_id')[['beg', 'end']].stack()
l = []
for x, y in df.groupby('person_id'):
    # treat a missing end date as the latest end date within the group
    y = y.fillna({'end': y.end.max()})
    s1 = y.beg.values
    s2 = y.end.values
    t = s.loc[x].values
    # broadcast: a nid is active on date t when beg <= t < end
    l.append(pd.DataFrame(((s1 - t[:, None]).astype(float) <= 0) & ((s2 - t[:, None]).astype(float) > 0),
                          columns=y.nid, index=s.loc[[x]].index))
s = pd.concat([s, pd.concat(l).fillna(0).astype(int)], axis=1).reset_index(level=0).sort_values(['person_id', 0])
s
Out[401]: 
     person_id          0  1  2  3  4
beg          1 2018-01-01  1  0  0  0
beg          1 2018-01-05  1  1  0  0
beg          1 2018-01-10  1  1  1  0
end          1 2018-02-01  0  1  1  0
beg          1 2018-02-05  0  1  1  1
end          1 2018-03-04  0  0  1  1
end          1 2018-10-18  0  0  0  0
beg          2 2018-01-25  1  0  0  0
end          2 2018-11-10  0  0  0  0
BENY

Similar to @WenYoBen's approach, a little different in broadcasting and return:

def onehot(group):
    pid, g = group

    # treat a missing end date as the latest end date within the group
    ends = g.end.fillna(g.end.max())
    begs = g.beg

    # every distinct begin/end date for this person, in order
    days = pd.concat((ends, begs)).sort_values().unique()

    # a nid is active on a given day when beg <= day < end
    ret = pd.DataFrame((days[:, None] < ends.values) & (days[:, None] >= begs.values),
                       columns=g.nid)
    ret['person_id'] = pid
    return ret


new_df = pd.concat([onehot(group) for group in df.groupby('person_id')], sort=False)
new_df.fillna(0).astype(int)

Output:

    1   2   3   4   person_id
0   1   0   0   0   1
1   1   1   0   0   1
2   1   1   1   0   1
3   0   1   1   0   1
4   0   1   1   1   1
5   0   0   1   1   1
6   0   0   0   0   1
0   1   0   0   0   2
1   0   0   0   0   2
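The effective dates themselves are not carried into this output. A small, untested variation of the function above keeps them as a column (onehot_with_dates is just an illustrative name, not part of the original answer):

def onehot_with_dates(group):
    # same logic as onehot above, but the effective date of each row is kept as a column
    pid, g = group
    ends = g.end.fillna(g.end.max())
    begs = g.beg
    days = pd.concat((ends, begs)).sort_values().unique()
    ret = pd.DataFrame((days[:, None] < ends.values) & (days[:, None] >= begs.values),
                       columns=g.nid)
    ret.insert(0, 'date', days)
    ret['person_id'] = pid
    return ret

new_df = pd.concat([onehot_with_dates(g) for g in df.groupby('person_id')], sort=False)
nid_cols = [c for c in new_df.columns if c not in ('date', 'person_id')]
new_df[nid_cols] = new_df[nid_cols].fillna(0).astype(int)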
Quang Hoang

Here is a function that one-hot-encodes data based on a beg_col and end_col effective date range. One edge case to look out for is multiple beginning effective dates for the same target column. You can add some clever filtering to the function to handle that (a sketch of one option follows the usage example below), but I'll just keep the simple version here.

def effective_date_range_one_hot_encode(x, beg_col="beg", end_col="end", target="nid"):
    # +1 on the date each nid becomes effective
    pos_change = x.loc[:, [beg_col, target]]
    pos_change = pos_change.set_index(beg_col)
    pos_change = pd.get_dummies(pos_change[target])

    # -1 on the date each nid expires
    neg_change = x.loc[:, [end_col, target]]
    neg_change = neg_change.set_index(end_col)
    neg_change = pd.get_dummies(neg_change[target]) * -1

    # running sum over the sorted effective dates gives the active flags
    changes = pd.concat([pos_change, neg_change])

    changes = changes.sort_index()
    changes = changes.cumsum()

    return changes


new_df = df.groupby("person_id").apply(effective_date_range_one_hot_encode).fillna(0).astype(int)
new_df.index = new_df.index.set_names(["person_id", "date"])
new_df = new_df.reset_index()
new_df = new_df.dropna(subset=["date"], how="any")
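One hypothetical way to handle the multiple-begin-dates edge case mentioned above (not part of the original function) is to clip the running sum so a nid is never counted more than once; a rough sketch:

def effective_date_range_one_hot_encode_clipped(x, beg_col="beg", end_col="end", target="nid"):
    # same idea as above, but the running sum is clipped to the 0..1 range so a nid
    # with several overlapping begin dates still shows up as a plain 0/1 flag
    pos_change = pd.get_dummies(x.set_index(beg_col)[target])
    neg_change = pd.get_dummies(x.set_index(end_col)[target]) * -1
    changes = pd.concat([pos_change, neg_change]).sort_index().cumsum()
    return changes.clip(lower=0, upper=1)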

The function can be applied using .groupby(), and if you need this to be run in a distributed environment, you can utilize the .map_partitions() function in Dask. Just first set your index to the column you plan to group by, then create a helper function to reset the index.
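A minimal, untested sketch of that Dask route, assuming the function above is importable and that setting the index to person_id keeps each person's rows inside a single partition:

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=4).set_index("person_id")

def encode_partition(pdf):
    # pdf is a plain pandas DataFrame holding whole person_id groups
    out = pdf.groupby("person_id").apply(effective_date_range_one_hot_encode)
    out = out.fillna(0).astype(int)
    out.index = out.index.set_names(["person_id", "date"])
    return out.reset_index().dropna(subset=["date"])

# in practice you may need to pass an explicit `meta` so Dask knows the output columns
new_df = ddf.reset_index().map_partitions(encode_partition).compute()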

Output

   person_id           date  1  2  3  4
0          1     2018-01-01  1  0  0  0
1          1     2018-01-05  1  1  0  0
2          1     2018-01-10  1  1  1  0
3          1     2018-02-01  0  1  1  0
4          1     2018-02-05  0  1  1  1
5          1     2018-03-04  0  0  1  1
6          1     2018-10-18  0  0  1  0
8          2     2018-01-25  1  0  0  0
9          2     2018-11-10  0  0  0  0
Chris Farr

A bit late for the OP, but this should help others who have this problem. I came across a very similar issue and solved it in the following way.

Original Data from OP:

start_data = {"person_id": [1, 1, 1, 1, 2], "nid": [1, 2, 3, 4, 1],
              "beg": ["Jan 1 2018", "Jan 5 2018", "Jan 10 2018", "Feb 5 2018", "Jan 25 2018"],
              "end": ["Feb 1 2018", "Mar 4 2018", "", "Oct 18 2018", "Nov 10 2018"]}
df = pd.DataFrame(start_data)
df["beg"] = pd.to_datetime(df["beg"])
df["end"] = pd.to_datetime(df["end"])

Proposed Solution:

import datetime
from dateutil.rrule import rrule, DAILY

# Create an empty df which we'll append the results to
months_df = pd.DataFrame(columns=['jan', 'feb', 'mar', 'apr', 'may', 'jun',
                                  'july', 'aug', 'sep', 'oct', 'nov', 'dec'])

# Function to loop through a list of dates and drop any that occurred before a certain cutoff
def remove_dates(date_range, date_range2):
    for i in range(0, len(date_range)):
        if date_range[i] > datetime.datetime(2017, 12, 31):
            date_range2.append(date_range[i])
    return date_range2

months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]  # used in the list comprehension below

for i in range(0, len(df)):
    # Build the list of days in each date range (i.e. days between beg and end)
    date_range = [dt for dt in rrule(DAILY, dtstart=df.loc[i, 'beg'],
                                     until=df.loc[i, 'end'])]

    # Remove any dates that occurred before some arbitrary cutoff
    date_range2 = []
    date_range = remove_dates(date_range, date_range2)

    months_list = set([date.month for date in date_range])  # unique months in the range
    months_list = [elem in months_list for elem in months]  # which months of the year are present
    # Append the results for this row to months_df
    months_df = pd.concat([months_df,
                           pd.DataFrame(months_list,
                                        index=['jan', 'feb', 'mar', 'apr', 'may', 'jun',
                                               'july', 'aug', 'sep', 'oct', 'nov', 'dec']).T],
                          ignore_index=False)


df = df.join(months_df.reset_index(drop=True))  # Merge the two dfs

Output

   person_id  nid        beg        end    jan   feb    mar    apr    may  \
0          1    1 2018-01-01 2018-02-01   True  True  False  False  False   
1          1    2 2018-01-05 2018-03-04   True  True   True  False  False   
2          1    3 2018-01-10        NaT   True  True   True   True   True   
3          1    4 2018-02-05 2018-10-18  False  True   True   True   True   
4          2    1 2018-01-25 2018-11-10   True  True   True   True   True   

     jun   july    aug    sep    oct    nov    dec  
0  False  False  False  False  False  False  False  
1  False  False  False  False  False  False  False  
2   True   True   True   True   True   True   True  
3   True   True   True   True   True  False  False  
4   True   True   True   True   True   True  False  

Comments:

  • I included a function remove_dates. This was because I wanted to exclude dates that occurred before some arbitrary cutoff. For example, I was looking at data for the year 2019, but some contracts could have started in 2018 - I wanted to exclude those months in 2018 from being counted in 2019. This function accomplishes that.
  • The DAILY frequency should be reviewed and changed depending on the use case.
  • For the observation with NaT as an end date, I considered that to be TRUE for every month. I wasn't 100% clear how the OP wanted that to be handled. In case the user wants to handle this differently, I would set all blank values to an explicit end date to avoid any unexpected results (a minimal sketch of that follows below).
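A minimal sketch of that last suggestion, using an arbitrary cutoff date that is not part of the original answer:

# replace missing end dates with an explicit cutoff before building the date ranges;
# the date below is an arbitrary placeholder, pick one that fits your data
df["end"] = df["end"].fillna(pd.Timestamp("2018-12-31"))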
shredGnar