0

I am sorry if I couldn't word the question title perfectly.

This is actually quite similar to what I have asked previously before here. Which I actually got a solution for, but now I am facing different problem.

So I have the following dataframe

df = pd.DataFrame({'person':['A', 'A', 'B', 'B', 'A', 'C', 'B', 'C'],
                   'datetime':['2018-02-26 10:49:32', '2018-02-26 10:58:03', '2018-02-26 10:51:10',
                               '2018-02-26 10:58:45', '2018-02-26 10:43:34', '2018-02-26 10:49:51',
                              '2018-02-26 10:51:51', '2018-02-26 10:55:10'], 
                   'location':['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})

person  datetime    location
A   2018-02-26 10:49:32 a
A   2018-02-26 10:58:03 b
B   2018-02-26 10:51:10 c
B   2018-02-26 10:58:45 d
A   2018-02-26 10:43:34 e
C   2018-02-26 10:49:51 f
B   2018-02-26 10:51:51 g
C   2018-02-26 10:55:10 h

And if we sort it by person and time, we get

df.sort_values(by=['person', 'datetime'])

person  datetime    location
A   2018-02-26 10:43:34 e
A   2018-02-26 10:49:32 a
A   2018-02-26 10:58:03 b
B   2018-02-26 10:51:10 c
B   2018-02-26 10:51:51 g
B   2018-02-26 10:58:45 d
C   2018-02-26 10:49:51 f
C   2018-02-26 10:55:10 h

Which you can read as person A goes to location 'e', then goes to location 'a', then goes to 'b'. Same with person B, he goes to location 'c', then 'g', then 'd', and so on.

I want to turn it into like this

person  prev_datetime       prev_loc    curr_datetime       curr_loc    next_datetime   next_loc
A       2018-02-26 10:43:34     e       2018-02-26 10:49:32     a       2018-02-26 10:58:03 b
A       2018-02-26 10:49:32     a       2018-02-26 10:58:03     b       2018-02-26 10:43:34 e
B       2018-02-26 10:51:10     c       2018-02-26 10:51:51     g       2018-02-26 10:58:45 d
B       2018-02-26 10:51:51     g       2018-02-26 10:58:45     d       2018-02-26 10:51:10 c
C       2018-02-26 10:49:51     f       2018-02-26 10:55:10     h       2018-02-26 10:49:51 f

So, you see person A goes to location 'e', then goes to location 'a', then goes to 'b', but in the end we make this person return to e again (the beginning).

I applied what I have learned from the previous question I asked and did some modification.

df['datetime'] = pd.to_datetime(df['datetime'])
df1 = df.sort_values(by=['person', 'datetime'])

df1[['curr_datetime','curr_loc']] = df1.groupby('person')['datetime','location'].shift(-1)

d = {'datetime':'prev_datetime','location':'prev_loc'}
df2 = df1[df1['person'].duplicated(keep='last')].rename(columns=d)

Outputs

person  prev_datetime   prev_loc    next_datetime   next_loc
A   2018-02-26 10:43:34 e   2018-02-26 10:49:32 a
A   2018-02-26 10:49:32 a   2018-02-26 10:58:03 b
B   2018-02-26 10:51:10 c   2018-02-26 10:51:51 g
B   2018-02-26 10:51:51 g   2018-02-26 10:58:45 d
C   2018-02-26 10:49:51 f   2018-02-26 10:55:10 h

Then

df2[['next_datetime','next_loc']] = df2.groupby('person')['curr_datetime','curr_loc'].shift(-1)

Outputs

person  prev_datetime   prev_loc    curr_datetime   curr_loc    next_datetime   next_loc
A   2018-02-26 10:43:34 e   2018-02-26 10:49:32 a   2018-02-26 10:58:03 b
A   2018-02-26 10:49:32 a   2018-02-26 10:58:03 b   NaT nan
B   2018-02-26 10:51:10 c   2018-02-26 10:51:51 g   2018-02-26 10:58:45 d
B   2018-02-26 10:51:51 g   2018-02-26 10:58:45 d   NaT nan
C   2018-02-26 10:49:51 f   2018-02-26 10:55:10 h   NaT nan

By default, some of those are Nan, so I did the following,I want to fill those nan values with the min() of the trips by each person.

df2.loc[df2["next_loc"].isnull()][['next_datetime', 'next_loc']] = df2.loc[df2.groupby("person")["prev_datetime"].idxmin()][['prev_datetime', 'prev_loc']]

Which is not working, there is no error, but the dataframe still doesn't show the data as intended (still Nan). I am confused because there is no error message, but it still doesn't work. I don't know if there's better solution than this. Thanks.

catris25
  • 1,173
  • 3
  • 20
  • 40

1 Answers1

1

Use numpy.roll with GroupBy.transform:

df['datetime'] = pd.to_datetime(df['datetime'])
df1 = df.sort_values(by=['person', 'datetime'])

g = df1.groupby('person')['datetime','location']
df1[['curr_datetime','curr_loc']] = g.transform(np.roll, -1)
df1[['next_datetime','next_loc']] = g.transform(np.roll, 1)
print (df1)
  person            datetime location       curr_datetime curr_loc  \
4      A 2018-02-26 10:43:34        e 2018-02-26 10:49:32        a   
0      A 2018-02-26 10:49:32        a 2018-02-26 10:58:03        b   
1      A 2018-02-26 10:58:03        b 2018-02-26 10:43:34        e   
2      B 2018-02-26 10:51:10        c 2018-02-26 10:51:51        g   
6      B 2018-02-26 10:51:51        g 2018-02-26 10:58:45        d   
3      B 2018-02-26 10:58:45        d 2018-02-26 10:51:10        c   
5      C 2018-02-26 10:49:51        f 2018-02-26 10:55:10        h   
7      C 2018-02-26 10:55:10        h 2018-02-26 10:49:51        f   

        next_datetime next_loc  
4 2018-02-26 10:58:03        b  
0 2018-02-26 10:43:34        e  
1 2018-02-26 10:49:32        a  
2 2018-02-26 10:58:45        d  
6 2018-02-26 10:51:10        c  
3 2018-02-26 10:51:51        g  
5 2018-02-26 10:55:10        h  
7 2018-02-26 10:49:51        f  

d = {'datetime':'prev_datetime','location':'prev_loc'}
df2 = df1[df1['person'].duplicated(keep='last')].rename(columns=d)

print (df2)
  person       prev_datetime prev_loc       curr_datetime curr_loc  \
4      A 2018-02-26 10:43:34        e 2018-02-26 10:49:32        a   
0      A 2018-02-26 10:49:32        a 2018-02-26 10:58:03        b   
2      B 2018-02-26 10:51:10        c 2018-02-26 10:51:51        g   
6      B 2018-02-26 10:51:51        g 2018-02-26 10:58:45        d   
5      C 2018-02-26 10:49:51        f 2018-02-26 10:55:10        h   

        next_datetime next_loc  
4 2018-02-26 10:58:03        b  
0 2018-02-26 10:43:34        e  
2 2018-02-26 10:58:45        d  
6 2018-02-26 10:51:10        c  
5 2018-02-26 10:55:10        h  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • amazing! thank you! but in the case of there is only one trip done by a person, like on the last row of df2, you got f -> h -> h, I am thinking that it should be f -> h -> f. How do I do that? – catris25 Mar 30 '20 at 05:34
  • @catris25 - I try debug code and if rolling is count from `prev_loc` then it is not `h-f`, because original `f-h`? Because for `A` and `B` group it is with this logic, for `C` group not? In another words, if 2 values in some group, same order like `prev_loc` column? – jezrael Mar 30 '20 at 05:52
  • Why did you edited the code from shift() to transform()? Is there any difference in result? @jezrael – catris25 Mar 30 '20 at 06:10
  • @catris25 - It working both ways for `df1[['curr_datetime','curr_loc']]`, but still not sure what `C` group. If same logic like `A,B` output should be `h-f` – jezrael Mar 30 '20 at 06:12