4

I have the following dataframe :

 date_time             value     member
2013-10-09 09:00:00  664639  Jerome
2013-10-09 09:05:00  197290  Hence
2013-10-09 09:10:00  470186  Ann
2013-10-09 09:15:00  181314  Mikka
2013-10-09 09:20:00  969427  Cristy
2013-10-09 09:25:00  261473  James
2013-10-09 09:30:00  003698  Oliver

and the second dataframe where I have the bounds like :

   date_start            date_end
2013-10-09 09:19:00         2013-10-09 09:25:00
2013-10-09 09:25:00         2013-10-09 09:40:00 

so I need to create a new column where I will write the index of each interval between two datetime points:

smth like:

date_time             value     member    session
2013-10-09 09:00:00  664639  Jerome        1
2013-10-09 09:05:00  197290  Hence         1
2013-10-09 09:10:00  470186  Ann            1
2013-10-09 09:15:00  181314  Mikka          2
2013-10-09 09:20:00  969427  Cristy         2
2013-10-09 09:25:00  261473  James          2
2013-10-09 09:30:00  003698  Oliver         2

the following code creates the column 'session', but doesn't write the index of session (i.e. index of row in bounds dataframe) in 'session' column, so don't separate the initial dataframe on intervals:

def create_interval():
    df['session']=''
    for index, row in bounds.iterrows():
        s = row['date_start']
        e = row['date_end']
        mask=(df['date'] > s) & (df['date'] < e)
        df.loc[mask]['session']='[index]'

    return df

UPDATE

problem that code bounds['date_start'].searchsorted(df['date_time']) doesn't give the result I want to obtain, i.e. one index value for every interval: df['Session'] = 1 for first interval, =2 for second and so on. Columns Session is aimed to separate different intervals that lye in between date_start and date_end of bounds I suppose that if df['date_time'] not the same that bounds['start_date'] it already increments index for session, that not exactly what I'm looking for

  • I don't understand why the session value is 1 for only rows up to time `09:15:00` when the `date_start` in `bounds` begins at `09:19:00`? – EdChum Oct 14 '15 at 09:45
  • @EdChum, sorry, I just gave the example of structure of dataframe I have, not the exact value! if you need the exact value to come up with answer, don't hesitate to ask me for that! –  Oct 14 '15 at 09:49
  • [This](http://stackoverflow.com/q/17559885/2325172) question and answers could probably help you. – TobiMarg Oct 14 '15 at 09:50
  • 1
    Please post the incorrect output and the desired output, English has subtlety in meanings that code does not – EdChum Oct 14 '15 at 13:07
  • @EdChum I posted the incorrect output and the desired output, please let me know if you need smth else –  Oct 14 '15 at 13:30
  • Does my first solution work though? ` df['date_time'].apply(lambda x: np.searchsorted(bounds['date_start'], x)[0])`? – EdChum Oct 14 '15 at 13:32
  • @EdChum, it has the same incorrect output that I posted above –  Oct 14 '15 at 13:34
  • I don't understand why it doesn't work for you, it works fine for me – EdChum Oct 14 '15 at 13:37
  • @EdChurn so you are sure ,for each interval it assigns a unique index? 1 for fist interval, 2 for second, 3 for third and so on (interval is the rows between date_start and date_end of ```bounds```)? –  Oct 14 '15 at 13:40
  • 1
    like this: `In [306]: df df['Session'] = bounds['date_start'].searchsorted(df['date_time']) df Out[306]: date_time Session 0 2015-07-30 10:32:54 0 1 2015-07-30 10:32:54 0 2 2015-07-30 10:36:39 1 3 2015-07-30 10:36:39 1 4 2015-07-30 10:36:39 1 5 2015-07-30 10:36:39 1 6 2015-07-30 10:37:00 1 7 2015-07-30 10:37:00 1 8 2015-07-30 10:37:00 1 9 2015-07-30 10:37:00 1` same as your desired output – EdChum Oct 14 '15 at 13:42
  • @EdChum, I understood finally what happens! you are right, code works, but actually this method doesn't include the start_date itself into interval, so it disturbed me! do you know how I can include the start_date in each interval? –  Oct 14 '15 at 14:24
  • 1
    that's the default behaviour of searchsorted it it matches then it the index is the position before the current index value if you want it to fall into the next bin value then you'd need to widen the start interval perhaps so subtract 1 second from it or similar – EdChum Oct 14 '15 at 14:32

1 Answers1

2

I'm assuming you want the actual index location (zero-based), you can call apply on your 'date_time' column and call np.searchsorted to find the index location of where in bounds df it falls in:

In [266]:
df['Session'] = df['date_time'].apply(lambda x: np.searchsorted(bounds['date_start'], x)[0])
df

Out[266]:
            date_time   value  member  Session
0 2013-10-09 09:00:00  664639  Jerome        0
1 2013-10-09 09:05:00  197290   Hence        0
2 2013-10-09 09:10:00  470186     Ann        0
3 2013-10-09 09:15:00  181314   Mikka        0
4 2013-10-09 09:20:00  969427  Cristy        1
5 2013-10-09 09:25:00  261473   James        1
6 2013-10-09 09:30:00    3698  Oliver        2

EDIT

@Jeff has pointed out that apply is unnecessary here and of course he's right, this will be much faster:

In [293]:
df['session'] = bounds['date_start'].searchsorted(df['date_time']) 
df

Out[293]:
            date_time   value  member  session
0 2013-10-09 09:00:00  664639  Jerome        0
1 2013-10-09 09:05:00  197290   Hence        0
2 2013-10-09 09:10:00  470186     Ann        0
3 2013-10-09 09:15:00  181314   Mikka        0
4 2013-10-09 09:20:00  969427  Cristy        1
5 2013-10-09 09:25:00  261473   James        1
6 2013-10-09 09:30:00    3698  Oliver        2
EdChum
  • 376,765
  • 198
  • 813
  • 562
  • 2
    No need to use apply here: ```In [40]: bounds['date_start'].searchsorted(df['date_time']) Out[40]: array([0, 0, 0, 0, 1, 1, 2])``` – Jeff Oct 14 '15 at 10:34
  • 1
    Further, technically this should be bounds between the start/end times (e.g. searchsorted would return -1 if its out of bounds), but that's a detail that the OP didn't present data for. – Jeff Oct 14 '15 at 11:01
  • @Jeff yes that's true the problem domain has not been fully described but the OP seems happy with the behaviour here, `-1` to mark OOB is not a necessarily bad thing though – EdChum Oct 14 '15 at 11:21
  • it doesn't work for me when I verified, actually I need have an index incremented +1 for every session, but if, for example, in ```bounds``` dataframe I have the following two first date_start: 2015-07-30 10:32:54.000 , 2015-07-30 11:58:57.000 , after applying your code , I have a incremented index in ```Session``` already for 2015-07-30 10:36:39.000 in ```df ``` dataframe , that comes next to 2015-07-30 10:32:54.000 , i.e. it means that it should be the second interval despite the fact it stay int the same interval according the ```bounds``` dataframe –  Oct 14 '15 at 12:42
  • So why not just add `1`? `df['session'] = bounds['date_start'].searchsorted(df['date_time']) + 1`? – EdChum Oct 14 '15 at 12:44
  • Also edit into your question what your updated requirements are with an explanation, formatting is lost in comments – EdChum Oct 14 '15 at 12:50
  • it's not the point, adding 1 just replaces 0 with 1, 1 with 2, but my problem that ```bounds['date_start'].searchsorted(df['date_time'])``` doesn't give the result I want to obtain, i.e. : Session = 1 for first interval, =2 for second and so on. Columns ```Session``` is aimed to separate different inervales that lye in between date_start and date_end of ```bounds``` –  Oct 14 '15 at 12:56
  • Again I ask you, edit your question, also show why my code snippet doesn't work – EdChum Oct 14 '15 at 12:58
  • @EdChum, sorry, haven't seen that comment, done updated! –  Oct 14 '15 at 13:06