
How do I find values (userIDs) that frequently occur together, based on a timestamp?

My question is related to this one: Session generation from log file analysis with pandas. However, my data is already sessionized. I want to go a step further and find users who log in at about the same time, meaning that their 'sessionBegin' values are close together.

Of course we have to set a granularity: let us assume that users whose 'sessionBegin' values are less than 30 minutes apart logged in at the same time.

# my data (a Series with a two-level index):

                         sessionBegin
userID    sessionID

      A            1        2014-5-7 14:15
      A            2        2014-5-8 16:30
      B            3        2014-5-7 20:33
      C            4        2014-5-7 14:20
      C            5        2014-5-7 18:58
      C            5        2014-5-8 16:30
      D            6        2014-5-7 15:01
      D            6        2014-5-8 12:04

In this example there clearly is a co-occurrence (statistical dependence?) between userID A and C.

I was thinking of setting the timestamp as the index and using a rolling window of 30 minutes, but I did not know how to recognize recurring sets of userIDs. Is it possible to recognize not only pairs of userIDs but also larger sets?
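To make the idea concrete, here is a minimal sketch of one possible counting scheme, written in plain Python rather than pandas. It rests on an assumption that is not settled above: a set of users "co-occurs" when all of their sessions begin less than 30 minutes after the earliest session in the set. The function name `co_occurring_sets` and the `max_size` parameter are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta
from itertools import combinations


def co_occurring_sets(events, window=timedelta(minutes=30), max_size=3):
    """Count user sets whose sessions all begin within `window` of the earliest one.

    `events` is an iterable of (userID, sessionBegin) pairs.
    `max_size` caps the size of the reported sets (2 = pairs only).
    """
    events = sorted(events, key=lambda e: (e[1], e[0]))
    counts = Counter()
    for i, (anchor_user, anchor_time) in enumerate(events):
        # Collect the other users whose session begins less than `window`
        # after the anchor session. The list is sorted, so we can stop early.
        others = set()
        for user, begin in events[i + 1:]:
            if begin - anchor_time >= window:
                break
            if user != anchor_user:
                others.add(user)
        # Every set is counted once, anchored at its earliest session.
        for size in range(1, max_size):
            for combo in combinations(sorted(others), size):
                counts[tuple(sorted((anchor_user,) + combo))] += 1
    return counts


# The sample data from the table above (sessionID omitted).
fmt = "%Y-%m-%d %H:%M"
sessions = [(user, datetime.strptime(ts, fmt)) for user, ts in [
    ("A", "2014-5-7 14:15"),
    ("A", "2014-5-8 16:30"),
    ("B", "2014-5-7 20:33"),
    ("C", "2014-5-7 14:20"),
    ("C", "2014-5-7 18:58"),
    ("C", "2014-5-8 16:30"),
    ("D", "2014-5-7 15:01"),
    ("D", "2014-5-8 12:04"),
]]

result = co_occurring_sets(sessions)
print(result)  # Counter({('A', 'C'): 2})
```

On the sample data only the pair (A, C) shows up, twice. Raising `max_size` would also report larger sets, but the number of candidate sets grows quickly with the number of users per window, and a user with several sessions in one window may be counted more than once.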

Thomas
  • I wonder if k-means clustering would be a fit here. You'd have to convert the sessionBegin to a numeric. – Bob Haffner Mar 17 '15 at 23:35
  • Well, timestamps are _numeric_, as we can convert them into Unix epoch time etc. However, each object possesses an arbitrary number of timestamps (dimensions), which, afaik, makes it hard to define a distance measure. Maybe Pearson's coefficient or a regression analysis might help? – Thomas Mar 18 '15 at 15:49
  • I don't think "clustering" is what you are looking for, and k-means for sure will not work. There are some temporal pattern mining algorithms, but I can't give you names. – Has QUIT--Anony-Mousse Mar 18 '15 at 23:40

0 Answers