
I have a pretty inefficient way of converting my input logs into an input dataset. I have an SFrame sf with the following columns:

user_id     int
timestamp   datetime.datetime
action      int
reasoncode  str

The action column takes one of 9 values, ranging from 1 to 9.

Every user_id can perform more than one action, each possibly more than once.

I am trying to obtain all unique user_ids from sf and build one output row per user (the list l1 below) in the following manner:

y = 225
dte = datetime.datetime.now()  # reference date for computing day differences

def calc_class(a, x):
    diffd = a['timestamp'].apply(lambda t: (dte - t).days)
    g = 0
    b = 0
    for i in diffd:
        if i > y:
            g += 1
        else:
            b += 1
    if b >= x:
        return 4
    elif b != 0:
        return 3
    elif g >= 0:
        return 2
    else:
        return 1

l1 = []
ids = sf['user_id'].unique()

for idd in ids:
    temp = sf[sf['user_id'] == idd]
    zero1 = temp[temp['action'] == 1]
    zero2 = temp[temp['action'] == 2]
    zero3 = temp[temp['action'] == 3]
    zero4 = temp[temp['action'] == 4]
    zero5 = temp[temp['action'] == 5]
    zero6 = temp[temp['action'] == 6]
    zero7 = temp[temp['action'] == 7]
    zero8 = temp[temp['reasoncode'] == 'xyz']
    zero9 = temp[temp['reasoncode'] == 'abc']
    # clas1 to clas9 come from calc_class for each action;
    # each is an integer ranging from 1 to 4.
    clas1 = calc_class(zero1, 2)
    clas2 = calc_class(zero2, 2)
    clas3 = calc_class(zero3, 2)
    clas4 = calc_class(zero4, 2)
    clas5 = calc_class(zero5, 2)
    clas6 = calc_class(zero6, 2)
    clas7 = calc_class(zero7, 2)
    clas8 = calc_class(zero8, 2)
    clas9 = calc_class(zero9, 2)
    l1.append([idd, clas1, clas2, clas3, clas4,
               -clas5, -clas6, -clas7, -clas8, clas9])

I wanted to know if this is the fastest way of doing this. Specifically, whether it is possible to do the same thing without generating the zero1 to zero9 SFrames.
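For reference, the classification logic in calc_class can be sketched in plain Python on an ordinary list of day differences (no SFrame needed); the sample inputs below are made up for illustration. Note that with this logic the final `return 1` branch is unreachable, since `g >= 0` always holds:

```python
y = 225  # age threshold in days

def calc_class(day_diffs, x):
    # Count "old" events (more than y days ago) vs. "recent" ones.
    g = sum(1 for d in day_diffs if d > y)
    b = len(day_diffs) - g
    if b >= x:
        return 4
    elif b != 0:
        return 3
    elif g >= 0:
        return 2
    else:
        return 1

print(calc_class([300, 10, 20], 2))  # two recent events -> 4
print(calc_class([300, 10], 2))      # one recent event  -> 3
print(calc_class([300, 400], 2))     # only old events   -> 2
```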

An example sf:

user_id timestamp action reasoncode 
574 23/09/15 12:43  1   None
574 23/09/15 11:15  2   None
574 06/10/15 11:20  2   None
574 06/10/15 11:21  3   None
588 04/11/15 10:00  1   None
588 05/11/15 10:00  1   None
555 15/12/15 13:00  1   None
585 22/12/15 17:30  1   None
585 15/01/16 07:44  7   xyz
588 06/01/16 08:10  7   abc

l1 corresponding to the above sf:

574 1   2   2   0   0   0   0   0   0
588 3   0   0   0   0   0   0   0   3
555 3   0   0   0   0   0   0   0   0
585 3   0   0   0   0   0   0   3   0
Parisa Rai
  • From this code snippet, it's hard to understand what you're trying to do. Could you be a bit more detailed in describing your goal, and maybe show small examples of the input and output data? – papayawarrior Jun 15 '16 at 16:46
    You seem to have some complicated logic going on here. As @papayawarrior suggested, it would be helpful to have a simpler example with some sample data. However, from the logic that generates all the "zero" sframes, I don't see any reason for you to loop over every unique ID. You could probably avoid generating all those by using apply with a function that has this logic (e.g. if action == 6: use x6, whatever that is, if 7, use x7) and just apply it over the whole SFrame. You could also do the transformation of the timestamp to the entire SFrame. – Evan Samanas Jun 15 '16 at 16:58
  • @EvanSamanas can you give an example? – Parisa Rai Jun 16 '16 at 07:23

1 Answer

I think your logic is relatively complex, but it's still more efficient to use column-wise operations on the whole dataset, rather than extracting the subset of rows for each user. The key tools are SFrame.groupby, SFrame.apply, SFrame.unstack, and SFrame.unpack. API docs here:

https://dato.com/products/create/docs/generated/graphlab.SFrame.html

Here's a solution that uses slightly simpler data than your example and slightly simpler logic to code the old vs. new actions.

# Set up and make the data
import graphlab as gl
import datetime as dt

sf = gl.SFrame({'user': [574, 574, 574, 588, 588, 588],
                'timestamp': [dt.datetime(2015, 9, 23), dt.datetime(2015, 9, 23),
                              dt.datetime(2015, 10, 6), dt.datetime(2015, 11, 4),
                              dt.datetime(2015, 11, 5), dt.datetime(2016, 1, 6)],
                'action': [1, 2, 3, 1, 1, 7]})

# Count old vs. new actions. The datetime difference comes back in
# seconds, so divide by 3600 * 24 to get days.
sf['days_elapsed'] = (dt.datetime.today() - sf['timestamp']) / (3600 * 24)
sf['old_threshold'] = sf['days_elapsed'] > 225

aggregator = {'total_count': gl.aggregate.COUNT('user'),
              'old_count': gl.aggregate.SUM('old_threshold')}
grp = sf.groupby(['user', 'action'], aggregator)

# Code the actions according to old vs. new. Use your own logic here.
grp['action_code'] = grp.apply(
                       lambda x: 2 if x['total_count'] > x['old_count'] else 1)
grp = grp[['user', 'action', 'action_code']]

# Reshape the results into columns.
sf_new = (grp.unstack(['action', 'action_code'], new_column_name='action_code')
             .unpack('action_code'))

# Fill in zeros for entries with no actions.
for c in sf_new.column_names():
    sf_new[c] = sf_new[c].fillna(0)

print sf_new
+------+---------------+---------------+---------------+---------------+
| user | action_code.1 | action_code.2 | action_code.3 | action_code.7 |
+------+---------------+---------------+---------------+---------------+
| 588  |       2       |       0       |       0       |       2       |
| 574  |       1       |       1       |       1       |       0       |
+------+---------------+---------------+---------------+---------------+
[2 rows x 5 columns]
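The same groupby-then-pivot idea can be checked outside GraphLab. Below is a rough pandas analogue of the pipeline above, where `pivot` plays the role of `unstack` + `unpack`; the fixed reference date (roughly when this answer was written) is an assumption made so the output matches the table above:

```python
import datetime as dt
import pandas as pd

now = dt.datetime(2016, 6, 16)  # assumed reference date for reproducibility

df = pd.DataFrame({'user': [574, 574, 574, 588, 588, 588],
                   'timestamp': [dt.datetime(2015, 9, 23), dt.datetime(2015, 9, 23),
                                 dt.datetime(2015, 10, 6), dt.datetime(2015, 11, 4),
                                 dt.datetime(2015, 11, 5), dt.datetime(2016, 1, 6)],
                   'action': [1, 2, 3, 1, 1, 7]})

# Flag "old" rows once, over the whole frame -- no per-user loop.
df['old'] = (now - df['timestamp']).dt.days > 225

# One groupby over (user, action) pairs replaces the per-user subsetting.
grp = df.groupby(['user', 'action']).agg(
    total_count=('timestamp', 'size'),
    old_count=('old', 'sum')).reset_index()

# Code each (user, action) pair: 2 if any recent rows exist, else 1.
grp['action_code'] = (grp['total_count'] > grp['old_count']).map({True: 2, False: 1})

# Pivot actions into columns, filling absent actions with 0.
wide = (grp.pivot(index='user', columns='action', values='action_code')
           .fillna(0).astype(int))
print(wide)
```

With the fixed date this reproduces the answer's table: user 588 gets code 2 for actions 1 and 7 (recent activity), user 574 gets code 1 for actions 1-3 (only old activity), and missing actions are 0.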
papayawarrior