I have a data frame with user_ids stored as an indexed frame_table in an HDFStore. Also in this HDF file is another table with actions the user took. I want to grab all of the actions taken by 1% of the users. The procedure is as follows:
# Get 1% of the user IDs
import random as rnd  # assuming rnd is the standard-library random module
df_id = store.select('df_user_id', columns=['id'])
unique_ids = df_id.id.unique()
# random.sample needs a sequence and an integer sample size
pct1_users = rnd.sample(list(unique_ids), int(0.01 * len(unique_ids)))
df_id = df_id[df_id.id.isin(pct1_users)]
Now I want to go back and get all of the additional information describing the actions taken by these users from frame_tables that are indexed identically to df_user_id. As per this example and this question, I have done the following:
# Select the action rows whose index matches the sampled users' rows
pct1_actions = store.select('df_actions', where=pd.Term('index', df_id.index))
This simply returns an empty data frame. In fact, if I copy and paste the example from the pandas docs linked above, I also get an empty data frame. Did something change about Term in recent pandas? I'm on pandas 0.12.
I'm not tied to any particular solution, as long as I can get the HDFStore indices from a lookup on the df_id table (which is fast) and then pull those indices directly from the other frame tables.
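To make the goal concrete, here is a minimal sketch of that kind of lookup on a toy store. It is an illustration under assumptions, not a confirmed fix: the toy frames, the file name toy_store.h5 and the names coords and sampled_actions are all hypothetical, it needs PyTables installed, and it relies on select_column plus passing row coordinates as where in a recent pandas. Whether the same calls behave this way on 0.12 is not something the sketch establishes.

```python
import os
import random
import tempfile

import numpy as np
import pandas as pd

# Toy data standing in for the real tables (hypothetical contents):
# 200 users with 5 rows each, and one action row per user row.
users = pd.DataFrame({"id": np.repeat(np.arange(200), 5)})
actions = pd.DataFrame({"action": np.arange(len(users))})

path = os.path.join(tempfile.mkdtemp(), "toy_store.h5")
with pd.HDFStore(path, mode="w") as store:
    # 'id' must be a data column so it can be read on its own.
    store.append("df_user_id", users, data_columns=["id"])
    store.append("df_actions", actions)

    # Read only the id column; its index holds the row coordinates.
    ids = store.select_column("df_user_id", "id")
    unique_ids = ids.unique()

    # Sample 1% of the distinct user ids.
    random.seed(0)
    sampled = random.sample(list(unique_ids), len(unique_ids) // 100)

    # Row coordinates of every row belonging to a sampled user.
    coords = ids[ids.isin(sampled)].index

    # Pull the same row coordinates from the identically ordered table.
    sampled_actions = store.select("df_actions", where=coords)
```

The sketch only works if both tables were written in the same row order, since the coordinates are row positions rather than index labels.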