Pandas dataframe filter vectorized

Question

I have a DataFrame like so:

a = date       | user | value
    -------------------------
    2017-05-01 | AAA  | 1
    2017-05-02 | AAA  | 2
    ...        | AAA  | ...
    2017-08-31 | AAA  | 1.7
    2017-05-01 | BBB  | 1.5
    2017-05-02 | BBB  | 3
    ...        | BBB  | ...
    2017-08-31 | BBB  | 2.7

Additionally, I have a list like so:

b = start_date | user
    -----------------
    2017-05-02 | AAA
    2017-05-05 | AAA
    2017-05-03 | BBB

I now want to extract a 3-dimensional matrix from a that contains a list of the 3 next dates starting from the start_date given in b for the user given in b. So in the above example the following:

[
[[2017-05-02, AAA, 2], [2017-05-03, AAA, x], [2017-05-04, AAA, x]],
[[2017-05-05, AAA, 2], [2017-05-06, AAA, x], [2017-05-07, AAA, x]],
[[2017-05-03, BBB, 2], [2017-05-04, BBB, x], [2017-05-05, BBB, x]],
]

x stands for a random value, I was just too lazy to type all examples down.

I am using Python 3.5. Thanks a whole lot in advance. I am looking forward to your smart answers :-)

can you setup your dataframe so that they are copy and paste-able, like they are shown [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and also, structure b correctly. That's not how lists are setup. — DJK, Aug 28 '17 at 17:21

BENY · Accepted Answer · 2017-08-28T17:49:35.823

try this .. Also check the link about 'How to make good reproducible pandas examples'

DF=pd.merge(df1,df2,on='user',how='outer')
DF['DIFF']=DF['date_x']-DF['date_y']
DF=DF.loc[(DF.DIFF/np.timedelta64(1, 'D')>=0)&(DF.DIFF/np.timedelta64(1, 'D')<=2),:]
DF.drop('DIFF',1).groupby([DF.index.get_level_values(0),'date_y'])['date_x','user','value'].\
   apply(lambda x :x.values.tolist()).values.tolist()

Out[760]: 
[[[Timestamp('2017-05-02 00:00:00'), 'AAA  ', 2.0],
  [Timestamp('2017-05-03 00:00:00'), 'AAA  ', 2.0],
  [Timestamp('2017-05-04 00:00:00'), 'AAA  ', 2.0]],
 [[Timestamp('2017-05-05 00:00:00'), 'AAA  ', 2.0],
  [Timestamp('2017-05-06 00:00:00'), 'AAA  ', 2.0],
  [Timestamp('2017-05-07 00:00:00'), 'AAA  ', 1.7]],
 [[Timestamp('2017-05-03 00:00:00'), 'BBB  ', 3.0],
  [Timestamp('2017-05-04 00:00:00'), 'BBB  ', 3.0],
  [Timestamp('2017-05-05 00:00:00'), 'BBB  ', 3.0]]]

Data Input

df1
Out[748]:
         date   user  value
0  2017-05-01  AAA      1.0
1  2017-05-02  AAA      2.0
2  2017-05-03  AAA      2.0
3  2017-05-04  AAA      2.0
4  2017-05-05  AAA      2.0
5  2017-05-06  AAA      2.0
6  2017-05-07  AAA      1.7
7  2017-05-01  BBB      1.5
8  2017-05-02  BBB      3.0
9  2017-05-03  BBB      3.0
10 2017-05-04  BBB      3.0
11 2017-05-05  BBB      3.0

df2
Out[749]: 
        date   user
0 2017-05-02  AAA  
1 2017-05-05  AAA  
2 2017-05-03  BBB

Thanks for your help. I will pay attention to building correct pandas data frames in the future. Thank you! — Lukas Hestermeyer, Aug 29 '17 at 05:41

Pandas dataframe filter vectorized

1 Answers1