3

Table(df):

customer_id    Order_date 
    1       2015-01-16      
    1       2015-01-19      
    2       2014-12-21      
    2       2015-01-10      
    1       2015-01-10
    3       2018-01-18
    3       2017-03-04
    4       2019-11-05
    4       2010-01-01
    3       2019-02-03
    3       2020-01-01
    3       2018-01-01

Output needed: A subset of the df where customer_IDs have more than 3 order_dates. (skipping 2,4 and 5 customer ID since they have less than 3 order_dates)

 Customer_id  Number_of_Order_dates
     1          3
     3          5

I have tried groupby but it has not worked to create a subset. Please help.

Codes tried so far which failed:

df[df['days'].count()>3]

and one more I tried which is incorrect:

 df1=df.groupby('customer_id')['order_date'].count()
 df[df1.iloc[:,1]]
anky
  • 74,114
  • 11
  • 41
  • 70
noob
  • 3,601
  • 6
  • 27
  • 73

5 Answers5

6

IIUC

df.groupby('customer_id')['Order_date'].nunique().loc[lambda x : x>=3].reset_index()
Out[94]: 
   customer_id  Order_date
0            1           3
1            3           5
BENY
  • 317,841
  • 20
  • 164
  • 234
4

You can use:

df.groupby('customer_id').filter(lambda x: 
     (x['Order_date'].nunique()>=3)).groupby('customer_id').count()

Or:

(df[df.groupby('customer_id')['Order_date'].transform('nunique').ge(3)]
   .groupby('customer_id').count())

             Order_date
customer_id            
1                     3
3                     5
anky
  • 74,114
  • 11
  • 41
  • 70
2

Using GroupBy.nunique with DataFrame.query:

df.groupby('customer_id')['Order_date'].nunique().reset_index().query('Order_date >= 3')

   customer_id  Order_date
0            1           3
2            3           5
Erfan
  • 40,971
  • 8
  • 66
  • 78
1

with dict

d = {}
for c, o in zip(*map(df.get, df)):
    d.setdefault(c, set()).add(o)

pd.DataFrame(
    [(c, len(o)) for c, o in d.items() if len(o) >= 3],
    columns=[*df]
)

   customer_id  Order_date
0            1           3
1            3           5

with pd.factorize and np.bincount

i, u = df.drop_duplicates().customer_id.factorize()
c = np.bincount(i)

pd.DataFrame(
    [(u_, c_) for u_, c_ in zip(u, c) if c_ > 2],
    columns=[*df]
)

   customer_id  Order_date
0            1           3
1            3           5
piRSquared
  • 285,575
  • 57
  • 475
  • 624
0

a brute force approach is to add the groupby as a new column (pointers), with a name like num_dates and then restrict the overall df like this:

result = my_df[my_df['num_dates'] > 3]

treetopper
  • 46
  • 4