Finding monthly repeating values in dataframe (pandas)

Question

So I have a dataset that includes dates and values, corresponding to those dates.

date  value  category
1951-07 199 1
1951-07 130 3
1951-07 50  5
1951-08 199 1
1951-08 50  5
1951-08 199 1
1951-09 184 2
1951-09 50  5
1951-09 13  13

Now my goal is to find the values, that repeat each month. Resulting in a frame like this:

date  value  category
1951-07 50  5
1951-08 50  5
1951-09 50  5

Also not regarding values that repeat inside a month, or that repeat only for a few months, but not all.

The categories do often pai with the value (like shown in the example), but sometimes they don't. So I tried doing it by category, but it didn't give me exact results.

My current approach is to filter for duplicates and then get those, that occure 12 times (as i'm searching per year). But it also gives me values, that repeat 12 sides inside a month.

df = df[df.duplicated(['value'],keep=False)]
v = df.value.value_counts()
df_12 = df[df.value.isin(v.index[v.gt(12)])]

Any help would be appreciated.

Serge Ballesta · Accepted Answer · 2020-03-05T12:36:58.163

I would first group by values and remove duplicates on dates:

tmp = df.groupby('value')['date'].apply(lambda x: x.drop_duplicates())

Your sample would give:

value   
13     8    1951-09
50     2    1951-07
       4    1951-08
       7    1951-09
130    1    1951-07
184    6    1951-09
199    0    1951-07
       3    1951-08
Name: date, dtype: object

Then we can safely count the values and only keep the ones having the expected count::

total = tmp.groupby(level=0).count()
total = total[total == 3]

We get:

value
50    3
Name: date, dtype: int64

We can finaly filter the original dataframe:

df[df['value'].isin(total.index)]

giving the expected:

      date  value  category
2  1951-07     50         5
4  1951-08     50         5
7  1951-09     50         5

From Jezrael comment, the first steps to build total should become:

total = df.drop_duplicates(['date', 'value'])[['date', 'value']
                                              ].groupby('value').count()['date']
total = total[total == 3]

it is both simpler and faster...

@jezrael: thanks for the head up! I proceeded step by step and did not realized it :-( — Serge Ballesta, Mar 05 '20 at 12:17
@jezrael: I have edited my post with your comment. Thank you again... — Serge Ballesta, Mar 05 '20 at 12:37
The edited code actually does leave in some values, that only appear once. For my dataset, the first approach works much more reliable. — Sascha Stenger, Mar 05 '20 at 12:54

yatu · Answer 2 · 2020-03-05T11:03:02.747

Here's one way pivoting and checking where all rows are notna:

piv = df.pivot_table(index='date', columns='category', values='value', aggfunc='first')

df[df.category.eq(piv.notna().all(0).idxmax())]

    date    value  category
2  1951-07     50         5
4  1951-08     50         5
7  1951-09     50         5

Where:

print(piv)

category     1      2      3     5     13
date                                     
1951-07   199.0    NaN  130.0  50.0   NaN
1951-08   199.0    NaN    NaN  50.0   NaN
1951-09     NaN  184.0    NaN  50.0  13.0

Oleg O · Answer 3 · 2020-03-05T13:15:46.840

1

First get the duplicated

df_dups = df[df.duplicated(subset=['value', 'category'], keep=False)]

Then remove those that are duplicated inside only a single month, i.e.

df_dups = df_dups.groupby(['value', 'category']).filter(lambda g: g['date'].nunique() > 1)

You can also remove the duplicated inside the resulting table, i.e.

df_dups = df_dups.groupby(['value', 'category']).apply(lambda g: g.drop_duplicates('date', keep='last))

edited Mar 05 '20 at 13:15

answered Mar 05 '20 at 11:12

Oleg O

1,005
6
11

The last line leads to an error: DataFrameGroupBy' object has no attribute 'drop_duplicates' – Sascha Stenger Mar 05 '20 at 13:06

Finding monthly repeating values in dataframe (pandas)

3 Answers3