3

So I have a dataset that includes dates and values, corresponding to those dates.

date  value  category
1951-07 199 1
1951-07 130 3
1951-07 50  5
1951-08 199 1
1951-08 50  5
1951-08 199 1
1951-09 184 2
1951-09 50  5
1951-09 13  13

Now my goal is to find the values, that repeat each month. Resulting in a frame like this:

date  value  category
1951-07 50  5
1951-08 50  5
1951-09 50  5

Also not regarding values that repeat inside a month, or that repeat only for a few months, but not all.

The categories do often pai with the value (like shown in the example), but sometimes they don't. So I tried doing it by category, but it didn't give me exact results.

My current approach is to filter for duplicates and then get those, that occure 12 times (as i'm searching per year). But it also gives me values, that repeat 12 sides inside a month.

df = df[df.duplicated(['value'],keep=False)]
v = df.value.value_counts()
df_12 = df[df.value.isin(v.index[v.gt(12)])]

Any help would be appreciated.

3 Answers3

4

I would first group by values and remove duplicates on dates:

tmp = df.groupby('value')['date'].apply(lambda x: x.drop_duplicates())

Your sample would give:

value   
13     8    1951-09
50     2    1951-07
       4    1951-08
       7    1951-09
130    1    1951-07
184    6    1951-09
199    0    1951-07
       3    1951-08
Name: date, dtype: object

Then we can safely count the values and only keep the ones having the expected count::

total = tmp.groupby(level=0).count()
total = total[total == 3]

We get:

value
50    3
Name: date, dtype: int64

We can finaly filter the original dataframe:

df[df['value'].isin(total.index)]

giving the expected:

      date  value  category
2  1951-07     50         5
4  1951-08     50         5
7  1951-09     50         5

From Jezrael comment, the first steps to build total should become:

total = df.drop_duplicates(['date', 'value'])[['date', 'value']
                                              ].groupby('value').count()['date']
total = total[total == 3]

it is both simpler and faster...

Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
1

Here's one way pivoting and checking where all rows are notna:

piv = df.pivot_table(index='date', columns='category', values='value', aggfunc='first')

df[df.category.eq(piv.notna().all(0).idxmax())]

    date    value  category
2  1951-07     50         5
4  1951-08     50         5
7  1951-09     50         5

Where:

print(piv)

category     1      2      3     5     13
date                                     
1951-07   199.0    NaN  130.0  50.0   NaN
1951-08   199.0    NaN    NaN  50.0   NaN
1951-09     NaN  184.0    NaN  50.0  13.0
yatu
  • 86,083
  • 12
  • 84
  • 139
1

First get the duplicated

df_dups = df[df.duplicated(subset=['value', 'category'], keep=False)]

Then remove those that are duplicated inside only a single month, i.e.

df_dups = df_dups.groupby(['value', 'category']).filter(lambda g: g['date'].nunique() > 1)

You can also remove the duplicated inside the resulting table, i.e.

df_dups = df_dups.groupby(['value', 'category']).apply(lambda g: g.drop_duplicates('date', keep='last))
Oleg O
  • 1,005
  • 6
  • 11