
I have a multiyear daily dataframe with different entities (Col1) present in different years (Col2). I am trying to filter the dataframe so that it retains only the rows whose entity has a value in each of the specified years. (I created Col2 from the datetime index, thinking it would make the filtering easier.)

In other words, I am trying to design a conditional filter where a row is retained only if its value in one column satisfies a condition involving another column.

This is my dataframe:

Date    Col1 Col2 Col3 Col4
1/1/16  M   2016  34  a
2/2/16  M   2016  35  f
3/3/16  M   2016  45  d
1/1/17  S   2017  66  ff
2/2/17  S   2017  66  mn
1/1/18  T   2018  78  jh
2/2/18  T   2018  789  kk
1/1/17  M   2017  100  f
2/2/17  S   2017  2020 jj
2/2/18  M   2018  3005 m

Entities M, S and T record values during 2016-2018, but only M has values in all three years.

Thus, resulting dataframe should look like this:

Date    Col1 Col2 Col3 Col4
1/1/16  M   2016 34   a
2/2/16  M   2016 35   f
3/3/16  M   2016 45   d
1/1/17  M   2017 100  f
2/2/18  M   2018 3005 m
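
For reproducibility, here is a minimal sketch that rebuilds the sample frame from the table above (column order may differ; Col2 is derived from the parsed Date, as described earlier):

    import pandas as pd

    # Rebuild the sample frame from the table above
    df_in = pd.DataFrame({
        'Date': ['1/1/16', '2/2/16', '3/3/16', '1/1/17', '2/2/17',
                 '1/1/18', '2/2/18', '1/1/17', '2/2/17', '2/2/18'],
        'Col1': ['M', 'M', 'M', 'S', 'S', 'T', 'T', 'M', 'S', 'M'],
        'Col3': [34, 35, 45, 66, 66, 78, 789, 100, 2020, 3005],
        'Col4': ['a', 'f', 'd', 'ff', 'mn', 'jh', 'kk', 'f', 'jj', 'm'],
    })
    # Col2 is the year taken from the parsed date
    df_in['Date'] = pd.to_datetime(df_in['Date'], format='%m/%d/%y')
    df_in['Col2'] = df_in['Date'].dt.year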

I tried to work out the logic with the following lines of code, but none of them work:

(1)

    years = [2016, 2017, 2018]
    for station in df_in['Col1']:
        if df_in['Col2'].isin(years).any():
            df = df_in
        else:
            df = df_in.drop(df_in[df_in.Col1].index)

OR

(2)
    df = df_in['Col1'].apply(lambda x: x in df_in['year'].isin(years))

OR

(3)
    df = df_in.loc[(df_in['Col1']) & (df_in['Col2'].isin(years))]

What will be a good way of achieving this?

Any help will be much appreciated.

3 Answers


Check with `filter` + `groupby`:

s = df.groupby('Col1').filter(lambda x: pd.Series([2016, 2017, 2018]).isin(x['Col2']).all())
     Date Col1  Col2
0  1/1/16    M  2016
1  2/2/16    M  2016
2  3/3/16    M  2016
7  1/1/17    M  2017
9  2/2/18    M  2018
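
Here `filter` evaluates the lambda once per `Col1` group and keeps all of that group's rows only when the lambda returns True, i.e. when every year in the list appears somewhere in the group's `Col2`.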

If you need to compare against exactly the values from the `years` list, compare sets for equality inside `GroupBy.transform` to build a mask, then filter with boolean indexing:

years = [2016,2017,2018]
df1 = df[df.groupby('Col1')['Col2'].transform(lambda x: set(x) == set(years))]
print (df1)
     Date Col1  Col2  Col3 Col4
0  1/1/16    M  2016    34    a
1  2/2/16    M  2016    35    f
2  3/3/16    M  2016    45    d
7  1/1/17    M  2017   100    f
9  2/2/18    M  2018  3005    m
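
`transform` broadcasts the group-level boolean back to every row of its group, so the result is an index-aligned mask that can be used directly for boolean indexing on `df`.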

If you want to check that the values from the list exist while allowing additional values per group, compare for superset ("greater than or equal"):

print (df)
     Date Col1  Col2  Col3 Col4
0  1/1/16    M  2019    34    a <- 2019
1  2/2/16    M  2016    35    f
2  3/3/16    M  2016    45    d
3  1/1/17    S  2017    66   ff
4  2/2/17    S  2017    66   mn
5  1/1/18    T  2018    78   jh
6  2/2/18    T  2018   789   kk
7  1/1/17    M  2017   100    f
8  2/2/17    S  2017  2020   jj
9  2/2/18    M  2018  3005    m

years = [2016,2017,2018]
df1 = df[df.groupby('Col1')['Col2'].transform(lambda x: set(x) >= set(years))]
print (df1)
     Date Col1  Col2  Col3 Col4
0  1/1/16    M  2019    34    a
1  2/2/16    M  2016    35    f
2  3/3/16    M  2016    45    d
7  1/1/17    M  2017   100    f
9  2/2/18    M  2018  3005    m
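
For larger data, a vectorized alternative to the Python-level lambda is also possible. The sketch below (my addition, not part of the original answer) counts how many of the required years each group covers and is equivalent to the superset check above:

    years = [2016, 2017, 2018]
    # nunique() counts the distinct required years present in each Col1 group;
    # a group passes only if it covers all of them
    counts = df.loc[df['Col2'].isin(years)].groupby('Col1')['Col2'].nunique()
    keep = counts[counts.eq(len(years))].index
    df1 = df[df['Col1'].isin(keep)]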
  • @Learning_datascience - It depends on the data. I also tested the `filter` solution and it was `2.5` times slower, so for large data, or if performance is important, it is better to use `transform`. – jezrael Jul 03 '20 at 13:49
  • @Learning_datascience - `np.random.seed(123) N = 10000 L = range(2016, 2020) df = pd.DataFrame({'Col2': np.random.choice(L, N), 'Col1':np.random.randint(N // 7, size=N)})` – jezrael Jul 03 '20 at 13:50
  • 1
    @Learning_datascience - `In [187]: %timeit df[df.groupby('Col1')['Col2'].transform(lambda x: set(x) >= set(years))] 193 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` `In [188]: %timeit df.groupby('Col1').filter(lambda x : pd.Series([2016,2017,2018]).isin(x['Col2']).all()) 530 ms ± 48.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` – jezrael Jul 03 '20 at 13:50
  • @jezrael - This approach works on simple dataframes like the one I used in the example. However, the actual dataframe I am using has multiple columns that need to be retained, and using your code on it yields an empty dataframe. I am updating my question to show what I mean. – Learning_datascience Jul 03 '20 at 14:03
  • @Learning_datascience - Did you test both solutions? The second one should be the better match. – jezrael Jul 03 '20 at 14:05
  • @Learning_datascience - Are the values in Col2 numeric? Does `print (df['Col2'].dtype)` return `int` or `float`? The other columns are not important here; I edited the answer. – jezrael Jul 03 '20 at 14:12
  • 1
    @Learning_datascience - Thanks, glad to help. Don't forget to accept the answer, if it suits you! :) – jezrael Jul 03 '20 at 14:19

In case you don't want to hardwire the years, please try the following:

  1. Extract all the unique years into a list.
  2. Group by Col1 and build, for each group, the list of all years it contains.
  3. Check membership: convert each group's list to a set to eliminate duplicates, intersect it with the list `d`, and boolean-select the rows where the intersection is as long as `d`.

    d = df.Col2.unique().tolist()
    df2 = df.groupby(['Col1']).Col2.agg(list).reset_index().rename(columns={'Col2': 'Lst'})
    df[pd.merge(df, df2, how='left', on='Col1').Lst.apply(lambda e: [*{*e} & {*d}]).str.len() == len(d)]



    Date Col1  Col2
0  1/1/16    M  2016
1  2/2/16    M  2016
2  3/3/16    M  2016
7  1/1/17    M  2017
9  2/2/18    M  2018
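
Note that the left merge broadcasts each group's year list (`Lst`) onto every row of `df`; assuming `df` keeps its default RangeIndex, the resulting boolean mask is index-aligned with `df` and can be used for selection directly.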

%%timeit 
d=df.Col2.unique().tolist()
df2=df.groupby(['Col1']).Col2.agg(list).reset_index().rename(columns={'Col2':'Lst'})
df[pd.merge(df,df2, how='left', on='Col1').Lst.apply(lambda e:[*{*e}&{*d}]).str.len()==len(d)]
7.5 ms ± 45.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)