1

I have a pandas dataframe with 3 columns where:

  • Category dtype - string
  • Date dtype - datetime
  • Values dtype - float

    df = pd.DataFrame()
    df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
    df['date'] = ['2018-01-01', '2018-01-01', '2018-01-03', '2018-01-05', '2018-01-01', '2018-01-02', '2018-01-06', '2018-01-03', '2018-01-04','2018-01-01']
    df['values'] = [1, 2, -1.5, 2.3, 5, -0.7, -5.2, -5.2, 1, -1.1]
    df
    

Dataframe view

I want to filter for rows that have a positive value and a negative value (with the least difference) close to that date per category.

So, essentially an output that looks like:

df = pd.DataFrame()
df['category'] = ['a', 'a','b', 'b', 'c', 'c']
df['date'] = ['2018-01-01', '2018-01-01', '2018-01-01', '2018-01-03', '2018-01-01', '2018-01-03']
df['values'] = [1, -1.1, 2, -1.5, 5, -5.2]
df

Filtered Dataframe

I have looked at similar queries on SO (Identifying closest value in a column for each filter using Pandas, How do I find the closest values in a Pandas series to an input number?)

The first one utilises idxmin, which returns first occurence, not the closest in value.

The second link is speaking about a specific value as an input - I don't think a pure np.argsort works in my case.

I can imagine using a complex web of if statements to do this, but, I'm not sure what the most efficient way of doing this is with pandas.

Any guidance would be greatly appreciated.

DaytaSigntist
  • 117
  • 2
  • 8
  • 2
    Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Sep 17 '18 at 10:50
  • Thanks @jezrael, I have updated my query. – DaytaSigntist Sep 17 '18 at 12:12

1 Answers1

0

IIUC, sort your dataframe first then use idxmin:

df1 = df.sort_values(['category','date'])
df1[df1.groupby('category')['values']\
       .transform(lambda x: x.index.isin([x.ge(0).idxmin(), x.lt(0).idxmin()]))]

Output:

  category        date  values
0        a  2018-01-01     1.0
9        a  2018-01-01    -1.1
1        b  2018-01-01     2.0
2        b  2018-01-03    -1.5
4        c  2018-01-01     5.0
7        c  2018-01-03    -5.2
Scott Boston
  • 147,308
  • 15
  • 139
  • 187