
I'm looking to return a DataFrame that contains the last row (the row with the most recent date in the index) of each group, where the second level of the MultiIndex is filtered by a logical indexing condition.

Here is a toy example to illustrate:

import numpy as np
import pandas as pd
from datetime import datetime

dates  = pd.date_range(start='1/1/2018', end='1/4/2018').to_pydatetime().tolist() * 2
ids    = ['z7321', 'z7321', 'z7321', 'z7321', 'b2134', 'b2134', 'b2134', 'b2134']

arrays = [ids, dates]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['key', 'date'])

df = pd.DataFrame(data=np.random.randn(len(index)), index=index, columns=['change'])

print(df)

                   change
key   date
z7321 2018-01-01 -0.701605
      2018-01-02 -0.934580
      2018-01-03  0.186554
      2018-01-04  0.417024
b2134 2018-01-01  0.682699
      2018-01-02 -0.913633
      2018-01-03  0.330347
      2018-01-04 -0.706429

The condition would be: take the last row of each group after filtering with df[df.index.get_level_values(1) <= datetime(2018, 1, 2)].
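For reference, the filter alone (before taking the last row per group) keeps every row on or before the cutoff; a minimal sketch rebuilding the toy frame above:

```python
import numpy as np
import pandas as pd
from datetime import datetime

# Rebuild the toy frame from above
dates = pd.date_range(start='1/1/2018', end='1/4/2018').to_pydatetime().tolist() * 2
ids = ['z7321'] * 4 + ['b2134'] * 4
index = pd.MultiIndex.from_tuples(list(zip(ids, dates)), names=['key', 'date'])
df = pd.DataFrame(np.random.randn(len(index)), index=index, columns=['change'])

# The filter alone keeps every row on or before the cutoff (two per key here)
filtered = df[df.index.get_level_values(1) <= datetime(2018, 1, 2)]
print(filtered)
```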

The desired output looks like this:

                   change
key   date
z7321 2018-01-02 -0.934580
b2134 2018-01-02 -0.913633

Additional Considerations:

  1. Directly selecting the rows with df[df.index.get_level_values(1) == datetime(2018, 1, 2)] isn't an option, since the second index level (the date level) may not contain an exact match for datetime(2018, 1, 2).
  2. The date index may not contain the same values across the key groups, i.e. 'z7321' could have different dates in its second-level index than 'b2134'.
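To see consideration 1 concretely, here is a sketch where one key's dates end before the cutoff: exact-match selection silently drops that key entirely.

```python
import numpy as np
import pandas as pd
from datetime import datetime

# One key contains the cutoff date, the other ends before it
dates = pd.date_range(start='1/1/2018', end='1/4/2018').to_pydatetime().tolist()
dates += pd.date_range(start='12/29/2017', end='1/1/2018').to_pydatetime().tolist()
ids = ['z7321'] * 4 + ['b2134'] * 4
index = pd.MultiIndex.from_tuples(list(zip(ids, dates)), names=['key', 'date'])
df = pd.DataFrame(np.random.randn(len(index)), index=index, columns=['change'])

# Exact-match selection only finds 'z7321'; 'b2134' has no 2018-01-02 row
exact = df[df.index.get_level_values(1) == datetime(2018, 1, 2)]
print(exact.index.get_level_values(0).tolist())  # ['z7321']
```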
  • Does this answer your question? [Get only the first and last rows of each group with pandas](https://stackoverflow.com/questions/53927414/get-only-the-first-and-last-rows-of-each-group-with-pandas) – AMC Jan 08 '20 at 21:10
  • See also: [`.tail()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.tail.html#pandas-core-groupby-groupby-tail). – AMC Jan 08 '20 at 21:12
  • Yes it does help answer the question, I didn't come across it in my searching. Applying .tail(1) also works. – eNc Jan 08 '20 at 21:18

1 Answer


While writing up the toy example, I ended up finding a way to get the desired output. Hopefully this solution is helpful to someone else, or can be improved upon.

The following provides the desired output:

df1 = df[df.index.get_level_values(1) <= datetime(2018, 1, 2)].groupby(level='key', as_index=False).nth(-1)
print(df1)

                   change
key   date
z7321 2018-01-02 -0.934580
b2134 2018-01-02 -0.913633

The same expression also works when the second index level is inconsistent across the first-level groups:

import numpy as np
import pandas as pd
from datetime import datetime

dates = pd.date_range(start='1/1/2018', end='1/4/2018').to_pydatetime().tolist()
dates += pd.date_range(start='12/29/2017', end='1/1/2018').to_pydatetime().tolist()

ids   = ['z7321', 'z7321', 'z7321', 'z7321', 'b2134', 'b2134', 'b2134', 'b2134']

arrays = [ids, dates]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['key', 'date'])

df = pd.DataFrame(data=np.random.randn(len(index)), index=index, columns=['change'])
print(df)

                    change
key   date
z7321 2018-01-01 -1.420757
      2018-01-02 -0.297835
      2018-01-03  0.693520
      2018-01-04  0.909420
b2134 2017-12-29 -1.577685
      2017-12-30  0.632395
      2017-12-31  1.158273
      2018-01-01 -0.242314


df1 = df[df.index.get_level_values(1) <= datetime(2018, 1, 2)].groupby(level='key', as_index=False).nth(-1)
print(df1)

                    change
key   date
z7321 2018-01-02 -0.297835
b2134 2018-01-01 -0.242314
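As noted in the comments, chaining .tail(1) after the groupby gives the same result and reads a bit more simply; a sketch reusing the second toy frame:

```python
import numpy as np
import pandas as pd
from datetime import datetime

dates = pd.date_range(start='1/1/2018', end='1/4/2018').to_pydatetime().tolist()
dates += pd.date_range(start='12/29/2017', end='1/1/2018').to_pydatetime().tolist()
ids = ['z7321'] * 4 + ['b2134'] * 4
index = pd.MultiIndex.from_tuples(list(zip(ids, dates)), names=['key', 'date'])
df = pd.DataFrame(np.random.randn(len(index)), index=index, columns=['change'])

# Filter to the cutoff, then keep the last row of each key group
df1 = df[df.index.get_level_values(1) <= datetime(2018, 1, 2)].groupby(level='key').tail(1)
print(df1)
```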