2

Question

  • I want the result of Pandas .resample() method to have 2 levels of index. As you can check it out on the example below, the resampling of good_example_df produces the 2 level of index.
  • When I changed the 2nd element of index_list from datetime(2020,1,2,12,39,59) to datetime(2020,1,3,12,39,59), the .resample() method gives me only 1 level of index. The example is given with bad_example_df. It is something that I do not want to take place.
  • How can I make the result of .resample() method to have always the 2 level of index, as the good_example_df does?
  • The reason I am asking this question is that I want the result of the resample method to be consistent. I don't want the result of the method to differ depending on the input.

Code Example

import pandas as pd
from datetime import datetime, time

index_list = [datetime(2020,1,2,2,43,59), datetime(2020,1,2,12,39,59),datetime(2020,1,3,21,42,59),
              datetime(2020,1,4,2,53,59), datetime(2020,1,4,19,17,59)]
data_dict = {'return': [1.003, 1.02, 1.06, 1.02, 1.03],
             'stock' : ['AMZN', 'APPL', 'NVDA', 'MSFT', 'AMZN']}

good_example_df = pd.DataFrame(data = data_dict, index = index_list)
good_example_df

>>                    return  stock
2020-01-02  2:43:59   1.003    AMZN
2020-01-02 12:39:59   1.020    APPL
2020-01-03 21:42:59   1.060    NVDA
2020-01-04 02:53:59   1.020    MSFT
2020-01-04 19:17:59   1.030    AMZN

good_example_df.resample("D").apply(lambda x: x[:2])

>>                               return  stock
2020-01-02  2020-01-02 02:43:59   1.003   AMZN
            2020-01-02 12:39:59   1.020   APPL
2020-01-03  2020-01-03 21:42:59   1.060   NVDA
2020-01-04  2020-01-04 02:53:59   1.020   MSFT
            2020-01-04 19:17:59   1.030   AMZN

new_index_list = [datetime(2020,1,2,2,43,59), datetime(2020,1,3,12,39,59),datetime(2020,1,3,21,42,59),
                  datetime(2020,1,4,2,53,59), datetime(2020,1,4,19,17,59)]

bad_example_df = pd.DataFrame(data = data_dict, index = new_index_list)
bad_example_df

>>                    return  stock
2020-01-02  2:43:59   1.003    AMZN
2020-01-03 12:39:59   1.020    APPL
2020-01-03 21:42:59   1.060    NVDA
2020-01-04 02:53:59   1.020    MSFT
2020-01-04 19:17:59   1.030    AMZN

bad_example_df.resample("D").apply(lambda x: x[:2])

>>>                return          stock
2020-01-02         1.003            AMZN
2020-01-03  [1.02, 1.06]    [APPL, NVDA]
2020-01-04  [1.02, 1.03]    [MSFT, AMZN]

Eiffelbear
  • 399
  • 1
  • 4
  • 23
  • I think `bad_example_df` has the expected output. resample is used to aggregate values in timeseries. If you want to have two indices, you don't need to use resample. My point is if you want something like the output of `good_example_df`, resample is not needed. – Reza Jul 22 '20 at 17:37
  • @Reza I want the `.resample()` method to give me a consistent output, and I prefer the result to be consistent with the good example way. I made a function based on the .resample method and if it gives me the 2 types of results as it is doing now, my project cannot proceed. – Eiffelbear Jul 22 '20 at 17:54
  • In `good_example_df`, there is no aggregation, right? you have 5 rows as input and 5 rows as output. so you don't need to use `.resample()` – Reza Jul 23 '20 at 02:45

1 Answers1

1

Your output is basically just the same data, but with an additional index with the times rounded down to days. If that's what you are trying to achieve, don't use resample. You don't need it and you can just reset the index:

In[]:
bad_example_df.set_index([bad_example_df.index.floor('D'), bad_example_df.index])

Out[]:
                                return stock
2020-01-02 2020-01-02 02:43:59   1.003  AMZN
2020-01-03 2020-01-03 12:39:59   1.020  APPL
           2020-01-03 21:42:59   1.060  NVDA
2020-01-04 2020-01-04 02:53:59   1.020  MSFT
           2020-01-04 19:17:59   1.030  AMZN

OTOH, your lambda makes it look like you are trying to get the first two values for each day. If that's the case, I think apply is not what you want to use (presumably b/c of the way resample().apply() iterates, see here). Notice that if you change the first date to January first, you get even worse unexpected, output:

In[]:
third_index_list = [datetime(2020,1,1,2,43,59), datetime(2020,1,2,12,39,59),datetime(2020,1,3,21,42,59),
              datetime(2020,1,4,2,53,59), datetime(2020,1,4,19,17,59)]

terrible_example_df = pd.DataFrame(data = data_dict, index = third_index_list)
terrible_example_df.resample("D").apply(lambda x: x[:2])

Out[]:
                  return         stock
2020-01-01         1.003          AMZN
2020-01-02          1.02          APPL
2020-01-03          1.06          NVDA
2020-01-04  [1.02, 1.03]  [MSFT, AMZN]
#now the dtype is object and lots of operations will fail!

So I think your good_example_df just happens to give an expected output, and your bad_example_df just happens to give unexpected but still functional output, but both of these are a probably improper use of resample().apply(). TBH I don't understand what apply is doing differently in each example.

Instead, it looks like using groupby and groupby().apply() (different from resample apply!) can give you the output you want consistently (as far as I can tell):

In[]:
bad_example_df.groupby(pd.Grouper(freq='D')).apply(lambda x: x[:2])

Out[]: 
                                return stock
2020-01-02 2020-01-02 02:43:59   1.003  AMZN
2020-01-03 2020-01-03 12:39:59   1.020  APPL
           2020-01-03 21:42:59   1.060  NVDA
2020-01-04 2020-01-04 02:53:59   1.020  MSFT
           2020-01-04 19:17:59   1.030  AMZN
#works for terrible_example_df as well

And also TBH, here I don't understand why two indexes are created, but it seems to work!

Tom
  • 8,310
  • 2
  • 16
  • 36
  • What is the difference between `.groupby(pd.Grouper(freq='D'))` and `.resample('D')`? I personally prefer using `.resample()` to `.groupby()` when it comes to `.datetime` data because resample is simpler to use. But what is the difference between the two methods? – Eiffelbear Jan 28 '21 at 07:41
  • Short answer: I'm not sure. I can point you to [my post](https://stackoverflow.com/q/62902115/13386979) asking a similar question. I at least document some of the ways they are different, and there is a nice comment by cs95. The docs for `resample` note it as a "convenience method" - it ultimately does a `groupby` in some sense. I agree with you that `resample` seems friendlier. I think `groupby` can be better for doing more esoteric aggregation operations, though `resample` is maybe clearer for up or down sampling – Tom Jan 28 '21 at 15:01
  • Also, you can do `df.groupby(df.index.date)` in this case, which is a more readable alternative to the `pd.Grouper` – Tom Jan 28 '21 at 15:04