I have tried the code on my own data. Computing the sum works. However, if I assign the index to the new dataframe, an error occurs.
I noticed that this happens because sometimes my df has no data in between the custom_dates. I still want to assign custom_dates as the index of custom_sum.
A small adjustment to the original code:
import pandas as pd
import numpy as np
import datetime
# Reproducible random data: 10 rows, one column 'A'
np.random.seed(100)
df = pd.DataFrame(np.random.randint(0,100,size=(10, 1)), columns=list('A'))
df.index = pd.DatetimeIndex([datetime.date(2016,1,1),
                             datetime.date(2016,1,5),
                             datetime.date(2016,2,1),
                             datetime.date(2016,2,2),
                             datetime.date(2016,2,5),
                             datetime.date(2016,2,7),
                             datetime.date(2016,2,21),
                             datetime.date(2016,2,28),
                             datetime.date(2016,2,29),
                             datetime.date(2016,3,1)
                             ])
# Custom boundaries; df has no rows between 2016-02-08 and 2016-02-10
custom_dates = pd.DatetimeIndex([datetime.date(2016,1,1),
                                 datetime.date(2016,2,8),
                                 datetime.date(2016,2,10),
                                 datetime.date(2016,3,1)
                                 ])
# Group each row under the first custom date that is >= the row's date, then sum
custom_sum = df.groupby(custom_dates[custom_dates.searchsorted(df.index)]).sum()
And this code
custom_dates.searchsorted(df.index)
gives me
array([0, 1, 1, 1, 1, 1, 3, 3, 3, 3], dtype=int64)
Position 2 never appears in this result. That is exactly the "no data in between the custom_dates" case: df has no rows after datetime.date(2016,2,8) and up to datetime.date(2016,2,10).
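To make the gap explicit, here is a minimal sketch that lists the custom dates receiving no rows. It reuses the dates from the example; the names df_index, unused and empty_dates are made up for illustration, and the check assumes custom_dates is sorted.

```python
import numpy as np
import pandas as pd

# Same dates as in the example above (df_index stands in for df.index).
df_index = pd.to_datetime([
    "2016-01-01", "2016-01-05", "2016-02-01", "2016-02-02", "2016-02-05",
    "2016-02-07", "2016-02-21", "2016-02-28", "2016-02-29", "2016-03-01",
])
custom_dates = pd.to_datetime(
    ["2016-01-01", "2016-02-08", "2016-02-10", "2016-03-01"]
)

# For each row, the position of the first custom date that is >= the row's date.
positions = custom_dates.searchsorted(df_index)

# Any custom date whose position never shows up gets no rows in the groupby.
unused = ~np.isin(np.arange(len(custom_dates)), positions)
empty_dates = custom_dates[unused]
print(empty_dates)
```

With these dates only 2016-02-10 ends up in empty_dates, which matches the missing position 2.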
Now, if I assign custom_dates as the index of custom_sum:
custom_sum.index = custom_dates
the following error occurs:
ValueError: Length mismatch: Expected axis has 3 elements, new values have 4 elements
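One possible way around the length mismatch, sketched here and not necessarily the intended approach of the original code, is to reindex the grouped result against custom_dates instead of overwriting its index. This assumes that an interval with no rows should show a sum of 0; the fixed values in column A and the name full_sum are made up for illustration.

```python
import pandas as pd

# Fixed values 1..10 instead of random ones, so the sums are easy to check.
df = pd.DataFrame(
    {"A": range(1, 11)},
    index=pd.to_datetime([
        "2016-01-01", "2016-01-05", "2016-02-01", "2016-02-02", "2016-02-05",
        "2016-02-07", "2016-02-21", "2016-02-28", "2016-02-29", "2016-03-01",
    ]),
)
custom_dates = pd.to_datetime(
    ["2016-01-01", "2016-02-08", "2016-02-10", "2016-03-01"]
)

# The groupby only produces rows for custom dates that actually received data.
custom_sum = df.groupby(custom_dates[custom_dates.searchsorted(df.index)]).sum()
print(len(custom_sum))  # 3 groups, although there are 4 custom dates

# reindex aligns on the date labels and fills the missing date with 0
# (assumption: an empty interval should count as a sum of 0).
full_sum = custom_sum.reindex(custom_dates, fill_value=0)
print(full_sum)
```

Because reindex aligns by label, each sum stays attached to its own date, whereas assigning a new index positionally would shift values even if the lengths happened to match.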
As for my own data, my custom_dates gives
dtype='datetime64[ns]', name='date_time', length=46899, freq=None
and my df.index gives
dtype='datetime64[ns]', name='time_index', length=6363585, freq=None
I would expect to get all the actual dates in custom_sum from
custom_sum = df.groupby(custom_dates[custom_dates.searchsorted(df.index)]).sum()
However, on my own data the call
df.groupby(custom_dates[custom_dates.searchsorted(df.index)]).sum()
raises an error:
IndexError: index 46899 is out of bounds for axis 0 with size 46899
The only part I can run on its own is
custom_dates.searchsorted(df.index)
which gives
array([ 0, 0, 0, ..., 46899, 46899, 46899], dtype=int64)
but these are only positions, not actual dates. So my question is: why do I get an error in df.groupby(custom_dates[custom_dates.searchsorted(df.index)]).sum() on my own data, when it works for the example?
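For what it's worth, the small sketch below seems to reproduce the same kind of IndexError. It assumes, and I have not verified this on my real data, that some timestamps in df.index are later than the last entry of custom_dates; late_index is a made-up index for illustration.

```python
import pandas as pd

custom_dates = pd.to_datetime(
    ["2016-01-01", "2016-02-08", "2016-02-10", "2016-03-01"]
)
# Hypothetical index whose last timestamp falls after the last custom date.
late_index = pd.to_datetime(["2016-01-01", "2016-02-09", "2016-03-02"])

# A value larger than every custom date gets position len(custom_dates),
# which is one past the last valid position.
positions = custom_dates.searchsorted(late_index)
print(positions)

try:
    custom_dates[positions]
    raised = False
except IndexError as exc:
    raised = True
    print(exc)
```

Here the last position equals len(custom_dates), just like the 46899 values in my output above, so indexing custom_dates with it fails before groupby is even reached.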
Am I missing anything here? Any suggestions/comments? Thanks!