0

I have tried the code for my own data. It works when I compute the sum. However, If I assign the index to the new dataframe, an error occurred. I noticed that it's because sometimes my df have no data in between the custom_dates. I still want to assign the custom_dates as index to custom_sum.

A small adjustment to the original code:

import pandas as pd
import numpy as np
import datetime

np.random.seed(100)
df = pd.DataFrame(np.random.randint(0,100,size=(10, 1)), columns=list('A'))

df.index = pd.DatetimeIndex([datetime.date(2016,1,1),
                              datetime.date(2016,1,5),
                                 datetime.date(2016,2,1),
                             datetime.date(2016,2,2),
                              datetime.date(2016,2,5),
                                 datetime.date(2016,2,7),
                             datetime.date(2016,2,21),
                             datetime.date(2016,2,28),
                                 datetime.date(2016,2,29),
                             datetime.date(2016,3,1)
                            ])

custom_dates = pd.DatetimeIndex([datetime.date(2016,1,1),
                             datetime.date(2016,2,8),
                                 datetime.date(2016,2,10),
                             datetime.date(2016,3,1)
                            ])

custom_sum = df.groupby(custom_dates[custom_dates.searchsorted(df.index)]).sum()

And this code

custom_dates.searchsorted(df.index)

gives me

array([0, 1, 1, 1, 1, 1, 3, 3, 3, 3], dtype=int64)

That's exactly "my df have no data in between the custom_dates" because df have no data in between datetime.date(2016,2,8) and datetime.date(2016,2,10)

Now if I assign the custom_dates as index to custom_sum.

custom_sum.index = custom_dates

The following error occurred:

ValueError: Length mismatch: Expected axis has 3 elements, new values have 4 elements

As for my own data. My custom_dates gives

dtype='datetime64[ns]', name='date_time', length=46899, freq=None

and my df.index gives

dtype='datetime64[ns]', name='time_index', length=6363585, freq=None

I would expect all the actual dates in custom_sum by custom_sum = df.groupby(custom_dates[custom_dates.searchsorted(df.index)]).sum() However, the code:

df.groupby(custom_dates[custom_dates.searchsorted(df.index)]).sum()

gives an error

IndexError: index 46899 is out of bounds for axis 0 with size 46899

I can only do the

custom_dates.searchsorted(df.index)

which gives

array([    0,     0,     0, ..., 46899, 46899, 46899], dtype=int64)

but there is no actual dates. So my question is why should I get an error in df.groupby(custom_dates[custom_dates.searchsorted(df.index)]).sum() but it works for the example? Am I missing anything here? Any suggestions/comments? Thanks!

Jian Xu
  • 15
  • 6
  • It looks like you have more dates than sums. Are you sure you have the right dates in ```custom_dates```? – Chaos_Is_Harmony Aug 10 '21 at 22:56
  • [Please do not upload images of code/errors when asking a question.](//meta.stackoverflow.com/q/285551) – Michael Delgado Aug 11 '21 at 00:04
  • @MichaelDelgado I don't think I have uploaded images of code/errors. What I add was the link to the profile of jezrael – Jian Xu Aug 11 '21 at 08:01
  • @Chaos_Is_Harmony Yes, I agree with you. It's because sometimes my df have no data in between the custom_dates that results in more dates than sums. And I have the right dates in 'custom_dates'. – Jian Xu Aug 11 '21 at 08:11
  • @JianXu But that's a problem: the two can't be unequal in length... So is your actual question regarding how to drop the rows that don't match your ```custom_dates```? – Chaos_Is_Harmony Aug 11 '21 at 09:49
  • @Chaos_Is_Harmony Yes. Do you have any idea how to drop the rows that don't match `custom_dates`? Then the length can match. And then I can assign the new `custom_dates` to `custom_sum.index`. – Jian Xu Aug 11 '21 at 10:30
  • try this: ```df = df.drop(~df.index.isin(custom_dates))``` – Chaos_Is_Harmony Aug 11 '21 at 10:34
  • @Chaos_Is_Harmony this code gives an error: `KeyError: '[False True True True True True True True True False] not found in axis'` – Jian Xu Aug 11 '21 at 10:37
  • Should it be the other way, because `custom_dates` has more rows than `custom_sum`. So it's the rows in `custom__dates` be dropped. – Jian Xu Aug 11 '21 at 10:40
  • Yeah, mea culpa; wrong syntax. I think the following should work: ```df = df[df.index.isin(custom_dates)]``` – Chaos_Is_Harmony Aug 11 '21 at 10:41
  • @Chaos_Is_Harmony sytnax is right now, but still got the error: `ValueError: Length mismatch: Expected axis has 2 elements, new values have 4 elements`. It's the rows in `custom__dates` should be dropped – Jian Xu Aug 11 '21 at 10:48
  • hmmm... So I finally had a chance to run all your code and see that ```custom_sum``` only has 3 rows, but you have 4 dates in ```custom_dates```. The dates I'm getting when just printing out ```custom_sum``` does not have ```datetime.date(2016,2,10)```. Remove that from your ```custom_dates``` and I suspect it will work. – Chaos_Is_Harmony Aug 11 '21 at 10:54
  • @Chaos_Is_Harmony Yeah, That's the problem! I can easily remove that `datetime.date(2016,2,10)` here. But I have 600m rows for my own df data and 5m rows in custom_dates. It's almost impossible to do it one by one manually. Can I do it automatically? – Jian Xu Aug 11 '21 at 10:59
  • Well, the result of the ```sum()``` operation leaves you with a column of all the actual dates. Can't you just use those as the index? Or are you trying to see which of the dates are in the ```custom_dates``` ```Series```? – Chaos_Is_Harmony Aug 11 '21 at 11:06
  • @Chaos_Is_Harmony Exactly! But if I use my own data. My `custom_dates` gives me `dtype='datetime64[ns]', name='date_time', length=46899, freq=None` and my `df.index` gives me `dtype='datetime64[ns]', name='time_index', length=6363585, freq=None`. I would expect all the actual dates in the `sum()` operation. However, the operation of `df.groupby(custom_dates[custom_dates.searchsorted(df.index)]).sum()` gives an error `IndexError: index 46899 is out of bounds for axis 0 with size 46899` I can only do the `custom_dates.searchsorted(df.index)` but there is no actual dates. I'm so desperate. – Jian Xu Aug 13 '21 at 09:34
  • Remember that Python is zero indexed, meaning its index starts at 0, not 1. One thing to keep in mind is that ```searchsorted()``` will return where you should insert a given ```value```. This means that you may need to insert it at the end (in your case, index 46899). If you notice, your ```length=46899```, that means it goes only until index 46898. A classic off by one error. That's why you're getting an index out of bounds error: because there is nothing there. A simpler example: ```arr = [1,2,3] print(arr[3]) IndexError: index 3 is out of bound ``` – Chaos_Is_Harmony Aug 13 '21 at 10:50
  • 1
    @Chaos_Is_Harmony Thanks for your help!!! Problem solved after inserting index 46899. – Jian Xu Aug 13 '21 at 13:53

1 Answers1

1

In this case, the error is saying that there are only 3 items in custom_sum, whereas custom_dates lists 4 dates. Removing the errant date (datetime.date(2016,2,10) in this case) should solve the dimension issue.

But in general, to save a new DataFrame with only the rows that meet a certain condition, you can use:

new_df = custom_sum[custom_sum.index.isin(custom_dates)]

There's a way to do it with DataFrame.drop() as well. Not sure which is more efficient or desirable. But I would suspect that doing it with df.drop() and using the inplace=True parameter would likely save on memory given that it won't create a new DataFrame object--though someone correct me if I'm wrong on that assumption.

Chaos_Is_Harmony
  • 498
  • 5
  • 17