
I'm trying to take a series of tweets and group them into 1 hour intervals based on when the tweets were created, and sum the likes the tweets got for each 1 hour interval.

The tweets have been converted to a pandas dataframe, eg:

df.head(1)
    author_id   username    author_followers    author_tweets   author_description  author_location text    created_at  lang    tweet_id    retweets    replies likes   quotes
0   2395138046  WorldCoinIndex  12832   46121   Cryptocurrency index | prices | 24hr volume | ...   None    Cryptocurrencies $ETH $LTC $DASH $XMR $ZCASH h...   2022-02-11 23:59:38+00:00   en  1492287240990507009 0   1   0   0

EXPECTATION

The code I'm applying to the above dataframe:

df.likes.resample('H', on='created_at').sum()

My understanding is that likes specifies the column to be summed, 'H' specifies the 1 hour time intervals, and the on parameter names created_at as the time series key to resample over.

RESULTING ERROR MESSAGE

KeyError: 'The grouper name created_at is not found'

ASSESSMENT

When I search that error message, I see mostly references for the groupby method, which I considered, but figured Time Series would be simpler.

Shouldn't it return an index error if it's the 'created_at' parameter that's problematic?

Henry Ecker
dsx

1 Answer


Based on the documentation for resample:

on : str, optional. For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.

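When you select .likes first, you get a Series that holds only the likes values. It no longer has a created_at column, so the column named in on cannot be found and pandas raises the KeyError. The on parameter only works when resample is called on the DataFrame that contains that column. Example: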
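To make the point about the Series concrete, here is a minimal sketch. The toy frame and its values are made up for illustration and only mimic the shape of the question's data. It uses the lowercase 'h' alias, which newer pandas versions prefer over 'H'.

```python
import pandas as pd

# Toy frame shaped like the question's data (values are invented for illustration).
df = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2022-02-11 22:10:00", "2022-02-11 22:50:00", "2022-02-11 23:59:38"],
        utc=True,
    ),
    "likes": [3, 4, 5],
})

likes = df.likes  # a Series: the likes values plus the default integer index

# The DataFrame knows about created_at, the Series does not.
print("created_at" in df.columns)  # True
print(type(likes).__name__)        # Series

# Resampling the Series with on='created_at' looks for a key the Series
# does not have, which is where the KeyError comes from.
error_message = None
try:
    likes.resample("h", on="created_at").sum()
except KeyError as exc:
    error_message = str(exc)

print(error_message)
```

This also suggests why the error is a KeyError rather than an index error: pandas is looking up a name, not a position.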

import pandas as pd

# 'T' is the minute alias; newer pandas versions prefer 'min'
index = pd.date_range('1/1/2000', periods=9, freq='T')
df = pd.DataFrame({'likes': range(9), 'user': ['ali'] * 9}, index=index)
df['create on'] = df.index

This produces the error:

df.likes.resample('3T', on='create on').sum()

The right way is to call resample on the DataFrame itself:

df.resample('3T', on='create on').sum()

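Applied to data shaped like the question's, the same idea can be sketched as follows. The usernames, timestamps, and like counts are invented for illustration. The sketch resamples the DataFrame on created_at, then selects likes before summing. As an alternative, it also moves created_at into the index so that the Series can be resampled directly.

```python
import pandas as pd

# Invented sample rows; only the column names follow the question.
df = pd.DataFrame({
    "username": ["a", "b", "c"],
    "created_at": pd.to_datetime(
        ["2022-02-11 22:10:00", "2022-02-11 22:50:00", "2022-02-11 23:59:38"],
        utc=True,
    ),
    "likes": [3, 4, 5],
})

# Resample the DataFrame on the datetime column, then pick the column to sum.
hourly_likes = df.resample("h", on="created_at")["likes"].sum()
print(hourly_likes)

# Alternative: make created_at the index, after which the Series itself
# carries the timestamps and can be resampled without `on`.
hourly_alt = df.set_index("created_at")["likes"].resample("h").sum()
print(hourly_alt)
```

Both calls put the two tweets from the 22:00 hour into one bin and the 23:59:38 tweet into the next.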

keramat
  • Something like `df.resample('3T', on='created_at')['likes'].sum()` is more likely what OP is looking for. Without specifying the column (or columns), pandas will try to apply `sum` to all columns in the DataFrame, which is likely to fail given all of the different column types in the shown sample data. – Henry Ecker Feb 13 '22 at 06:45
  • @HenryEcker it sums numeric only data, why would it fail? In fact @keramat's data has nonnumerics and it works –  Feb 13 '22 at 07:39
  • @does it matter, he means with main question example. – keramat Feb 13 '22 at 07:42
  • @keramat doesnT matter –  Feb 13 '22 at 07:43
  • @does it matter, You are right, but I prefer to restrict the result as much as possible. – keramat Feb 13 '22 at 07:45
  • @keramat yeah you're not doing that currently. my objection is to failure not to what to keep and not. –  Feb 13 '22 at 07:46
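The column selection discussed in the comments can be sketched like this. How `sum` treats text or None columns may differ between pandas versions, so naming the columns explicitly sidesteps the question. The data is invented for illustration. Since the question mentions groupby, the equivalent `pd.Grouper` form is shown for comparison.

```python
import pandas as pd

# Invented rows mixing text, missing values, and numeric columns.
df = pd.DataFrame({
    "username": ["a", "b", "c"],
    "author_location": [None, None, "Berlin"],
    "created_at": pd.to_datetime(
        ["2022-02-11 22:10:00", "2022-02-11 22:50:00", "2022-02-11 23:59:38"],
        utc=True,
    ),
    "likes": [3, 4, 5],
    "retweets": [1, 0, 2],
})

# Select only the columns that should be summed; a list keeps a DataFrame.
hourly = df.resample("h", on="created_at")[["likes", "retweets"]].sum()
print(hourly)

# The groupby spelling of the same hourly binning, using pd.Grouper.
grouped = df.groupby(pd.Grouper(key="created_at", freq="h"))[["likes", "retweets"]].sum()
print(grouped)
```

Restricting the result this way keeps the text and location columns out of the aggregation altogether.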