I have a very large Twitter dataset, and I want to compute the mean number of tweets per hour published by each user. I was able to group the tweets per hour per user, but how can I now compute the mean per hour?
I can't post all the code because the dataset has been heavily preprocessed. The dataset has a user_id column and a created_at column, which is the timestamp at which the tweet was published. I sorted by created_at and then grouped down to the hour:
grouped_df = tweets_df.sort_values(["created_at"]).groupby([
    tweets_df['user_id'],
    tweets_df['created_at'].dt.year,
    tweets_df['created_at'].dt.month,
    tweets_df['created_at'].dt.day,
    tweets_df['created_at'].dt.hour])
I can count the tweets per hour per user using
tweet_per_hour = grouped_df["created_at"].count()
print(tweet_per_hour)
The output of this code is
user_id     created_at  created_at  created_at  created_at
678033      2012        3           11          2             1
                                                14            1
                                                17            1
                                                18            1
                        4           13          4             1
                                                              ..
3164941860  2020        4           30          7             6
                                                9             2
                        5           1           1             2
                                                9             6
                                    2           6             1
Name: created_at, Length: 3829888, dtype: int64
where the last column is the number of tweets in that hour. For example, the row
678033 2012 3 11 2 1
indicates that user 678033 posted just 1 tweet on 2012-03-11 between 2 o'clock and 3 o'clock.
I need to take all the tweets-per-hour counts of each user and compute their mean for that user, so the output I want looks like this, for example:
user_id average_tweets_per_hour
678033 4
665353 10
How can I do it?
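One possible direction, sketched on a hypothetical miniature of tweet_per_hour (the counts and the level names year, month, day and hour are invented for illustration): since the counts are already indexed by user in the first level, grouping on that level and taking the mean gives one value per user. This assumes the mean is meant over the hours in which the user tweeted at least once, because hours without tweets never appear in the grouped result. The level is addressed by position (level=0) because in the real output four levels share the name created_at.

```python
import pandas as pd

# Hypothetical miniature of tweet_per_hour: counts indexed by
# (user_id, year, month, day, hour); the values are invented.
index = pd.MultiIndex.from_tuples(
    [
        (678033, 2012, 3, 11, 2),
        (678033, 2012, 3, 11, 14),
        (678033, 2012, 4, 13, 4),
        (665353, 2020, 4, 30, 7),
        (665353, 2020, 4, 30, 9),
    ],
    names=["user_id", "year", "month", "day", "hour"],
)
tweet_per_hour = pd.Series([1, 1, 4, 6, 2], index=index, name="created_at")

# Average the hourly counts within each user (first index level).
# Only hours with at least one tweet are present, so this is the mean
# over active hours, not over every clock hour in the period.
average_tweets_per_hour = (
    tweet_per_hour.groupby(level=0)
    .mean()
    .rename("average_tweets_per_hour")
    .reset_index()
)
print(average_tweets_per_hour)
```

If the mean should instead be taken over every clock hour between a user's first and last tweet, the empty hours would have to be filled in with zeros first, which this sketch does not do.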
EDIT: This is a reproducible example. I have df_t and df_u; df_u_new is what I want to get.
import pandas as pd
import numpy as np
df_t = pd.DataFrame({'id_t': [0, 1, 2, 3], 'id_u': [1, 1, 1, 2], 'timestamp': ["2019-06-27 11:12:32", "2019-06-27 11:14:32", "2020-07-28 11:24:32", "2020-02-27 13:30:21"]})
print(df_t)
df_u = pd.DataFrame({'id_u': [1, 2]})
print()
print(df_u)
df_u_new = pd.DataFrame({'id_u': [1, 2], 'avg_t_per_h': [2, 1]})
print()
print(df_u_new)
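For reference, here is a minimal sketch of one way the example data could be processed end to end, assuming again that "average per hour" means the mean over the hours in which a user actually tweeted. The names hour_bucket, counts and avg are introduced only for this illustration. Under that assumption user 1 gets 1.5 (two tweets in one hour, one in another) rather than the 2 shown in df_u_new, so the expected value there may be rounded or based on a different definition.

```python
import pandas as pd

df_t = pd.DataFrame({
    "id_t": [0, 1, 2, 3],
    "id_u": [1, 1, 1, 2],
    "timestamp": ["2019-06-27 11:12:32", "2019-06-27 11:14:32",
                  "2020-07-28 11:24:32", "2020-02-27 13:30:21"],
})
df_u = pd.DataFrame({"id_u": [1, 2]})

# Truncate each timestamp to the start of its hour, so a single key
# replaces the separate year/month/day/hour components.
hour_bucket = pd.to_datetime(df_t["timestamp"]).dt.floor("h")

# Tweets per user per active hour, then the mean of those counts per user.
counts = df_t.groupby(["id_u", hour_bucket]).size()
avg = counts.groupby(level="id_u").mean().rename("avg_t_per_h").reset_index()

# Left-merge so every user in df_u keeps a row (NaN if they never tweeted).
df_u_new = df_u.merge(avg, on="id_u", how="left")
print(df_u_new)
```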