I have a massive pandas DataFrame df:
year count
1983 5
1983 4
1983 7
...
2009 8
2009 11
2009 30
and I aim to sample 10 data points per year, 100 times, and get the mean and standard deviation of count per year. The sign of each count value is determined randomly.
I want to randomly sample 10 data points per year, which can be done by:
import pandas as pd

new_df = pd.DataFrame(columns=['year', 'count'])
ref = df.year.unique()
for i in range(len(ref)):
    # draw 10 random rows for the i-th year and append them
    appended_df = df[df['year'] == ref[i]].sample(n=10)
    new_df = pd.concat([new_df, appended_df])
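(If it matters, I believe the same sampling can also be done in one call with GroupBy.sample, assuming pandas 1.1 or newer:)
# equivalent sampling in one call (assumes pandas >= 1.1);
# group_keys=False keeps the original flat row index
new_df = df.groupby('year', group_keys=False).sample(n=10)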
Then, I assign a sign to each count randomly (so that by random chance the count could be positive or negative) and store it as value, which can be done by:
from random import randint

vlist = []
for i in range(len(new_df)):
    # flip a fair coin for each row: keep the count or negate it
    if randint(0, 1) == 0:
        vlist.append(new_df['count'].iloc[i])
    else:
        vlist.append(new_df['count'].iloc[i] * -1)
new_df['value'] = vlist
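(For reference, I think the same sign flip can be done without the loop; signs is just an illustrative name here:)
import numpy as np

# draw +1 or -1 for every row at once and multiply
signs = np.random.choice([1, -1], size=len(new_df))
new_df['value'] = new_df['count'].to_numpy() * signs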
Getting a mean and standard deviation per year is quite simple:
import numpy as np

xdf = new_df.groupby("year").agg([np.mean, np.std]).reset_index()
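(Or, restricted to the value column and using string aggregator names, which I believe newer pandas prefers:)
# aggregate only the 'value' column; 'mean'/'std' avoid deprecation warnings
xdf = new_df.groupby('year')['value'].agg(['mean', 'std']).reset_index()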
But I can't seem to find an efficient way to repeat this sampling 100 times per year, store the 100 mean values, and get the mean and standard deviation of those 100 means per year. I could use a for loop (a rough sketch is below), but I worry it would take too much runtime.
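This is roughly the brute-force loop I have in mind (means_per_run, runs, and result are placeholder names; the one-call sampling again assumes pandas >= 1.1):
import numpy as np
import pandas as pd

means_per_run = []
for run in range(100):
    # sample 10 rows per year, flip signs, take the per-year mean of 'value'
    sampled = df.groupby('year', group_keys=False).sample(n=10)
    signs = np.random.choice([1, -1], size=len(sampled))
    sampled['value'] = sampled['count'].to_numpy() * signs
    means_per_run.append(sampled.groupby('year')['value'].mean())

# 100 per-year means side by side: rows are years, columns are runs
runs = pd.concat(means_per_run, axis=1)
result = pd.DataFrame({
    'mean_of_100_means': runs.mean(axis=1),
    'total_sd': runs.std(axis=1),
}).reset_index()  # the index is named 'year', so it becomes a 'year' column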
Essentially, the output should be in the following form (the values are arbitrary here):
year mean_of_100_means total_sd
1983 4.22 0.43
1984 -6.39 1.25
1985 2.01 0.04
...
2007 11.92 3.38
2008 -5.27 1.67
2009 1.85 0.99
Any insights would be appreciated.