4

I have a mass pandas DataFrame df:

year          count
1983          5
1983          4
1983          7
...
2009          8
2009          11
2009          30

and I aim to sample 10 data points per year 100 times and get the mean and standard deviation of count per year. The signs of the count values are determined randomly.


I want to randomly sample 10 data per year, which can be done by:

new_df = pd.DataFrame(columns=['year', 'count'])
ref = df.year.unique()

for i in range(len(ref)):
  appended_df = df[df['year'] == ref[i]].sample(n=10)
  new_df = pd.concat([new_df,appended_df])

Then, I assign a sign to count randomly (so that by random chance the count could be positive or negative) and rename it to value, which can be done by:

vlist = []

for i in range(len(new_df)):
  if randint(0,1) == 0:
    vlist.append(new_df.count.iloc[i])
  else:
    vlist.append(new_df.count.iloc[i] * -1)

new_data['value'] = vlist

Getting a mean and standard deviation per each year is quite simple:

xdf = new_data.groupby("year").agg([np.mean, np.std]).reset_index()

But I can't seem to find an optimal way to try this sampling 100 times per year, store the mean values, and get the mean and standard deviation of those 100 means per year. I could think of using for loop, but it would take too much of a runtime.

Essentially, the output should be in the form of the following (the values are arbitrary here):

year      mean_of_100_means  total_sd
1983      4.22               0.43
1984      -6.39              1.25
1985      2.01               0.04
...
2007      11.92              3.38
2008      -5.27              1.67
2009      1.85               0.99

Any insights would be appreciated.

jstaxlin
  • 517
  • 4
  • 18

2 Answers2

3

I think you can use pandas groupby and sample functions together to take 10 samples from each year of your DataFrame. If you put this in a loop, then you can sample it 100 times, and combine the results.

It sounds like you only need the standard deviation of the 100 means (and you don't need the standard deviation of the sample of 10 observations), so you can calculate only the mean in your groupby and sample, then calculate the standard deviation from each of those 100 means when you are creating the total_sd column of your final DataFrame.

import numpy as np
import pandas as pd

np.random.seed(42)

## create a random DataFrame with 100 entries for the years 1980-1999, length 2000
df = pd.DataFrame({
    'year':[year for year in list(range(1980, 2000)) for _ in range(100)],
    'count':np.random.randint(1,100,size=2000)
})

list_of_means = []

## sample 10 observations from each year, and repeat this process 100 times, storing the mean for each year in a list
for _ in range(100):
    df_sample = df.groupby("year").sample(10).groupby("year").mean()
    list_of_means.append(df_sample['count'].tolist())
array_of_means = [np.array(x) for x in list_of_means]

result = pd.DataFrame({
    'year': df.year.unique(),
    'mean_of_100_means': [np.mean(k) for k in zip(*array_of_means)],
    'total_sd': [np.std(k) for k in zip(*array_of_means)]
})

This results in:

>>> result
    year  mean_of_100_means  total_sd
0   1980             50.316  8.656948
1   1981             48.274  8.647643
2   1982             47.958  8.598455
3   1983             49.357  7.854620
4   1984             48.977  8.523484
5   1985             49.847  7.114485
6   1986             47.338  8.220143
7   1987             48.106  9.413085
8   1988             53.487  9.237561
9   1989             47.376  9.173845
10  1990             46.141  9.061634
11  1991             46.851  7.647189
12  1992             49.389  7.743318
13  1993             52.207  9.333309
14  1994             47.271  8.177815
15  1995             52.555  8.377355
16  1996             47.606  8.668769
17  1997             52.584  8.200558
18  1998             51.993  8.695232
19  1999             49.054  8.178929
Derek O
  • 16,770
  • 4
  • 24
  • 43
3

Try:

def fn(x):
    _100_means = [x.sample(10).mean() for i in range(100)]
    return {
        "mean_of_100_means": np.mean(_100_means),
        "total_sd": np.std(_100_means),
    }


print(df.groupby("year")["count"].apply(fn).unstack().reset_index())

EDIT: Changed the computation of means.

Prints:

    year  mean_of_100_means   total_sd
0   1983             48.986   8.330787
1   1984             48.479  10.384896
2   1985             48.957   7.854900
3   1986             50.821  10.303847
4   1987             50.198   9.835832
5   1988             47.497   8.678749
6   1989             46.763   9.197387
7   1990             49.696   8.837589
8   1991             46.979   8.141969
9   1992             48.555   8.603597
10  1993             50.220   8.263946
11  1994             48.735   9.954741
12  1995             49.759   8.532844
13  1996             49.832   8.998654
14  1997             50.306   9.038316
15  1998             49.513   9.024341
16  1999             50.532   9.883166
17  2000             49.195   9.177008
18  2001             50.731   8.309244
19  2002             48.792   9.680028
20  2003             50.251   9.384759
21  2004             50.522   9.269677
22  2005             48.090   8.964458
23  2006             49.529   8.250701
24  2007             47.192   8.682196
25  2008             50.124   9.337356
26  2009             47.988   8.053438

The dataframe was created:

data = []
for y in range(1983, 2010):
    for i in np.random.randint(0, 100, size=1000):
        data.append({"year": y, "count": i})
df = pd.DataFrame(data)
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • Your answer definitely looks cleaner than mine, but I was wondering why the `total_sd` column has consistently higher values than mine? Our methods seem identical as far as I can tell, the only difference in our sample data being the number of observations we chose for each year – Derek O Jun 27 '21 at 07:37
  • 1
    @DerekO I re-read the OP's question and I computed the mean wrong - the OP wants to compute 100 means and then compute mean from the 100 means. I changed my function and now it looks similar. – Andrej Kesely Jun 27 '21 at 07:44
  • 1
    Ah that makes sense now. Hopefully both of our answers will help OP out! – Derek O Jun 27 '21 at 07:50