Scipy - adjust National Team means based on sample size

Question

I have a population of 32 National Teams, and a parameter (a mean) that I want to measure for each team, aggregated per match.

For example: I get the mean scouts for all strikers on each team, per match, and then I take the mean (or median) over all of that team's matches.

Now, one group of teams has played 18 matches and another group has played only 8 matches in World Cup Qualifying.

I have a hypothesis that, for two teams with equal mean value, the one with the larger sample size (18) should be ranked higher.

less_than_8 = all_stats[all_stats['games']<=8]

I get values:

3     0.610759
7     0.579832
14    0.537579
20    0.346510
25    0.403606
27    0.536443

and with:

sns.displot(less_than_8, x="avg_attack",kind='kde',bw_adjust=2)

I plot:

with a mean of 0.5024547681196802

Now, for:

more_than_18 = all_stats[all_stats['games']>=18]

I get values:

0     0.148860
1     0.330585
4     0.097578
6     0.518595
8     0.220798
11    0.200142
12    0.297721
15    0.256037
17    0.195157
18    0.176994
19    0.267094
21    0.295228
22    0.248932
23    0.420940
24    0.148860
28    0.297721
30    0.350516
31    0.205128

and I plot the curve:

with a lower mean, of 0.25982701104003497.

It seems clear that sample size does affect the mean, diminishing it as size increases.

Is there a way I can adjust the means from the larger sample size AS IF they had been calculated on the smaller sample size, or vice versa, using prior and posterior assumptions?

NOTE. I have std for all teams.

There is a proposed solution for a similar problem, using empirical Bayes estimation and a beta distribution, which can be seen here: Understanding empirical Bayes estimation (using baseball statistics). But I'm not sure how the prior means could be extrapolated from successful attempts in my case.
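For reference, the linked post works with success rates (hits per at-bat), which is why it uses a beta prior. For a mean with a known standard deviation, one possible analogue is a normal prior, where each team's mean is pulled toward the overall mean in proportion to how noisy it is. The sketch below is only that, a sketch: the function name `shrink_means`, the method-of-moments prior, and the toy numbers are all assumptions made for illustration, and it presumes every team is drawn from one common population.

```python
import numpy as np

def shrink_means(means, stds, n_games):
    """Shrink each team mean toward the overall mean (normal-normal model).

    means, stds, n_games: per-team mean, per-match std, and match count.
    Assumes all stds are positive.
    """
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    n_games = np.asarray(n_games, dtype=float)

    # Sampling variance of each team's mean: smaller when more games were played.
    sampling_var = stds**2 / n_games

    # Prior estimated from the data itself (the "empirical" part).
    prior_mean = means.mean()
    # Method-of-moments estimate of the between-team variance, floored at zero.
    prior_var = max(means.var(ddof=1) - sampling_var.mean(), 0.0)

    # Weight on the team's own mean: closer to 1 when that mean is precise.
    weight = prior_var / (prior_var + sampling_var)
    return prior_mean + weight * (means - prior_mean)


# Toy numbers, invented for illustration: two pairs of teams with equal means
# but different match counts.
adjusted = shrink_means(
    means=[0.6, 0.6, 0.2, 0.2],
    stds=[0.3, 0.3, 0.3, 0.3],
    n_games=[8, 18, 8, 18],
)
print(adjusted)
```

Under these assumptions, of two above-average teams with the same raw mean, the one with 18 games keeps more of its lead than the one with 8 games, because its mean is shrunk less.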

I think this question is better suited to https://stats.stackexchange.com/. My personal understanding is that increasing sample size doesn't affect the mean monotonically - only that it will bring it closer to the population mean. — Josh Friedlander, Oct 23 '22 at 07:33

score 0 · Answer 1 · edited Nov 27 '22 at 12:13

Sample size does affect the sample mean, but not in the sense that the mean should systematically increase or decrease as the sample size grows. Rather, as the sample size grows, the sample mean and standard deviation get closer and closer to the population mean μ and standard deviation σ.

I cannot give an exact proposal without more information, such as how many data points there are per team per match and what the standard deviations of these values are. But just looking at the details, I have to presume that the 6 teams that qualified with only 8 matches somehow smashed whatever stat you have measured. (Probably this is why they only played 8 matches?)

I can make a few simple proposals, based on the fact that you want to rank these teams:
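A small simulation can illustrate this point. The population mean and standard deviation below are made-up numbers, not taken from the question's data, and `sample_means` is a hypothetical helper. If every team were drawn from the same population, teams with 8 matches and teams with 18 matches would have the same average mean, and only the spread of those means would differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of per-match values: both numbers are invented.
population_mean, population_std = 0.30, 0.15

def sample_means(n_matches: int, n_teams: int = 10_000) -> np.ndarray:
    """Simulate the mean of n_matches per-match values for many teams."""
    draws = rng.normal(population_mean, population_std, size=(n_teams, n_matches))
    return draws.mean(axis=1)

means_8 = sample_means(8)
means_18 = sample_means(18)

# Both sets of means centre on the same value; only their spread differs.
print("average of team means:", means_8.mean(), means_18.mean())
print("spread of team means: ", means_8.std(), means_18.std())
```

The spread shrinks roughly like σ/√n, so a gap as large as the one in the question is unlikely to come from sample size alone.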

Proposal 1:

Extend these stats and calculate a population mean and std for a season. (If you have prior seasons, use them as well.)
Use this mean value to rank teams, without any sample-size adjustments. This would likely result in the 6 teams landing on top.

Proposal 2:

Calculate the per-game mean across all teams (call it mean_gt), either by game number (game 01, game 02, ...) or by week (Week 01, Week 02, ...). I recommend going by week, since 6 teams have only 8 games and indexing by game number would bias their data points toward the beginning or the end.
Plot mean_gt and compare each team's mean for a given week with mean_gt (call this difference diff_gt).
diff_gt gives a better perspective on a team's performance each week, so you can take the mean of this value to rank teams. When filling in data points for the 6 teams with 8 matches, I suggest using the population mean rather than extrapolating, to keep things simple. But it's possible to get creative, for example by using the difference from the aggregate total for the 32 teams: [32*mean_gt_of_week_1 - total of the (32-x) teams]/x
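A minimal pandas sketch of Proposal 2 might look like the following. It assumes a long-format table with columns named `team`, `week` and `value`, which are hypothetical names since the real dataset's layout isn't shown. The function `rank_by_weekly_diff` and the toy data are likewise invented for illustration, and the choice to compute mean_gt from observed values before filling the gaps is one possible reading of the steps above.

```python
import pandas as pd

def rank_by_weekly_diff(long_df: pd.DataFrame) -> pd.Series:
    """Rank teams by their mean difference from the weekly all-team mean.

    long_df: one row per team per match played, with columns
    'team', 'week' and 'value' (hypothetical column names).
    """
    # Wide table: rows are teams, columns are weeks; NaN where a team did not play.
    wide = long_df.pivot_table(index="team", columns="week", values="value")

    # mean_gt: mean across the teams that actually played in each week.
    mean_gt = wide.mean(axis=0)

    # Assumption: fill missing matches with the overall mean of observed values.
    population_mean = long_df["value"].mean()
    filled = wide.fillna(population_mean)

    # diff_gt: how far each team is from that week's mean.
    diff_gt = filled.sub(mean_gt, axis=1)

    # Average diff_gt per team, highest first.
    return diff_gt.mean(axis=1).sort_values(ascending=False)


# Toy data, invented purely for illustration; team C has no week-3 match.
toy = pd.DataFrame({
    "team":  ["A", "A", "A", "B", "B", "B", "C", "C"],
    "week":  [1, 2, 3, 1, 2, 3, 1, 2],
    "value": [0.6, 0.5, 0.7, 0.2, 0.3, 0.1, 0.4, 0.4],
})
ranking = rank_by_weekly_diff(toy)
print(ranking)
```

Because every team is compared with the same weekly baseline, a team with fewer matches is neither rewarded nor penalised for the weeks it did not play, beyond whatever the fill value implies.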

I do have another idea, but I'd rather wait for feedback, as I may be way off from a simple solution for adjusting a sample mean. :)

Thank you. The mean values above are already sampled from matches. I get the mean for each match, and the values you see above are means of means, over all matches. In my dataset I also have the std calculated over all matches — 8-Bit Borges, Oct 17 '22 at 14:46
You mean the original dataset is not available to you? To implement my second proposal (which I believe is a very fair metric for ranking all teams on one scale), it's mandatory to have each team's game-by-game performance values. Sadly, if you only have the teams' means, I fear the most you can do is some extrapolation. Results would probably duplicate. — spramuditha, Oct 17 '22 at 23:00
Per-match values are aggregated from individual PLAYERS, and then TEAM means are calculated over the number of matches — 8-Bit Borges, Oct 18 '22 at 01:44
Sorry about the delay. But this means you are able to follow the path I proposed in the second section. Have you attempted it? — spramuditha, Oct 21 '22 at 00:28
