
In R I have 2 datasets: group1 and group2.

For group1 I have 10 rows: game_id, which is the id of a game, and number, which is the number of times that game has been played in group1.

So if we type

group1

we get this output

game_id  number
1        758565
2        235289
...
10       87084

For group2 we get

game_id  number
1        79310
2        28564
...
10       9048

If I want to test whether there is a statistical difference between group1 and group2 for the first 2 game_ids, I can use a Pearson chi-squared test.

In R I simply create the matrix

# The first 2 'numbers' in group1
a <- c( group1[1,2] , group1[2,2] )
# The first 2 'numbers' in group2
b <- c( group2[1,2], group2[2,2] )
# Combining them in matrix form
m <- rbind(a,b)

So m gives us

a 758565 235289
b  79310  28564

Here I can test H: "a and b are independent", i.e. that the split between game 1 and game 2 is the same in both groups; rejecting H would mean that, relative to game 2, users in group1 play game 1 more (or less) than users in group2 do.

In R we type chisq.test(m) and get a very low p-value, so we can reject H: a and b are not independent.
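
A quick sanity check of what the test picks up (a sketch using the counts from the tables above):

m <- rbind(c(758565, 235289),   # group1: games 1 and 2
           c(79310,  28564))    # group2: games 1 and 2

m[, 1] / rowSums(m)  # game 1's share of the pair: ~0.763 vs ~0.735
chisq.test(m)        # with counts this large, the p-value is far below 0.05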

How should one find the game_ids that are played significantly more in group1 than in group2?

Ole Petersen
  • Your chi-squared test is statistically invalid – alexwhitworth Sep 29 '15 at 11:13
  • Why is it not valid? – Ole Petersen Sep 29 '15 at 11:27
  • Because it violates the assumptions of a Pearson chi-squared test. The events in your contingency table must be mutually exclusive and sum to 1; you're only considering a partial table vs. the full 10-game table that you should be using (that is, assuming that players can only play games 1-10 and that these aren't merely the counts of the top 10 ranks). – alexwhitworth Sep 29 '15 at 11:33
  • I don't know how you classified users as good/bad, but for each game you need to know a) how many good users played and how many good users didn't play, b) how many bad users played and how many bad users didn't play. Then you can compare those percentages for each game. – AntoniosK Sep 29 '15 at 11:41
  • So how should I then approach the problem? If my contingency table should sum to one, I could make a new column that shows the percentage a fixed game_id has been played in group1. For example, for game_id 1 we get 758565/sum(group1[,2]) = 9%. If I do this for all game_ids, it sums to 1. – Ole Petersen Sep 29 '15 at 11:43
  • By doing that you just sum all the people that played all games to find out the total number of good users. That assumes that no player played more than one game. I think you double count players like this. You have to be able to find the unique number of total good and bad users you have in this analysis. – AntoniosK Sep 29 '15 at 11:50
  • The number of good players (their data is in 'group1') is 6963 and the number of bad users (their data is in 'group2') is 4217. But it should be enough just to look at the 'number' or the percentage a game_id has been played in a group and compare these? So game_id 1 covers 9% of all 'numbers' in group 1 and 4% in group 2. Then I should be able to make the test chisq.test(rbind(c(0.09, 1-0.09), c(0.04, 1-0.04))), right? – Ole Petersen Sep 29 '15 at 11:56
  • `chisq.test` must take counts as input, not the percentages you have here. I think you're right about the number of users, as the `number` you have in your 2nd column is the number of games played (by any user). I'll have a look. – AntoniosK Sep 29 '15 at 12:05
  • Sorry, I mistyped earlier (it was 4am and I couldn't sleep). It's not that the counts sum to 1 but that the P(event outcomes) sum to 1 (i.e., the categories are mutually exclusive and completely partition the sample space). You can't run a chi-squared test on proportions... – alexwhitworth Sep 29 '15 at 15:47
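
To illustrate the last two comments: chisq.test on raw proportions effectively sees a sample of size 2 per row, so the shares have to be turned back into counts first. A minimal sketch (the group totals n are hypothetical):

p <- c(0.09, 0.04)   # game 1's share in each group, from the comments above
n <- c(1000, 1000)   # hypothetical group totals

# Wrong: proportions carry no sample-size information; R warns that the
# approximation may be incorrect and the p-value is meaningless
chisq.test(rbind(c(p[1], 1 - p[1]),
                 c(p[2], 1 - p[2])))

# Right: convert the shares back to counts before testing
chisq.test(cbind(p * n, (1 - p) * n))  # rows = groups, cols = played / not played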

1 Answer


I created a simpler version with only 3 games. I'm using a chi-squared test and a proportions-comparison test. Personally, I prefer the second one, as it gives you an idea about what percentages you're comparing. Run the script and make sure you understand the process.

# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
                       number_games = c(758565,235289,87084))

dt_group1

#   game_id number_games
# 1       1       758565
# 2       2       235289
# 3       3        87084


# add extra variables
dt_group1$number_rest_games = sum(dt_group1$number_games) - dt_group1$number_games   # needed for chisq.test
dt_group1$number_all_games = sum(dt_group1$number_games)  # needed for prop.test
dt_group1$Prc = dt_group1$number_games / dt_group1$number_all_games  # just to get an idea about the percentages

dt_group1

#   game_id number_games number_rest_games number_all_games        Prc
# 1       1       758565            322373          1080938 0.70176550
# 2       2       235289            845649          1080938 0.21767113
# 3       3        87084            993854          1080938 0.08056336



# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
                       number_games = c(79310,28564,9048))

# add extra variables
dt_group2$number_rest_games = sum(dt_group2$number_games) - dt_group2$number_games
dt_group2$number_all_games = sum(dt_group2$number_games)
dt_group2$Prc = dt_group2$number_games / dt_group2$number_all_games




# input the game id you want to investigate
input_game_id = 1

# create a table of successes (games played) and failures (games not played)
dt_test = rbind(c(dt_group1$number_games[dt_group1$game_id==input_game_id], dt_group1$number_rest_games[dt_group1$game_id==input_game_id]),
                c(dt_group2$number_games[dt_group2$game_id==input_game_id], dt_group2$number_rest_games[dt_group2$game_id==input_game_id]))

# perform chi sq test
chisq.test(dt_test)

# Pearson's Chi-squared test with Yates' continuity correction
# 
# data:  dt_test
# X-squared = 275.9, df = 1, p-value < 2.2e-16


# create a vector of successes (games played) and vector of total games
x = c(dt_group1$number_games[dt_group1$game_id==input_game_id], dt_group2$number_games[dt_group2$game_id==input_game_id])
y = c(dt_group1$number_all_games[dt_group1$game_id==input_game_id], dt_group2$number_all_games[dt_group2$game_id==input_game_id])

# perform test of proportions
prop.test(x,y)

# 2-sample test for equality of proportions with continuity correction
# 
# data:  x out of y
# X-squared = 275.9, df = 1, p-value < 2.2e-16
# alternative hypothesis: two.sided
# 95 percent confidence interval:
#   0.02063233 0.02626776
# sample estimates:
#   prop 1    prop 2 
# 0.7017655 0.6783155 

The main thing is that chisq.test compares counts/proportions, so you need to provide the number of "successes" and "failures" for the groups you compare (a contingency table as input). prop.test is another counts/proportions test, to which you provide the number of "successes" and the "totals".
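
To make that concrete, here are the two equivalent calls side by side for game 1 (both reproduce the X-squared = 275.9 shown above):

# chisq.test takes a contingency table of successes and failures
chisq.test(rbind(c(758565, 1080938 - 758565),   # group 1: game 1 vs the rest
                 c(79310,  116922 - 79310)))    # group 2: game 1 vs the rest

# prop.test takes vectors of successes and totals
prop.test(c(758565, 79310), c(1080938, 116922))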

Now that you're happy with the result and have seen how the process works, I'll add a more efficient way to perform those tests.

The first one uses the dplyr and broom packages:

library(dplyr)
library(broom)

# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
                       number_games = c(758565,235289,87084),
                       group_id = 1)  ## adding the id of the group

# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
                       number_games = c(79310,28564,9048),
                       group_id = 2)  ## adding the id of the group

# combine datasets
dt = rbind(dt_group1, dt_group2)


dt %>%
  group_by(group_id) %>%                                           # for each group id
  mutate(number_all_games = sum(number_games),                     # create new columns
         number_rest_games = number_all_games - number_games,
         Prc = number_games / number_all_games) %>%
  group_by(game_id) %>%                                            # for each game
  do(tidy(prop.test(.$number_games, .$number_all_games))) %>%      # perform the test
  ungroup()


#   game_id  estimate1  estimate2 statistic      p.value parameter     conf.low    conf.high
#     (int)      (dbl)      (dbl)     (dbl)        (dbl)     (dbl)        (dbl)        (dbl)
# 1       1 0.70176550 0.67831546 275.89973 5.876772e-62         1  0.020632330  0.026267761
# 2       2 0.21767113 0.24429962 435.44091 1.063385e-96         1 -0.029216006 -0.024040964
# 3       3 0.08056336 0.07738492  14.39768 1.479844e-04         1  0.001558471  0.004798407

The other one uses the data.table and broom packages:

library(data.table)
library(broom)

# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
                       number_games = c(758565,235289,87084),
                       group_id = 1)  ## adding the id of the group

# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
                       number_games = c(79310,28564,9048),
                       group_id = 2)  ## adding the id of the group

# combine datasets
dt = data.table(rbind(dt_group1, dt_group2))

# create new columns for each group
dt[, number_all_games := sum(number_games), by=group_id]

dt[, `:=`(number_rest_games = number_all_games - number_games,
          Prc = number_games / number_all_games) , by=group_id]

# for each game id compare percentages
dt[, tidy(prop.test(.SD$number_games, .SD$number_all_games)) , by=game_id]


#    game_id  estimate1  estimate2 statistic      p.value parameter     conf.low    conf.high
# 1:       1 0.70176550 0.67831546 275.89973 5.876772e-62         1  0.020632330  0.026267761
# 2:       2 0.21767113 0.24429962 435.44091 1.063385e-96         1 -0.029216006 -0.024040964
# 3:       3 0.08056336 0.07738492  14.39768 1.479844e-04         1  0.001558471  0.004798407

You can see that each row represents one game and the comparison is between groups 1 and 2. You can get the p-values from the corresponding column, as well as other info about the test/comparison.
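
To answer the original question directly - which games are played significantly more in group 1 than in group 2 - you can filter the tidied results. A sketch, assuming the output of the dplyr pipeline above is saved as results (the 0.05 cutoff is a convention, and with 10 games it's worth adjusting the p-values for multiple comparisons):

results = dt %>%
  group_by(group_id) %>%
  mutate(number_all_games = sum(number_games)) %>%
  group_by(game_id) %>%
  do(tidy(prop.test(.$number_games, .$number_all_games))) %>%
  ungroup()

# estimate1 / estimate2 are the group 1 / group 2 proportions, so keep
# the games where group 1's share is significantly higher
results %>%
  mutate(p.adj = p.adjust(p.value, method = "BH")) %>%  # multiple-testing adjustment
  filter(p.adj < 0.05, estimate1 > estimate2)

With the 3-game data above this returns games 1 and 3.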

AntoniosK
  • Thanks. It all makes sense. I tried it on all my 10 game_ids and got a low p-value for all of them. This means that for no game do we have independence between group1 and group2. I find this a little strange. – Ole Petersen Sep 29 '15 at 14:18
  • Don't forget that when you deal with statistical significance and p-values, any difference, no matter how small it is, can be found statistically significant given that you have a large (enough) number of observations. That's why there's the concept of "efficient sample size" when you want to perform those comparisons. Have a look at that and how to design experiments for percentage comparisons. – AntoniosK Sep 29 '15 at 15:02
  • I have a lot of data so the power or sample size should not be a problem. I see it this way: For each game there simply is a dependence between the two groups. So bad users have some "popular" games and good users have their popular games. None of the games are equally popular for both groups. – Ole Petersen Sep 29 '15 at 15:11
  • Exactly because you have lots of data you should expect even small differences to be captured/classified as statistically significant. I'm also planning to update my answer (add a more efficient way) now that you are fine with the results. – AntoniosK Sep 29 '15 at 15:20
  • In terms of interpretation you're right. Some games are more likely to be played by good users and other by bad users. BUT, it is important when you report your findings to clarify that in your case you have lots of data and you expect to classify small differences as statistically significant differences. There's nothing wrong with that, but maybe your company won't care about a very small statistically significant difference. That's the difference between "statistically significant impact" and "real impact". – AntoniosK Sep 29 '15 at 15:31
  • How can one then decide when it's a small difference and when it's a real impact? – Ole Petersen Sep 30 '15 at 05:32
  • The point is that you have to decide in advance what you consider a real impact, and you try to collect as many observations (efficient sample size) as you need in order to capture that impact/difference as statistically significant. You should search for "experiment design", "AB test design", etc. to see how it works. What you can do after the analysis, without an experiment design, is to report your findings as mentioned above and let the company decide how to deal with small statistically significant differences. – AntoniosK Sep 30 '15 at 08:46
  • Because I have a large sample size, you say it's more likely for small differences to show up as statistically significant. Why is that? A large sample size should not make the "dependence" more likely, I think. – Ole Petersen Sep 30 '15 at 09:59
  • It's all up to how the tests work and how the power and efficient sample size are defined. As an example, compare the percentages 30% vs. 70% when you have 10 observations per group and when you have 100 per group, like `prop.test(c(3,7),c(10,10))` and `prop.test(c(30,70),c(100,100))`. The same happens with small differences when you compare thousands and tens of thousands. – AntoniosK Sep 30 '15 at 10:04
  • It makes sense. The way I see it: we have "dependence" for a large sample size. Doing the same for a small sample size, we may not get "dependence" because it's "hidden" - we have not found it yet. I tried to run some dependence tests in R on random vectors (using the sample function). If I increase the size in this case I never get "dependence" because there is none - so in that case it was not "hidden". – Ole Petersen Sep 30 '15 at 10:30
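
A base-R way to see the sample-size point discussed above (a sketch with power.prop.test from the stats package; the 9% vs 9.5% shares are made up for illustration):

# observations per group needed to detect 9% vs 9.5%
# with 80% power at the 5% significance level
power.prop.test(p1 = 0.090, p2 = 0.095, power = 0.80, sig.level = 0.05)
# n comes out in the tens of thousands per group, so with samples of this
# size even a half-point difference reads as statistically significant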