| Age category | False positive count | Total count | Proportion (FP / total), % |
|---|---|---|---|
| 40 - 45 | 25 | 100 | 25.0 |
| 45 - 50 | 25 | 68 | 36.8 |
| 50 - 55 | 50 | 250 | 20.0 |
| 55 - 60 | 82 | 317 | 25.9 |

I have this data frame in R. It shows the false positive count and the total count within each age category. I have also added a column with the calculated percentage of false positives out of the total count.

Essentially, I want to be able to plot this as a graph - which I can do.


graph <- ggplot(data = hi, aes(x = age_categories, y = prop)) +
  geom_bar(stat = "identity", fill = "light blue") +
  labs(x = "Age category", y = "Percentage of false positives",
       title = "False positives by age category")

But I am struggling to find the p-values that show whether there is any significant difference in the false positive proportions between the age categories.

For example, I want to see whether there is a significant difference (p-value) between '40-45' and each of the other age categories.

Any help would be much appreciated!

fireplush
  • There are many different ways of conducting inference on a proportion. A typical approach would be an ANOVA that yields a chi-square test statistic. You would usually want to use the raw data rather than the aggregates you calculated. You could also bootstrap it. – socialscientist Aug 22 '22 at 11:36
  • Does this answer your question? [How to perform single factor ANOVA in R with samples organized by column?](https://stackoverflow.com/questions/14206154/how-to-perform-single-factor-anova-in-r-with-samples-organized-by-column) – socialscientist Aug 22 '22 at 11:37

1 Answer

So, if I understand your question correctly, you want a chi-square test for this table, where the column labelled TP is the total count minus the false positives:

| Age group | FP | TP |
|---|---|---|
| 40 - 45 | 25 | 75 |
| 45 - 50 | 25 | 43 |
| 50 - 55 | 50 | 200 |
| 55 - 60 | 82 | 235 |

You can get an overall P-value by calculating a chi-square statistic for that table:

cbind(c(25,25,50,82), c(75,43,200,235)) |> chisq.test()
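As a slightly more verbose sketch of the same test, the counts can be stored in a labelled matrix so that the p-value, the expected counts, and the standardized residuals can be inspected afterwards. The object names `tab` and `res` and the column label `not_FP` are illustrative choices, not part of the original answer.

```r
# Counts from the table above: one row per age group,
# columns = false positives and the remaining (non-FP) cases.
tab <- matrix(
  c(25, 25, 50, 82,      # first column: FP
    75, 43, 200, 235),   # second column: total minus FP
  ncol = 2,
  dimnames = list(
    age_group = c("40-45", "45-50", "50-55", "55-60"),
    outcome   = c("FP", "not_FP")   # hypothetical label
  )
)

# Pearson chi-square test of independence between age group and outcome
res <- chisq.test(tab)

res$p.value   # overall p-value across all four age groups
res$expected  # expected counts under independence
res$stdres    # standardized residuals: which cells deviate most
```

The overall p-value only says whether the proportions differ somewhere across the groups; it does not say which groups differ.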

If you want to specifically contrast the first level with the others, you can use a logistic regression:

library(tibble)
library(dplyr)
df <- tribble(
  ~ age_group, ~FP, ~TP, ~weight,
  1, 1, 0, 25,
  1, 0, 1, 75,
  2, 1, 0, 25,
  2, 0, 1, 43,
  3, 1, 0, 50,
  3, 0, 1, 200,
  4, 1, 0, 82,
  4, 0, 1, 235
) |> 
  mutate(age_group = factor(age_group))

glm(FP ~ age_group, family = binomial, weights = weight, data = df) |> summary()
#> 
#> Call:
#> glm(formula = FP ~ age_group, family = binomial, data = df, weights = weight)
#> 
#> Deviance Residuals: 
#>       1        2        3        4        5        6        7        8  
#>   8.325   -6.569    7.073   -6.278   12.686   -9.448   14.892  -11.861  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -1.09861    0.23094  -4.757 1.96e-06 ***
#> age_group2   0.55629    0.34145   1.629    0.103    
#> age_group3  -0.28768    0.27988  -1.028    0.304    
#> age_group4   0.04575    0.26417   0.173    0.863    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 822.77  on 7  degrees of freedom
#> Residual deviance: 814.55  on 4  degrees of freedom
#> AIC: 822.55
#> 
#> Number of Fisher Scoring iterations: 5

Created on 2022-08-22 by the reprex package (v2.0.1)
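As an alternative sketch, the same logistic regression can be fitted directly to the aggregated counts by passing a two-column response `cbind(FP, TP)`, which avoids the long format with weights. The data frame name `counts` and the text labels for the age groups are assumptions made for this illustration. The coefficients, standard errors, and p-values should match the weighted fit above, while the reported deviances and AIC will differ because the data are grouped differently.

```r
# Aggregated counts, one row per age group (labels chosen for illustration)
counts <- data.frame(
  age_group = factor(c("40-45", "45-50", "50-55", "55-60")),
  FP = c(25, 25, 50, 82),
  TP = c(75, 43, 200, 235)   # total minus FP, as in the table above
)

# Binomial GLM with a (successes, failures) response matrix;
# the first factor level ("40-45") is the reference group.
fit <- glm(cbind(FP, TP) ~ age_group, family = binomial, data = counts)

# Each age_group row compares that group with "40-45" on the log-odds scale
coef(summary(fit))
```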

By changing the reference level of the factor, or the contrast matrix, you can also get p-values for any one level compared with the others.
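A minimal sketch of changing the reference level, assuming the same long-format data as in the reprex (rebuilt here with base R so the snippet is self-contained; `df_long` and `fit_ref3` are illustrative names):

```r
# Same data as the reprex above, in base R
df_long <- data.frame(
  age_group = factor(rep(1:4, each = 2)),
  FP        = rep(c(1, 0), times = 4),
  weight    = c(25, 75, 25, 43, 50, 200, 82, 235)
)

# Make age group 3 (50-55) the reference level instead of group 1
df_long$age_group <- relevel(df_long$age_group, ref = "3")

fit_ref3 <- glm(FP ~ age_group, family = binomial,
                weights = weight, data = df_long)

# Coefficients now compare groups 1, 2 and 4 against group 3
coef(summary(fit_ref3))
```

Refitting with several reference levels amounts to running several comparisons, so a multiple-comparison correction such as `p.adjust()` may be worth considering.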

Claudio