Significance testing for multiple levels of groupings using counts/proportions with missing values

Question

I have a data set that looks like this:

df <- data.frame(
  Treatment = c("A", "A", "A", "A", "A", "A",
                "A", "A", "A", "A", "A", "A",
                "B", "B", "B", "B", "B", "B",
                "B", "B", "B", "B", "B", "B"),
  Patient = c(1, 1, 1, 1, 1, 1,
              2, 2, 2, 2, 2, 2,
              3, 3, 3, 3, 3, 3,
              4, 4, 4, 4, 4, 4),
  Timepoint = c("PRE", "PRE", "PRE", "POST", "POST", "POST",
                "PRE", "PRE", "PRE", "POST", "POST", "POST",
                "PRE", "PRE", "PRE", "POST", "POST", "POST",
                "PRE", "PRE", "PRE", "POST", "POST", "POST"),
  Phenotype = c("NK", "T Cell", "Macrophage", "NK", "T Cell", "Macrophage",
                "NK", "T Cell", "Macrophage", "NK", "T Cell", "Macrophage",
                "NK", "T Cell", "Macrophage", "NK", "T Cell", "Macrophage",
                "NK", "T Cell", "Macrophage", "NK", "T Cell", "Macrophage"),
  Count = c(523,235,2352,352,646,234,
            3463,525,646,234,725,264,
            1636,3153,455,134,646,253,
            464,252,464,276,364,353)
)

There are two levels of comparisons I'm trying to make:

The first is between the PRE and POST timepoints for each phenotype within each patient, with output that would look something like this:

data.frame(
  Patient = c(1, 1, 1,
              2, 2, 2,
              3, 3, 3,
              4, 4, 4),
  Phenotype = c("NK", "T Cell", "Macrophage",
                "NK", "T Cell", "Macrophage",
                "NK", "T Cell", "Macrophage",
                "NK", "T Cell", "Macrophage"),
  Pvalue = c(0, 0, 0,
             0, 0, 0,
             0, 0, 0,
             0, 0, 0)
)
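
For this first comparison, one option is to build a 2x2 table per patient and phenotype (this phenotype vs. all other cells, PRE vs. POST) and test it. The sketch below is a minimal base R illustration under stated assumptions, not a definitive analysis: it assumes the proportion of interest is a phenotype's share of all counted cells for that patient and timepoint, and `Total` and `pre_post_p` are hypothetical names introduced here.

```r
# Example data from the question, rebuilt compactly with rep().
df <- data.frame(
  Treatment = rep(c("A", "B"), each = 12),
  Patient   = rep(1:4, each = 6),
  Timepoint = rep(rep(c("PRE", "POST"), each = 3), times = 4),
  Phenotype = rep(c("NK", "T Cell", "Macrophage"), times = 8),
  Count     = c(523, 235, 2352, 352, 646, 234,
                3463, 525, 646, 234, 725, 264,
                1636, 3153, 455, 134, 646, 253,
                464, 252, 464, 276, 364, 353)
)

# Assumption: the denominator is all cells counted for a patient at a timepoint.
df$Total <- ave(df$Count, df$Patient, df$Timepoint, FUN = sum)

# Hypothetical helper: p-value for one patient/phenotype subset.
# Assumes exactly one PRE row and one POST row in `d`.
pre_post_p <- function(d) {
  pre  <- d[d$Timepoint == "PRE", ]
  post <- d[d$Timepoint == "POST", ]
  tab <- rbind(PRE  = c(pre$Count,  pre$Total  - pre$Count),
               POST = c(post$Count, post$Total - post$Count))
  chisq.test(tab)$p.value
}

# One test per patient/phenotype combination.
groups <- split(df, list(df$Patient, df$Phenotype), drop = TRUE)
result <- do.call(rbind, lapply(groups, function(d) {
  data.frame(Patient   = d$Patient[1],
             Phenotype = d$Phenotype[1],
             Pvalue    = pre_post_p(d))
}))
rownames(result) <- NULL
```

Because this runs many tests, the p-values would usually be adjusted afterwards, for example with `p.adjust(result$Pvalue, method = "BH")`.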

The second is a higher-level comparison using the treatment grouping, producing something like this:

data.frame(
  Treatment = c("A", "A", "A",
              "B", "B", "B"),
  Phenotype = c("NK", "T Cell", "Macrophage",
                "NK", "T Cell", "Macrophage"),
  Pvalue = c(0, 0, 0,
             0, 0, 0)
)
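
For this treatment-level comparison, one candidate (an assumption on my part, not the only reasonable choice) is a Cochran-Mantel-Haenszel test, which tests PRE vs. POST for a phenotype while treating each patient as a stratum. The sketch below uses `mantelhaen.test()` from base R; `Total` and `cmh_p` are hypothetical names, and the data are rebuilt so the block runs on its own. Note that the test assumes a roughly common effect direction across patients, which may not hold here.

```r
# Example data from the question, rebuilt compactly with rep().
df <- data.frame(
  Treatment = rep(c("A", "B"), each = 12),
  Patient   = rep(1:4, each = 6),
  Timepoint = rep(rep(c("PRE", "POST"), each = 3), times = 4),
  Phenotype = rep(c("NK", "T Cell", "Macrophage"), times = 8),
  Count     = c(523, 235, 2352, 352, 646, 234,
                3463, 525, 646, 234, 725, 264,
                1636, 3153, 455, 134, 646, 253,
                464, 252, 464, 276, 364, 353)
)

# Assumption: the denominator is all cells counted for a patient at a timepoint.
df$Total <- ave(df$Count, df$Patient, df$Timepoint, FUN = sum)

# Hypothetical helper: CMH p-value for one treatment/phenotype subset.
# Builds a 2 x 2 x K array: timepoint x (phenotype vs. other) x patient.
cmh_p <- function(d) {
  patients <- unique(d$Patient)
  tab <- array(0, dim = c(2, 2, length(patients)),
               dimnames = list(Timepoint = c("PRE", "POST"),
                               Cell      = c("phenotype", "other"),
                               Patient   = as.character(patients)))
  for (k in seq_along(patients)) {
    for (tp in c("PRE", "POST")) {
      obs <- d[d$Patient == patients[k] & d$Timepoint == tp, ]
      tab[tp, , k] <- c(obs$Count, obs$Total - obs$Count)
    }
  }
  mantelhaen.test(tab)$p.value
}

# One test per treatment/phenotype combination.
groups <- split(df, list(df$Treatment, df$Phenotype), drop = TRUE)
treatment_result <- do.call(rbind, lapply(groups, function(d) {
  data.frame(Treatment = d$Treatment[1],
             Phenotype = d$Phenotype[1],
             Pvalue    = cmh_p(d))
}))
rownames(treatment_result) <- NULL
```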

Since I'm comparing proportions/counts here, I assume I should use a proportion test or a chi-square test. I've ruled out pairwise testing because the numbers of observations are unequal, but I'm still not sure which of the two would be more appropriate for this comparison.

I've tried this with dplyr:

df %>% group_by(Phenotype, Timepoint, Patient) %>% 
  summarise(pvalue = chisq.test(n)$p.value)

but it fails because in the actual dataset there are some cases where a phenotype will be present in one of the timepoints but not the other for some of the patients. What would be the best way to run these kinds of tests in bulk where a phenotype is 0 or NA in some of the cases? The actual dataset is much larger than the dummy set I've provided, so running them manually isn't the most efficient option.
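
One way to handle the missing combinations, sketched below in base R, is to expand the data to every patient/timepoint/phenotype combination and fill absent counts with 0 before testing. This assumes a missing or NA phenotype really means "zero cells observed", which may not be true for every dataset. The data are patient 1 from the example with the POST NK row removed to mimic the problem; `observed`, `grid`, `filled`, and `p_nk` are hypothetical names. Fisher's exact test is used here because it tolerates zero cells.

```r
# Patient 1 from the question, with the POST NK row dropped on purpose.
observed <- data.frame(
  Patient   = c(1, 1, 1, 1, 1),
  Timepoint = c("PRE", "PRE", "PRE", "POST", "POST"),
  Phenotype = c("NK", "T Cell", "Macrophage", "T Cell", "Macrophage"),
  Count     = c(523, 235, 2352, 646, 234)
)

# Every patient x timepoint x phenotype combination that should exist.
grid <- expand.grid(Patient   = unique(observed$Patient),
                    Timepoint = c("PRE", "POST"),
                    Phenotype = unique(observed$Phenotype),
                    stringsAsFactors = FALSE)

# Left join onto the grid; combinations absent from the data get NA counts.
filled <- merge(grid, observed, all.x = TRUE)

# Assumption: absent or NA means zero cells of that phenotype were seen.
filled$Count[is.na(filled$Count)] <- 0

# Denominator: all cells counted for a patient at a timepoint.
filled$Total <- ave(filled$Count, filled$Patient, filled$Timepoint, FUN = sum)

# PRE vs. POST for NK, which now has an explicit zero at POST.
nk  <- filled[filled$Phenotype == "NK", ]
nk  <- nk[match(c("PRE", "POST"), nk$Timepoint), ]
tab <- cbind(nk$Count, nk$Total - nk$Count)
p_nk <- fisher.test(tab)$p.value
```

In a dplyr pipeline, `tidyr::complete()` with `fill = list(Count = 0)` plays the same role as the `expand.grid()`/`merge()` step.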

I appreciate any input!
