Influence of subtotals on significance tests in expss tables

Question

Hello to R/expss experts! This is a follow-up question to this one --> Complex tables with expss package.

I added subtotals to already complex tables using the excellent expss package, and it works well for most tasks (counts, proportions, means...). Yet, I found out that the significance test results differ between a table without subtotals and the exact same table with subtotals. @Gregory Demin, your knowledge would be greatly appreciated :)

An example to illustrate my words, using the infert dataset available in the datasets package:

library(expss)
example <- infert %>%
  tab_significance_options(sig_level=0.2, keep="none", sig_labels=NULL, subtable_marks="greater", mode="append") %>%
  tab_cols(total(), education) %>%
  tab_cells(parity) %>%
  # block for cases
  tab_stat_cases(label="N", total_row_position="above", total_statistic="u_cases", total_label="TOTAL") %>% 
  tab_last_add_sig_labels() %>%
  # block for percent statistic - Subtable tests  
  tab_stat_cpct(label="%Col.", total_row_position="above", total_statistic="u_cpct", total_label="TOTAL") %>%
  tab_last_add_sig_labels() %>%
  tab_last_sig_cpct(label="T.1", compare_type="subtable") %>%
  # block for percent statistic - First column tests
  tab_stat_cpct(label="T.2", total_row_position="above", total_statistic="u_cpct", total_label="TOTAL") %>%
  tab_last_add_sig_labels() %>%
  tab_last_sig_cpct(compare_type="first_column", mode="replace") %>%
  tab_pivot(stat_position="inside_columns") %>%
  # converts NA to zero
  recode(as.criterion(is.numeric) & is.na ~ 0, TRUE ~ copy)
example <- example[,-c(4,5)]
print(example)

Note: sig_level is deliberately very high (20%) to illustrate this specific issue, do not panic :) This table is the starting point, and I am fine with its results. Then we only add the subtotals (line 5 of the code below):

example2 <- infert %>%
  tab_significance_options(sig_level=0.2, keep="none", sig_labels=NULL, subtable_marks="greater", mode="append") %>%
  tab_cols(total(), education) %>%
  tab_cells(parity) %>%
  tab_subtotal_cells("#FIRST 3"=c(1,2,3),"#LAST 3"=c(4,5,6), position = "above") %>%
  # block for cases
  tab_stat_cases(label="N", total_row_position="above", total_statistic="u_cases", total_label="TOTAL") %>% 
  tab_last_add_sig_labels() %>%
  # block for percent statistic - Subtable tests  
  tab_stat_cpct(label="%Col.", total_row_position="above", total_statistic="u_cpct", total_label="TOTAL") %>%
  tab_last_add_sig_labels() %>%
  tab_last_sig_cpct(label="T.1", compare_type="subtable") %>%
  # block for percent statistic - First column tests
  tab_stat_cpct(label="T.2", total_row_position="above", total_statistic="u_cpct", total_label="TOTAL") %>%
  tab_last_add_sig_labels() %>%
  tab_last_sig_cpct(compare_type="first_column", mode="replace") %>%
  tab_pivot(stat_position="inside_columns") %>%
  # converts NA to zero
  recode(as.criterion(is.numeric) & is.na ~ 0, TRUE ~ copy)
example2 <- example2[,-c(4,5)]
print(example2)

I do not know what is happening, but the results of the significance tests are not the same this time. Besides, it seems that no significance test is computed for the two subtotal rows. Any insight?

Gregory Demin · Accepted Answer · 2020-04-05T20:45:22.493

To test significance between percentages we need the number of cases in the total statistic, so we request a total statistic with two rows (cases and percent). After all manipulations, the rows with total cases are deleted. Also, significance_cpct uses the # sign to detect total rows, so a # in the subtotal labels leads to incorrect results.
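To see why the percentages alone are not enough, here is a minimal sketch in base R. It uses `prop.test` as a stand-in for the test, which is an assumption made for illustration and not necessarily the exact computation expss performs. The counts are made up: the same two percentages (30% vs 40%) are compared on a small and on a large base.

```r
# Same percentages (30% vs 40%), but different numbers of cases behind them.
# The counts are invented for illustration only.
p_small <- prop.test(x = c(15, 20),   n = c(50, 50),   correct = FALSE)$p.value
p_large <- prop.test(x = c(150, 200), n = c(500, 500), correct = FALSE)$p.value

# The larger base gives the smaller p-value, so the test cannot be run
# from the percentages alone: it needs the cases.
print(c(small_base = p_small, large_base = p_large))
stopifnot(p_large < p_small)
```

This is why the total statistic has to carry a row of cases alongside the row of percentages, even if the cases row is dropped from the final table.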

Taking all of the above into account:

example2 <- infert %>%
    tab_significance_options(sig_level=0.2, keep="none", sig_labels=NULL, subtable_marks="greater", mode="append") %>%
    tab_cols(total(), education) %>%
    tab_cells(parity) %>%
    tab_subtotal_cells("FIRST 3"=c(1,2,3),"LAST 3"=c(4,5,6), position = "above") %>%
    # block for cases
    tab_stat_cases(label="N", total_row_position="above", total_statistic="u_cases", total_label="TOTAL") %>% 
    tab_last_add_sig_labels() %>%
    # block for percent statistic - Subtable tests  
    # note additional total statistic
    tab_stat_cpct(label="%Col.", total_row_position="above", total_statistic= c("u_cases", "u_cpct"), 
                  total_label=c("TO DELETE", "TOTAL")) %>%
    tab_last_add_sig_labels() %>%
    tab_last_sig_cpct(label="T.1", compare_type="subtable") %>%
    # block for percent statistic - First column tests
    tab_stat_cpct(label="T.2", total_row_position="above", total_statistic= c("u_cases", "u_cpct"), 
                  total_label=c("TO DELETE", "TOTAL")) %>%
    tab_last_add_sig_labels() %>%
    tab_last_sig_cpct(compare_type="first_column", mode="replace") %>%
    tab_pivot(stat_position="inside_columns") %>%
    # drop the helper total rows labelled "TO DELETE"
    where(!grepl("TO DELETE", row_labels)) %>% 
    # converts NA to zero
    recode(as.criterion(is.numeric) & is.na ~ 0, TRUE ~ copy)
example2 <- example2[,-c(4,5)]
print(example2)
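The effect of the # prefix can be pictured with a small base R sketch. This is not how expss is implemented internally; it only illustrates marker-based matching. The row labels are hypothetical and simplified.

```r
# Hypothetical, simplified row labels resembling the question's table.
row_labels <- c("#TOTAL", "#FIRST 3", "1", "2", "3", "#LAST 3", "4", "5", "6")

# A marker-based rule flags every label containing "#" as a total row,
# so the two subtotal rows are flagged together with the real total.
flagged <- grepl("#", row_labels, fixed = TRUE)
print(row_labels[flagged])

# Dropping the "#" from the subtotal labels leaves only the real total flagged.
renamed <- sub("^#(FIRST|LAST)", "\\1", row_labels)
print(renamed[grepl("#", renamed, fixed = TRUE)])
```

With the renamed labels, only one row matches the marker, which is the situation the corrected code relies on.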

UPDATE with net on columns:

data(infert)
example2 <- infert %>%
    apply_labels(
        education = "Education"
    ) %>% 
    tab_significance_options(sig_level=0.2, keep="none", sig_labels=NULL, subtable_marks="greater", mode="append") %>%
    tab_cols(total(), net(education, "LESS THAN 12 Y.O."=levels(education)[1:2])) %>%
    tab_cells(parity) %>%
    tab_subtotal_cells("FIRST 3"=c(1,2,3),"LAST 3"=c(4,5,6), position = "above") %>%
    # block for cases
    tab_stat_cases(label="N", total_row_position="above", total_statistic="u_cases", total_label="TOTAL") %>% 
    tab_last_add_sig_labels() %>%
    # block for percent statistic - Subtable tests  
    # note additional total statistic
    tab_stat_cpct(label="%Col.", total_row_position="above", total_statistic= c("u_cases", "u_cpct"), 
                  total_label=c("TO DELETE", "TOTAL")) %>%
    tab_last_add_sig_labels() %>%
    tab_last_sig_cpct(label="T.1", compare_type="subtable") %>%
    # block for percent statistic - First column tests
    tab_stat_cpct(label="T.2", total_row_position="above", total_statistic= c("u_cases", "u_cpct"), 
                  total_label=c("TO DELETE", "TOTAL")) %>%
    tab_last_add_sig_labels() %>%
    tab_last_sig_cpct(compare_type="first_column", mode="replace") %>%
    tab_pivot(stat_position="inside_columns") %>%
    # drop the helper total rows labelled "TO DELETE"
    where(!grepl("TO DELETE", row_labels)) %>% 
    # converts NA to zero
    recode(as.criterion(is.numeric) & is.na ~ 0, TRUE ~ copy)
example2 <- example2[,-c(4,5)]
print(example2)

As always thank you @Gregory Demin, everything works as intended! I was not aware of this limitation, though this seems obvious. I somehow thought that because the cases appeared somewhere in the table, the function would know where to look for them. For my information, based on the same dataset but with education in tab_cells / parity in tab_cols: if we wish to create a subtotal based on factor levels, we can use --> tab_subtotal_cells("LESS THAN 12 Y.O."=levels(education)[1:2]), but do you have a more convenient method like --> c(1:2) (which does not work since levels are character strings)? — Maxence Dum., Apr 03 '20 at 09:55
@MaxenceDum. Yes, there is no convenient method for factors. You can convert them to labelled variables and then use numeric codes. Something like this `education = as.labelled(education)`. — Gregory Demin, Apr 03 '20 at 10:23
Sorry for my endless flow of questions @Gregory Demin, but I tried subtotals/nets column-wise with the following piece of code (just after tab_cols): `tab_net_cols("LESS THAN 12"=levels(education)[1:2],position="above") %>%`. Same problem I guess, but I do not know what statistic to associate and columns to remove like what you did with `where()`. — Maxence Dum., Apr 04 '20 at 08:56
@MaxenceDum. Net columns shouldn't affect significance. If you need to remove some columns you can use `except(fixed("part of your column name"))` — Gregory Demin, Apr 04 '20 at 11:51
I did it the other way round, with `keep(row_labels, from(fixed(pattern = "#Total"))) %>%`, thanks for the tip! But even though the unwanted columns are dropped, I still have the issue with significance tests. Are you sure everything is correct with the tab_net_cols instruction posted above? I feel the duplicated empty columns I had to delete shouldn't have been there in the first place. — Maxence Dum., Apr 04 '20 at 13:36
@MaxenceDum. `tab_net_cols` applies nets to all column variables, e.g. to the `total` in your table. In many cases this is very confusing. Try `tab_cols(total(), net(education, "LESS THAN 12 Y.O."=levels(education)[1:2]))` — Gregory Demin, Apr 04 '20 at 15:33
Thanks @Gregory Demin for helping me, I did not think this would be so challenging for both of us. You are right, removing `total()` from the `tab_cols` instruction solves all the table generation issues. But I no longer have the total column, which is not ideal either. Unfortunately your proposition did not work and leads to the same computation error. Is it not possible to specify which factor the net should be computed on? Would it help to create a dummy variable beforehand and use it instead of the `tab_net_cols` command? — Maxence Dum., Apr 04 '20 at 19:02
Found a workaround, the `total()` and `net(education, "LESS THAN 12 Y.O."=levels(education)[1:2])` must be dissociated. By creating two sequences (one for each subtable), I get to the expected result. That makes a massive amount of code lines only for one table though, but this is an issue for another time. Thanks again @Gregory ! — Maxence Dum., Apr 05 '20 at 08:23
@MaxenceDum. If I correctly understand what you need - see update. There are two changes: `net` only on education and variable label on education. The latter helps to distinguish blocks of the table for significance testing. — Gregory Demin, Apr 05 '20 at 20:52
That is it! This highlights the importance of correctly labelling the variables. Thanks again @Gregory! — Maxence Dum., Apr 06 '20 at 14:10
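On the earlier point in the comments about selecting factor levels by number rather than by name: the sketch below uses base R only and does not call `as.labelled`. It shows the integer codes that sit behind the levels of `education` in `infert`. It is an assumption here that a conversion to numeric codes keeps the level order, so that codes 1:2 would correspond to `levels(education)[1:2]`.

```r
data(infert)

# Factor levels are character strings; each has an integer code by position.
lv <- levels(infert$education)
code_of_level <- setNames(seq_along(lv), lv)
print(code_of_level)

# Selecting the first two levels by name or by underlying integer code
# picks out the same rows.
first_two_by_name <- infert$education %in% lv[1:2]
first_two_by_code <- as.integer(infert$education) %in% 1:2
stopifnot(identical(first_two_by_name, first_two_by_code))
```

If that assumption holds after conversion, a numeric selection such as c(1:2) would address the same categories as the level names do here.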
