2

I'm an R neophyte, and somehow this problem seems like it should be trivial to solve. But unfortunately, I haven't been able to do so after about three days of searching and experimenting.

My data is in a form close to wideform:

color   agegroup    sex     ses
red     2           Female  A
blue    2           Female  C
green   5           Male    D
red     3           Female  A
red     2           Male    B
blue    1           Female  B
...

I'm trying to create presentable tables with counts and percentages of the dependent variable (color here) organized by sex, ses and agegroup. I need one table organized by ses and sex for each agegroup, with counts next to the percentages, like this:

agegroup:                                  1
sex:                  Female                               Male
ses:        A       B       C       D           A       B       C       D
color:
red         2 1%    0  0%   8 4%    22 11%      16 8%   2   1%  8   4%  3 1.5%
blue        9 4.5%  6  3%   4 2%    2  1%       12 6%   32 16%  14  7%  6   3%
green       4 2%    12 6%   2 1%    8  4%       0  0%   22 11%  40 20%  0   0%

agegroup:                               2
sex:                  Female                               Male
ses:        A       B       C       D           A       B       C       D
color:
red         2 1%    0  0%   8 4%    22 11%      16 8%   2   1%  8   4%  3 1.5%
blue        9 4.5%  6  3%   4 2%    2  1%       12 6%   32 16%  14  7%  6   3%
green       4 2%    12 6%   2 1%    8  4%       0  0%   22 11%  40 20%  0   0%

I've been trying to do this with everything from datatables and expss to gmodels, but I just can't figure out how to get output like this. CrossTables from gmodels comes closest, but it's still pretty far away -- (1) it puts percentages under counts, (2) I can't get it to nest sel under sex, (3) I can't figure out how to get it to disgregate the results by generation, and (4) the output is full of dashes, vertical pipes and spaces which make putting it into a word processor or spreadsheet an error-prone manual affair.

EDIT: I removed my second question (about line plots), because the answer to the first question is perfect and deserves credit, even if it doesn't touch on the second one. I'll ask the second question separately, as I should have from the start.

zx8754
  • 52,746
  • 12
  • 114
  • 209
Gil Williams
  • 359
  • 1
  • 3
  • 10
  • `janitor::tabyl` might be worth a look. – Jon Spring Sep 15 '18 at 04:19
  • I appreciate the suggestion. Two-variable tables work fine (though they're not what I need). When I try to add a third variable, however, it throws an error that I can't decipher (`class` shows that the object I'm working on is a dataframe): Error in class(dat[[1]]) <- new_class : adding class "factor" to an invalid object In addition: Warning messages: 1: In if (col1_class %in% "factor") { : the condition has length > 1 and only the first element will be used 2: In if (new_class %in% "factor") { : the condition has length > 1 and only the first element will be used – Gil Williams Sep 15 '18 at 08:43
  • I need 4-way tables (i.e. multiple 3-way tables), and `tabyl` seems to be limited to 3-ways. It doesn't seem to be able to produce the "presentable table" from my example. – Gil Williams Sep 15 '18 at 08:56

2 Answers2

2

The closest result with expss package:

library(expss)
# generate example data
set.seed(123)
N = 300
df = data.frame(
    color = sample(c("red", "blue", "green"), size = N, replace = TRUE),
    agegroup = sample(1:5, size = N, replace = TRUE),
    sex = sample(c("Male", "Female"), size = N, replace = TRUE),
    ses = sample(c("A", "B", "C", "D"),  size = N, replace = TRUE),
    stringsAsFactors = FALSE
)

# redirect output to RStudio HTML viewer
expss_output_viewer()
res = df %>% 
    tab_cells("|" = color) %>% # dependent variable, "|" used to suppress label
    tab_cols(sex %nest% ses) %>% # column variable
    tab_rows(agegroup) %>% 
    tab_total_row_position("none") %>% # we don't need total
    tab_stat_cases(label = "Cases") %>% # calculate cases
    tab_stat_cpct(label = "%") %>% # calculate percent
    tab_pivot(stat_position = "inside_columns") %>% # finalize table
    make_subheadings(number_of_columns = 2)

# difficult part - add percent sign
for(i in grep("%", colnames(res))){
    res[[i]] = ifelse(trimws(res[[i]])!="", 
                      paste0(round(res[[i]], 1), "%"),
                      res[[i]] 
                      )
}

# additionlly remove stat labels
colnames(res) = gsub("\\|Cases|%", "", colnames(res), perl = TRUE)

res

In the RStudio Viewer result will be in the HTML format (see image). Unfortunately, I can't test how it will be pasted to the MS Word. enter image description here Disclaimer: I am an author of expss package.

Gregory Demin
  • 4,596
  • 2
  • 20
  • 20
  • Nice!! That produces tables in exactly the shape I needed! Thanks so much. Two follow-up questions... Since pasting HTML is a bit error-prone, I'm looking to paste plain text and then make it pretty manually. (1) But if I use `expss_output_raw`, the table headers are flattened and interleaved (e.g. `row_labels Female|A Female|A| Female|B Female|B|`) rather than having `sex` in the first row and `ses` in the second. How can I achieve this second format? (2) In the `agegroup` rows i get `NA ` in each column when the whole row should be empty except for the label. Suggestions? Thanks again! – Gil Williams Sep 15 '18 at 17:29
  • 1
    @GilWilliams Try `expss_output_default()` – Gregory Demin Sep 15 '18 at 17:30
  • 1
    @GilWilliams Another way - drop table to text file with tab-delimiter: `fwrite(split_table_to_df(res), "table.tab", sep = "\t", col.names = FALSE, quote = FALSE)` – Gregory Demin Sep 15 '18 at 17:34
  • You, sir, are the Jimi Hendrix of R tables. Thanks again! – Gil Williams Sep 15 '18 at 17:38
  • One final thing: the line-by-line comments in your code are *extremely* useful! I wish all code examples had them! – Gil Williams Sep 15 '18 at 17:56
  • I've run into one issue -- I'm using `htmlTable` on `res` and then saving it. The files are UTF-8 encoded at the text level (good!), but when I open them with a browser my (many, many) Unicode characters are mangled beyond recognition. Neither `options(expss.fix_encoding = FALSE)` nor `options(expss.fix_encoding = TRUE)` have any effect. Suggestions? – Gil Williams Sep 15 '18 at 21:09
  • @GilWilliams How do you save result of `htmlTable`? I don't see any issues with result of `writeLines(htmlTable(res), "table.html")`. – Gregory Demin Sep 16 '18 at 10:05
  • I'm saving the same way you mentioned. The problems only show up when there are characters from outside the range where ASCII and Unicode overlap. For example, `áéíóúñ` in the R script and in the RStudio viewing panel is displayed as `áéíóúñ` when the resulting HTML file is opened in a browser. It looks fine in the source code, though (when viewed in a text editor). – Gil Williams Sep 16 '18 at 10:27
  • @GilWilliams It seems that browser doesn't correctly detect encoding of your file. Try to change page encoding in browser to UTF-8 manually. It works for me. – Gregory Demin Sep 16 '18 at 11:03
  • Looking at the file that R saves, there's actually no HTML encoding or character information at all (the file is UTF-8 encoded at the text level, but that's of course a different matter). So browsers have to guess at that, and mine do it wrong. Any way to add that info? I think it would involve inserting something like `` after ``, but htmlTable doesn't seem to be able to do that. – Gil Williams Sep 16 '18 at 18:03
  • 1
    @GilWilliams We can do it with `writeLines(paste0('', htmlTable(res)), "table.html")`. – Gregory Demin Sep 16 '18 at 18:41
  • Worked like a charm! Thanks so much for all your help! – Gil Williams Sep 17 '18 at 01:18
  • Any updates on this solution that require fewer steps? Putting counts and percentages adjacent to one another is a common tabulation layout. – Ben May 02 '22 at 15:32
1

You can use adorn_ns(position = "front") from the janitor package. It will give you counts and percentages together.

For example, this code:

df %>%  
arrange(desc(all)) %>%  
adorn_percentages("col") %>%
adorn_pct_formatting() %>% 
adorn_ns(position = "front") %>%
as.data.frame()

gives this output:

example

rm1104
  • 193
  • 3
  • 8