I have used RStudio for years, and more often than any other software, but now that I'm going to teach statistics with R, I realize that some tasks are simply easier in other software such as STATA.

Is there a simple way of getting a frequency table in R (including count, percent, and cumulative frequencies) just like we would get by typing tab [variable] in STATA?

I came across this tidyverse solution:

library(tibble)  # tribble()
library(dplyr)   # %>%, group_by(), summarise(), mutate()

dataset <- tribble(
           ~var1, ~var2, ~var3, ~var4, ~var5,
           "1",   "1",   "1",   "a",   "d",
           "2",   "2",   "2",   "b",   "e",
           "3",   "3",   "3",   "c",   "f")

dataset %>%
      group_by(var1) %>%
      summarise(n = n()) %>%                        # count per category
      mutate(totalN = cumsum(n),                    # cumulative count
             percent = round(n / sum(n), 3),        # share of each category
             cumpercent = round(cumsum(n / sum(n)), 3))

But this is, very obviously, far too complicated to teach undergrads. Isn't there an easier way, maybe even a base R solution? Ideally, I would like to have one line of code for which I don't have to install 5-10 different packages first.

Dr. Fabian Habersack
  • Source: https://stylizeddata.com/stata-to-r-how-to-tabulate-a-categorical-variable – Dr. Fabian Habersack Sep 12 '19 at 16:38
  • "But this is, very obviously, far too complicated to teach undergrads" Do you have data to back this up? It reads as one would do the math... if they can handle the math, they should be able to handle the operations in your `mutate` call – Matias Andina Sep 12 '19 at 16:39
  • Sure, at some point you'll understand this, and we obviously do, because we know the syntax. But if you teach stats at a very basic, introductory level, then I am sure we will both agree that `tab` is much easier and handier than this dplyr solution, no? – Dr. Fabian Habersack Sep 12 '19 at 16:41
  • I am not sure if there is a base function for that specific task. Nevertheless, I think you are approaching the "Stata-R" debate wrong. The tidyverse solution is not complicated; tidyverse was designed to be easy to use and easy to read. The code you present is very intuitive, and a person (an undergrad student) can understand what is going on and use that knowledge in many other problems. Assume you have a larger dataset and that you want the same result but grouping by multiple variables: conceptually, you only have to make a small change in the group_by function. Here is where you benefit from R. – Orlando Sabogal Sep 12 '19 at 16:46
  • You can remove a line from your code by using `count(var1)` – bouncyball Sep 12 '19 at 16:46
  • OK, that is all true, and learning all this as soon as possible will come in very handy later on, as the step from there to applying it to another problem or dataset will obviously be smaller. But I would still argue that this needs to be learned step by step: before using {dplyr}, one needs to understand how to set the working directory (etc.). So if you want to understand how code produces output when you press `Ctrl + Enter`, you will want very short and simple code that is easy to grasp. Just for the record: I'm not saying STATA is better. – Dr. Fabian Habersack Sep 12 '19 at 16:54

1 Answer

I don't agree with your claim that undergrads won't be able to understand this, and I don't want to turn this question into a debate about teaching strategies or whether you should be using R if you don't believe it suits the level of your course.

You can supply them with this function, which they don't have to understand (the same way they don't have to understand the one from STATA).

library(dplyr)
tab <- function(dataset, var){

  dataset %>%
    # embrace var to be able to call it with any grouping factor
    group_by({{var}}) %>% 
    summarise(n=n()) %>%
    mutate(totalN = cumsum(n),
           percent = n / sum(n),
           cumpercent = cumsum(n / sum(n)))

}

Then (provided you have saved the function in tab.R and run source("tab.R")), here's your one-liner:

tab(dataset, var1)
# A tibble: 3 x 5
  var1      n totalN percent cumpercent
  <chr> <int>  <int>   <dbl>      <dbl>
1 1         1      1   0.333      0.333
2 2         1      2   0.333      0.667
3 3         1      3   0.333      1  

You can try tab(dataset, var2). Please note that this answer will only group by one factor (this was your question).
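
Since the question also asks about base R, here is a minimal sketch of the same idea without any packages. Treat it as an illustration rather than an established function: `tab_base` is a name made up for this example, and unlike `tab()` above it takes a vector (for example `dataset$var1`) instead of a data frame plus a column name.

```r
# Base R sketch of a Stata-like frequency table; no packages required.
# `tab_base` is a hypothetical helper name, not part of base R.
tab_base <- function(x) {
  counts  <- table(x, useNA = "ifany")       # frequency of each distinct value
  percent <- as.vector(prop.table(counts))   # share of each value (0 to 1)
  data.frame(
    value      = names(counts),
    n          = as.vector(counts),
    totalN     = cumsum(as.vector(counts)),  # cumulative count
    percent    = round(percent, 3),
    cumpercent = round(cumsum(percent), 3),  # cumulative share
    stringsAsFactors = FALSE
  )
}

# Same toy values as var1 in the question, built as a plain data frame
dataset_base <- data.frame(var1 = c("1", "2", "3"), stringsAsFactors = FALSE)
tab_base(dataset_base$var1)
```

The columns mirror the ones returned by the dplyr version, so students can compare the two outputs directly.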

EDIT

"one needs to understand how to set the working directory (etc.)"

Not entirely true: if you are using RStudio, you can import a dataset from a folder manually with a few clicks. If you want to teach stats using R (which I think you definitely should), you should have at least one class on the bare minimum (yes, that includes the working directory, how to call library(...), and basic functions). There is a huge amount of resources (books, YouTube tutorials) you can assign as homework or as part of the class, so students become familiar with it. The argument of WHATEVER SOFTWARE IS EASIER is weak once we drop all assumptions: I would still need to know where to click in the specific version of whatever software...

Matias Andina