46

I have a contingency table of counts, and I want to extend it with corresponding proportions of each group.

Some sample data (tips data set from ggplot2 package):

library(ggplot2)

head(tips, 3)
#   total_bill tip    sex smoker day   time size
# 1         17 1.0 Female     No Sun Dinner    2
# 2         10 1.7   Male     No Sun Dinner    3
# 3         21 3.5   Male     No Sun Dinner    3

First, use table to count smoker vs non-smoker, and nrow to count total number of subjects:

table(tips$smoker)
#  No Yes 
# 151  93 

nrow(tips)
# [1] 244

Then, I want to calculate percentage of smokers vs. non smokers. Something like this (ugly code):

# percentage of smokers
options(digits = 2)

transform(as.data.frame(table(tips$smoker)), percentage_column = Freq / nrow(tips) * 100)
#   Var1 Freq percentage_column
# 1   No  151                62
# 2  Yes   93                38

Is there a better way to do this?

(even better it would be to do this on a set of columns (which I enumerate) and have output somewhat nicely formatted) (e.g., smoker, day, and time)

Henrik
  • 65,555
  • 14
  • 143
  • 159
userJT
  • 11,486
  • 20
  • 77
  • 88

6 Answers6

88

If it's conciseness you're after, you might like:

prop.table(table(tips$smoker))

and then scale by 100 and round if you like. Or more like your exact output:

tbl <- table(tips$smoker)
cbind(tbl,prop.table(tbl))

If you wanted to do this for multiple columns, there are lots of different directions you could go depending on what your tastes tell you is clean looking output, but here's one option:

tblFun <- function(x){
    tbl <- table(x)
    res <- cbind(tbl,round(prop.table(tbl)*100,2))
    colnames(res) <- c('Count','Percentage')
    res
}

do.call(rbind,lapply(tips[3:6],tblFun))
       Count Percentage
Female    87      35.66
Male     157      64.34
No       151      61.89
Yes       93      38.11
Fri       19       7.79
Sat       87      35.66
Sun       76      31.15
Thur      62      25.41
Dinner   176      72.13
Lunch     68      27.87

If you don't like stack the different tables on top of each other, you can ditch the do.call and leave them in a list.

joran
  • 169,992
  • 32
  • 429
  • 468
  • hmm.. I did not think about "chaining" the table and prop.table like that. how would one do it for multiple enumerated columns...? – userJT Mar 08 '12 at 22:10
  • @user56 Edited with a possible example (but there are lots of different ways to approach what you describe). – joran Mar 08 '12 at 22:18
  • can dplyr add such function (we all use it all the time) – userJT Mar 01 '16 at 14:33
  • I think this code reads also the first row for the analysis. - - I get `female 22 45.83 Gender 1 2.08 male 25 52.08`. How can you skip it? - - My data's separator is tab `\t`. Just do `do.call(rbind,lapply(tips[-(1), 2:3],tblFun))`. – Léo Léopold Hertz 준영 Oct 28 '16 at 13:47
13

Your code doesn't seem so ugly to me...
however, an alternative (not much better) could be e.g. :

df <- data.frame(table(yn))
colnames(df) <- c('Smoker','Freq')
df$Perc <- df$Freq / sum(df$Freq) * 100

------------------
  Smoker Freq Perc
1     No   19 47.5
2    Yes   21 52.5
digEmAll
  • 56,430
  • 9
  • 115
  • 140
4

I am not 100% certain, but I think this does what you want using prop.table. See mostly the last 3 lines. The rest of the code is just creating fake data.

set.seed(1234)

total_bill <- rnorm(50, 25, 3)
tip <- 0.15 * total_bill + rnorm(50, 0, 1)
sex <- rbinom(50, 1, 0.5)
smoker <- rbinom(50, 1, 0.3)
day <- ceiling(runif(50, 0,7))
time <- ceiling(runif(50, 0,3))
size <- 1 + rpois(50, 2)
my.data <- as.data.frame(cbind(total_bill, tip, sex, smoker, day, time, size))
my.data

my.table <- table(my.data$smoker)

my.prop <- prop.table(my.table)

cbind(my.table, my.prop)
Mark Miller
  • 12,483
  • 23
  • 78
  • 132
1

Here's a tidyverse version:

library(tidyverse)
data(diamonds)

(as.data.frame(table(diamonds$cut)) %>% rename(Count=1,Freq=2) %>% mutate(Perc=100*Freq/sum(Freq)))

Or if you want a handy function:

getPercentages <- function(df, colName) {
  df.cnt <- df %>% select({{colName}}) %>% 
    table() %>%
    as.data.frame() %>% 
    rename({{colName}} :=1, Freq=2) %>% 
    mutate(Perc=100*Freq/sum(Freq))
}

Now you can do:

diamonds %>% getPercentages(cut)

or this:

df=diamonds %>% group_by(cut) %>% group_modify(~.x %>% getPercentages(clarity))
ggplot(df,aes(x=clarity,y=Perc))+geom_col()+facet_wrap(~cut)
Robert
  • 924
  • 1
  • 9
  • 20
  • I am getting "Error: `data` must be a data frame, or other object coercible by `fortify()`, not a list" while running: https://pastebin.com/ZE58J3Ru Any way to fix it? – vasili111 Jul 24 '19 at 15:23
  • 1
    In the new version of dplyr the functionality of group_map has changed now group_modify should be used. I changed the example – Robert Jul 24 '19 at 19:19
0

I made this for when doing aggregate functions and similar

per.fun <- function(x) {
    if(length(x)>1){
        denom <- length(x);
        num <- sum(x);
        percentage <- num/denom;
        percentage*100
        }
        else NA
    }
James Holland
  • 1,102
  • 10
  • 17
0

Here is another example using the lapply and table functions in base R.

freqList = lapply(select_if(tips, is.factor), 
              function(x) {
                  df = data.frame(table(x))

                  df = data.frame(fct = df[, 1], 
                                  n = sapply(df[, 2], function(y) {
                                      round(y / nrow(dat), 2)
                                    }
                                )
                            )
                  return(df) 
                    }
                )

Use print(freqList) to see the proportion tables (percent of frequencies) for each column/feature/variable (depending on your tradecraft) that is labeled as a factor.

DataSci-IOPsy
  • 308
  • 2
  • 11