Summary data by column in R

Question

I have the following data

pt_id <- c(1,1,1,1,1,2,2,2,3,3,3,3,3,4,4,4,4)
Tob_pk <- c(2, 5, 7, 1, 8, 12, 14, 3, 6, 8, 10, 20, 13, 5, 4, 12, 10)
Tobacco <- c("Once","Twice","Never", NA, NA, NA, NA, NA,"Once","Twice","Quit","Once",NA,NA,"Never", NA, "Never")
Alcohol <- c("Twice", "Once",NA, NA, "Never", NA, NA, "Once", NA, "Quit", "Twice", NA, "Once", NA, NA, "Never", "Never")
PA <- c("Once",NA,"Never", NA, NA, NA, NA, NA,"Once",NA,"Quit","Once",NA,NA,"Never", NA, NA)
mydata <- data.frame(pt_id, Tob_pk, Tobacco, Alcohol, PA)
mydata

I want the summary/ proportions for each variable in my dataset, I tried using the following code to get the summary/ proportions for each variable

data_summ <- mydata %>%
  summarize_at(.vars=3:5, funs(prop.table(.)))

However, I am getting the following error

Error: Problem with `summarise()` input `Tobacco`.
x invalid 'type' (character) of argument
ℹ Input `Tobacco` is `prop.table(Tobacco)`.
Run `rlang::last_error()` to see where the error occurred.

I am not sure where I am going wrong. It would be helpful if I could get any suggestions to get the following output but with the percentage of NA's too.

Tobacco                   Alcohol                        PA

Never    0.3333333      Never    0.3333333       Never    0.3333333  
Once     0.3333333      Once     0.3333333       Once     0.5000000
Quit     0.1111111      Quit     0.1111111       Quit.    0.1666667
Twice    0.2222222      Twice    0.2222222

Thanks in advance!

score 3 · Accepted Answer · edited Jan 19 '21 at 07:11

using base

pt_id <- c(1,1,1,1,1,2,2,2,3,3,3,3,3,4,4,4,4)
Tob_pk <- c(2, 5, 7, 1, 8, 12, 14, 3, 6, 8, 10, 20, 13, 5, 4, 12, 10)
Tobacco <- c("Once","Twice","Never", NA, NA, NA, NA, NA,"Once","Twice","Quit","Once",NA,NA,"Never", NA, "Never")
Alcohol <- c("Twice", "Once",NA, NA, "Never", NA, NA, "Once", NA, "Quit", "Twice", NA, "Once", NA, NA, "Never", "Never")
PA <- c("Once",NA,"Never", NA, NA, NA, NA, NA,"Once",NA,"Quit","Once",NA,NA,"Never", NA, NA)
mydata <- data.frame(pt_id, Tob_pk, Tobacco, Alcohol, PA)

apply(mydata[3:5], 2, function(x) prop.table(table(x, useNA="ifany")))
$Tobacco
x
     Never       Once       Quit      Twice       <NA> 
0.17647059 0.17647059 0.05882353 0.11764706 0.47058824 

$Alcohol
x
     Never       Once       Quit      Twice       <NA> 
0.17647059 0.17647059 0.05882353 0.11764706 0.47058824 

$PA
x
     Never       Once       Quit       <NA> 
0.11764706 0.17647059 0.05882353 0.64705882

^{Created on 2021-01-18 by the reprex package (v0.3.0)}

using tidyverse

library(tidyverse)
map_dfr(mydata[3:5], ~prop.table(table(.x))) 
#> # A tibble: 3 x 4
#>   Never     Once      Quit      Twice    
#>   <table>   <table>   <table>   <table>  
#> 1 0.3333333 0.3333333 0.1111111 0.2222222
#> 2 0.3333333 0.3333333 0.1111111 0.2222222
#> 3 0.3333333 0.5000000 0.1666667        NA

^{Created on 2021-01-18 by the reprex package (v0.3.0)}

Thank you for the help! In my original dataset, I have almost 35 variables/ columns and some are "dbl" variable and not character variables. Is there a way I can skip those columns with "dbl" variables in my code without having to select a range (eg., 3:5 and then 8:5) manually? — PriyamK, Jan 18 '21 at 22:02
And can we export the output from base R code into an excel file? — PriyamK, Jan 18 '21 at 22:36
try it like this `map_dfr(.x = select(mydata, where(is.character)), .f = prop.table(table(.x)))` — Yuriy Saraykin, Jan 19 '21 at 07:30
for `base` `cols_char <- sapply(mydata, is.character)` `apply(mydata[, cols_char], 2, function(x) prop.table(table(x, useNA="ifany")))` — Yuriy Saraykin, Jan 19 '21 at 07:45

score 1 · Answer 2 · answered Jan 18 '21 at 17:07

Using dplyr:

library(tidyverse)

df <- mydata %>%
  select(3:5) %>%
  gather('Your_vice', 'freq', 1:3) %>%
  group_by(Your_vice, freq) %>%
  summarize(n = n()) %>%
  mutate(perc = n/sum(n))

df

# A tibble: 14 x 4
# Groups:   Your_vice [3]
   Your_vice freq      n   perc
   <chr>     <chr> <int>  <dbl>
 1 Alcohol   Never     3 0.176 
 2 Alcohol   Once      3 0.176 
 3 Alcohol   Quit      1 0.0588
 4 Alcohol   Twice     2 0.118 
 5 Alcohol   NA        8 0.471 
 6 PA        Never     2 0.118 
 7 PA        Once      3 0.176 
 8 PA        Quit      1 0.0588
 9 PA        NA       11 0.647 
10 Tobacco   Never     3 0.176 
11 Tobacco   Once      3 0.176 
12 Tobacco   Quit      1 0.0588
13 Tobacco   Twice     2 0.118 
14 Tobacco   NA        8 0.471

jay.sf · Answer 3 · 2021-01-18T17:16:18.443

You could transform the variables to factors with corresponding levels. Then table with option useNA="ifany" displays the NA.

mydata[3:5] <- lapply(mydata[3:5], factor, levels=c("Never", "Once", "Quit", "Twice"))
res1 <- sapply(mydata[3:5], function(x) prop.table(table(x)))
res1
#         Tobacco   Alcohol        PA
# Never 0.3333333 0.3333333 0.3333333
# Once  0.3333333 0.3333333 0.5000000
# Quit  0.1111111 0.1111111 0.1666667
# Twice 0.2222222 0.2222222 0.0000000

res2 <- sapply(mydata[3:5], function(x) prop.table(table(x, useNA="ifany")))
res2
#          Tobacco    Alcohol         PA
# Never 0.17647059 0.17647059 0.11764706
# Once  0.17647059 0.17647059 0.17647059
# Quit  0.05882353 0.05882353 0.05882353
# Twice 0.11764706 0.11764706 0.00000000
# <NA>  0.47058824 0.47058824 0.64705882

Can the output be exported to excel file? I tried using ```write.xlsx(res2, file = "myDATA.xlsx", sheetName = "res2", col.names = TRUE, row.names = TRUE, append = FALSE)``` but it's not working! — PriyamK, Jan 19 '21 at 17:18

score 0 · Answer 4 · answered Jan 18 '21 at 17:19

Here is a base R option using prop.table + table + na.omit + factor

do.call(
  cbind,
  lapply(
    mydata[3:5],
    function(x) {
      prop.table(
        table(
          na.omit(
            factor(x, levels = unique(na.omit(unlist(mydata[3:5]))))
          )
        )
      )
    }
  )
)

which gives

        Tobacco   Alcohol        PA
Once  0.3333333 0.3333333 0.5000000
Twice 0.2222222 0.2222222 0.0000000
Never 0.3333333 0.3333333 0.3333333
Quit  0.1111111 0.1111111 0.1666667

Summary data by column in R

4 Answers4