Use lapply() to find percentages of factor variables

Question

I have a data frame that consists of 4 columns that represent questions, and each column as 4 levels that represent responses.

  Q1 Q2
1  A  A
2  A  B
3  B  B
4  C  C
5  D  D

And I'd like to derive a data.frame such as this:

   question response percent
1       Q2        A     0.2
2       Q2        B     0.4
3       Q2        C     0.2
4       Q2        D     0.2
5       Q1        A     0.4
6       Q1        B     0.2
7       Q1        C     0.2
8       Q1        D     0.2

So far, I've been achieving this with a for loop, but my scripts are riddled with for loops so I'd like to achieve this using functions in reshape2 or with lapply. For instance this code is a lot cleaner than a for loop but still not quite what I'm looking for. Any help would be greatly appreciated!

Here's what I've got so far:

lapply(lapply(df, summary), function(x) x/sum(x))

EDIT: Including example of data frame per request. I was originally afraid it would take up too much space since the level labels are so long, so I shortened them.

dput(df[1:4,])
structure(list(Q1 = structure(c(4L, 4L, 1L, 4L), .Label = c("1.A", 
    "1.B", "1.C", "1.D"), class = "factor"), 
    Q2 = structure(c(4L, 4L, 4L, 1L), .Label = c("2.A","2.B",
    "2.C", "2.D"), class = "factor"), 
    Q3 = structure(c(4L, 3L, 4L, 4L), .Label = c("3.A","3.B",
    "3.C","3.D"), class = "factor"), 
    Q4 = structure(c(3L, 1L, 3L, 3L), .Label = c("4.A","4.B", 
    "4.C","4.D")), 
    .Names = c("Q1.pre", "Q2.pre", "Q3.pre", "Q4.pre"), row.names = c(NA, 4L), 
    class = "data.frame")

I've found that a combination of Lafortune and user20650's responses has given me almost exactly what I've been looking for:

melt(sapply(df, function(x) prop.table(table(x))))

However there's one problem. At the sapply level, the dimnames are the same as the label names of the levels for Q1, and so after performing melt the output of sapply, the Var1 column is just a repetition of Q1s levels, whereas I'd like Var1 to have Q1's levels in the Q1 rows, Q2's levels in the Q2 rows, etc. I found a workaround by pulling the levels of all of the columns into a separate variable qnames before performing any operations on df like so:

qnames = melt(sapply(df, levels))
qnames = qnames[ ,3]
melt(sapply(df, function(x) prop.table(table(x))))
df = cbind(qnames, df)

Which is exactly the result I need. I'm interested to see if there is a way to achieve this without the extra sapply and cbind, so I'll leave the question open a little longer. Thanks for your help!

**Please post a reproducible example.** Use `dput()` on your dataframe and give us e.g. the top-20 rows — smci, Jul 10 '15 at 20:57
This is just an aggregation. Read about [tag:split-apply-combine], dplyr and data.table. Do a `group_by(question)`. — smci, Jul 10 '15 at 21:29
(by the way, `percent` should be called `fraction`, unless you multiply it by 100 in your function) — smci, Jul 10 '15 at 21:29

Pierre L · Answer 1 · 2015-07-10T21:06:44.107

2

library(reshape2)
indx <- lapply(df, function(x) prop.table(table(x)))
out <- melt(do.call(rbind, indx))
out <- out[order(out$Var1, decreasing=TRUE),];rownames(out) <- NULL
#   Var1 Var2 value
# 1   Q2    A   0.2
# 2   Q2    B   0.4
# 3   Q2    C   0.2
# 4   Q2    D   0.2
# 5   Q1    A   0.4
# 6   Q1    B   0.2
# 7   Q1    C   0.2
# 8   Q1    D   0.2

You can use lapply and reshape2, but a good function to have is prop.table. When it wraps the table function, it will do the percentage for you. We create a proportion table for each column using lapply and then combine them with the always faithful do.call(rbind, lst). The last line is for the 'look'.

edited Jul 10 '15 at 21:06

answered Jul 10 '15 at 21:01

Pierre L

28,203
6
47
69

1

if you use `sapply` you can avoid the `do.call` stuff ... ie `melt(sapply(dat, function(x) prop.table(table(x))))` – user20650 Jul 10 '15 at 21:37
1

Thank you. Always great to optimize! – Pierre L Jul 10 '15 at 21:44
No probs... the lapply / do.call might actually be faster ... but the sapply imo is more readable here – user20650 Jul 10 '15 at 21:45
I can't test right now (SO mobile), but I think one function is usually better than two. – Pierre L Jul 10 '15 at 21:55
1

I agree with greater readability for `sapply` - thanks, that was the route I pursued. – cangers Jul 17 '15 at 13:35

score 2 · Accepted Answer · answered Jul 10 '15 at 21:07

2

One-liner using data.table:

library(data.table) # 1.9.5+
dt<-data.table(Q1=c("A","A","B","C","D"),
               Q2=c("A","B","B","C","D"))

rbindlist(lapply(
  names(dt),
  function(x)dt[,.N/nrow(dt),by=x
                ][,.(question=x,response=get(x),percent=V1)]))

answered Jul 10 '15 at 21:07

MichaelChirico

33,841
14
113
198

Thank you, this is incredibly efficient code! However, I prefer to use the `sapply` and `lapply` functions because I think the code is more transparent and easier to reproduce (I definitely need to read up on data.table package). – cangers Jul 17 '15 at 13:34
@Killerfurbies you'll notice the crux of my code is indeed `lapply`. If you want a great resource for starting `data.table`, check out the Getting Started vignettes on their github page, they're crystal clear ways to dig in. And the payoff to learning `data.table`, I must add, is IMO immense. – MichaelChirico Jul 17 '15 at 13:42
I've returned to your code and am trying to implement it in a custom function, but I keep receiving the error `Error in .subset(x, j) : invalid subscript type 'list'`. I've been scouring stackoverflow and I believe it has to do with the frame in which the `j` expression, or `by`, is evaluated. Do you have any ideas as to how to solve this? – cangers Aug 18 '15 at 19:37
That's because the `.(` alias is new to the development version of `data.table`, see the [GitHub](https://github.com/Rdatatable/data.table/wiki/Installation) for instructions on installing the latest updates. If you don't feel like doing that (I recommend it because there are all sorts of goodies in 1.9.5), you can replace `.(question...` with `list(quesion...` – MichaelChirico Aug 18 '15 at 19:40
Ah, I was indeed using an older version of `data.table`. Now, after replacing `.(` with `list(`, I'm getting the error `Error in get(x) : object 'Q1' not found`. Again, this is only in the personal function that it happens, not when I run the same code in the console. Thanks for your help @MichaelChirico! – cangers Aug 19 '15 at 14:30
@Killerfurbies strange. Code still works for me. Could you print the output of `names(dt)`? – MichaelChirico Aug 19 '15 at 15:52
This is the output of `print(names(dt))` immediately before your code: `[1] "Q1" "Q2" "Q3" "Q4"` `Error in get(x) : object 'Q1' not found` Again, it works perfectly fine when I enter the code directly into the console, it's just within my function that it returns this error. I often run into problems when manipulating `data.table` objects within functions, usually when modifying columns by reference. But that can be solved by enclosing the `LHS` in parentheses. Perhaps this deserves a new thread... – cangers Aug 19 '15 at 17:25

Use lapply() to find percentages of factor variables

2 Answers2