Count occurrences of a value on melted data using dplyr

Question

I am trying to do a simple table after have melt my data, but using dplyr.

My data look like this

   cluster   21:30   21:45
4        c   alone   alone
6        b       %       %
12       e partner partner
14       b partner partner
20       b   alone   alone
22       c partner partner

With table I can simply

table(dta$cluster)
   a b c d e 
   2 8 5 1 4

How can I get the same results using melt and summarise?

 library(dplyr)
 library(reshape2)

 dta %>% 
 melt(id.vars = 'cluster')  %>% 
 group_by(cluster) %>% 
 summarise( n() )

What I need exactly is to table the clusters after having melt the data.

So to count correctly this data.frame

 dta %>% 
 melt(id.vars = 'cluster')

The expected output is this one

      cluster variable   value n_cluster
1        a    21:30       .         2
2        a    21:30 nuclear         2
3        a    21:45       .         2
4        a    21:45 nuclear         2
5        b    21:30       %         8
6        b    21:30 partner         8
7        b    21:30   alone         8
8        b    21:30 partner         8
9        b    21:30 partner         8
10       b    21:30 nuclear         8
11       b    21:30 partner         8
12       b    21:30 partner         8
13       b    21:45       %         8
14       b    21:45 partner         8
15       b    21:45   alone         8
16       b    21:45 partner         8
17       b    21:45 partner         8
18       b    21:45 nuclear         8
19       b    21:45 partner         8
20       b    21:45 partner         8
21       c    21:30   alone         5
22       c    21:30 partner         5
23       c    21:30       %         5
24       c    21:30 partner         5
25       c    21:30 partner         5
26       c    21:45   alone         5
27       c    21:45 partner         5
28       c    21:45       %         5
29       c    21:45 partner         5
30       c    21:45 partner         5
31       d    21:30 partner         1
32       d    21:45   alone         1
33       e    21:30 partner         4
34       e    21:30 nuclear         4
35       e    21:30 nuclear         4
36       e    21:30 nuclear         4
37       e    21:45 partner         4
38       e    21:45 nuclear         4
39       e    21:45 nuclear         4
40       e    21:45 nuclear         4

Any idea?

dta = structure(list(cluster = structure(c(3L, 2L, 5L, 2L, 2L, 3L, 
5L, 3L, 1L, 3L, 1L, 2L, 5L, 3L, 2L, 2L, 2L, 2L, 4L, 5L), .Label = c("a", 
"b", "c", "d", "e"), class = "factor"), `21:30` = structure(c(2L, 
7L, 5L, 5L, 2L, 5L, 4L, 7L, 1L, 5L, 4L, 5L, 4L, 5L, 5L, 4L, 5L, 
5L, 5L, 4L), .Label = c(".", "alone", "children", "nuclear", 
"partner", "*", "%"), class = "factor"), `21:45` = structure(c(2L, 
7L, 5L, 5L, 2L, 5L, 4L, 7L, 1L, 5L, 4L, 5L, 4L, 5L, 5L, 4L, 5L, 
5L, 2L, 4L), .Label = c(".", "alone", "children", "nuclear", 
"partner", "*", "%"), class = "factor")), .Names = c("cluster", 
"21:30", "21:45"), row.names = c("4", "6", "12", "14", "20", 
"22", "23", "28", "30", "32", "36", "38", "40", "42", "44", "48", 
"50", "56", "57", "60"), class = "data.frame")

you don;t need melt, and you have your answer: `dta %>% group_by(cluster) %>% summarise( n() )` — jeremycg, Aug 12 '15 at 20:43
My first instinct when I see "my data looks like this" is to use `read.table` to put it into R... maybe it would be better to simply use the smaller data or to omit the arguably misleading display unless preceded by a clear `head(dta)` call. — Frank, Aug 12 '15 at 20:43
ok I will do it in Excel later tonight but for now I need to do it like this :/ I know right — giac, Aug 12 '15 at 20:43
If I ask the problem I'm trying to solve with this you guys are going to kill me. So please any answer ? David please don't let me do it in excel — giac, Aug 12 '15 at 21:00
@giacomoV `count(dta, cluster)` in the comment above is *the* answer to the question as posed... — Frank, Aug 12 '15 at 21:10

David Arenburg · Accepted Answer · 2015-08-13T09:26:58.200

6

I can't seem to find a good dupe for this, but a simple dplyr idiom will be just using count

count(dta, cluster)
# Source: local data frame [5 x 2]
# 
#   cluster n
# 1       a 2
# 2       b 8
# 3       c 5
# 4       d 1
# 5       e 4

Per your new desired output, you could just left join this result to your melted data set

dta %>% 
  melt(id.vars = 'cluster')  %>% 
  left_join(., count(dta, cluster)) %>%
  arrange(cluster)
#    cluster variable   value n
# 1        a    21:30       . 2
# 2        a    21:30 nuclear 2
# 3        a    21:45       . 2
# 4        a    21:45 nuclear 2
# 5        b    21:30       % 8
# 6        b    21:30 partner 8
# 7        b    21:30   alone 8
#...

edited Aug 13 '15 at 09:26

answered Aug 12 '15 at 21:27

David Arenburg

91,361
17
137
196

1

Thanks David for help ! – giac Aug 12 '15 at 23:34
I realise actually that the result is not the one expected. Because I need to melt first and then count. `dta %>% melt(id.vars = 'cluster') %>% count(cluster)` but this doesn't give me the correct result. Thanks. – giac Aug 13 '15 at 08:58
See my edit on how to easily combine this result with your melted dataset – David Arenburg Aug 13 '15 at 09:27
2

and thanks for avoiding me a night with Excel ! ;) (kidding !) – giac Aug 13 '15 at 09:36

score 1 · Answer 2 · answered Aug 13 '15 at 09:21

When counting the distribution of a variable for repeated observations, one should take into account the number of observations.

In this example

n_episode = 2

Then the code becomes straightforward

dta %>% 
  melt(id.vars = 'cluster')  %>% 
  group_by(cluster) %>% 
  mutate( n_cluster = n() / n_episode) %>% 
  arrange(cluster)

One can use this result (n_episode) in order to calculate mean for groups of different sizes for example.

Count occurrences of a value on melted data using dplyr

2 Answers2