Get frequency using two groupings in a dataframe

Question

I have a data frame called dat, as follows:

chr   chrStart  chrEnd  Gene    RChr    RStart  REnd    Rname   distance
chr1    39841   39883   Gene1   chr1    398     3984    Cha1b   0
chr1    39841   39883   Gene1   chr1    398     3985    Ab      0
chr1    39841   39883   Gene1   chr1    398     3986    Tia     0
chr1    39841   39883   Gene1   chr1    398     3987    MEA     0
chr1    39841   39883   Gene1   chr1    398     3988    La      0
chr1    39841   39883   Gene1   chr1    398     3989    M3      0
chr1    14893   15893   Gene2   chr1    398     3984    Cha1b   0
chr1    14893   15893   Gene2   chr1    398     3985    Cha1b   0
chr1    14893   15893   Gene2   chr1    398     3986    Cha1b   0
chr1    14893   15893   Gene2   chr1    398     3987    MEA     0
chr1    14893   15893   Gene2   chr1    398     3988    MEA     0
chr1    39841   39883   Gene1   chr1    398     3989    M3      0

I want to count how often each Rname appears for each gene, so the result for the data above should look like:

Gene     Rname    Freq
Gene1    Cha1b    1
Gene1    Ab       1
Gene1    Tia      1
Gene1    MEA      1
Gene1    La       1
Gene1    M3       2
Gene2    Cha1b    3
Gene2    MEA      2

I tried doing two groupings with dplyr, but I don't think it makes sense, and in any case it just gives me the frequency of all the Rnames for each gene:

library(dplyr)
GroupTbb <- dat %>% 
                group_by(Gene) %>% 
                group_by(Rname) %>% 
                summarise(freq = sum(Rname))

A base R option is `subset(as.data.frame(table(dat[c('Gene', 'Rname')])), Freq != 0)` – akrun Apr 05 '15 at 12:22

David Arenburg · Accepted Answer · 2015-04-05T12:22:51.530


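You should use n() to count the observations (you can't sum non-numeric values such as Rname), and you can group by both variables at once. Chaining two group_by() calls doesn't work because, by default, the second call replaces the first grouping instead of adding to it.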

dat %>% 
  group_by(Gene, Rname) %>% 
  summarise(freq = n())

# Source: local data frame [8 x 3]
# Groups: Gene
# 
# Gene Rname freq
# 1 Gene1    Ab    1
# 2 Gene1 Cha1b    1
# 3 Gene1    La    1
# 4 Gene1    M3    2
# 5 Gene1   MEA    1
# 6 Gene1   Tia    1
# 7 Gene2 Cha1b    3
# 8 Gene2   MEA    2

Or use tally(), as in:

dat %>% 
  group_by(Gene, Rname) %>% 
  tally()

Or (as suggested by @hrbrmstr) you can skip the grouping step by using count():

dat %>%
  count(Gene, Rname)
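
For anyone who wants to reproduce this, here is a minimal sketch. It rebuilds only the Gene and Rname columns, retyped from the question (the other columns don't affect the counts), and checks which grouping survives the chained group_by() calls. The names chained and freq are just illustrative, not from the original post.

```r
library(dplyr)

# Only the two columns that matter, retyped from the question's data
dat <- data.frame(
  Gene  = c(rep("Gene1", 6), rep("Gene2", 5), "Gene1"),
  Rname = c("Cha1b", "Ab", "Tia", "MEA", "La", "M3",
            "Cha1b", "Cha1b", "Cha1b", "MEA", "MEA", "M3"),
  stringsAsFactors = FALSE
)

# Chained group_by(): the second call overrides the first,
# so the data ends up grouped by Rname only
chained <- dat %>% group_by(Gene) %>% group_by(Rname)
group_vars(chained)  # returns "Rname"

# Grouping by both variables in one call gives one row per Gene/Rname pair
freq <- dat %>%
  group_by(Gene, Rname) %>%
  summarise(freq = n()) %>%
  ungroup()
freq
```

The ungroup() at the end is optional; it just drops the leftover grouping by Gene that summarise() keeps.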

edited Apr 05 '15 at 12:22

answered Apr 05 '15 at 12:12


If I wanted to get it into a format with Gene along the rows and Rname along the columns, how would I do that? (Happy to ask a separate question if necessary.) – Sebastian Zeki Apr 05 '15 at 12:29

@user362206 Just use `table` as in the comments for that or you may need `spread` from `tidyr` or `dcast` from `reshape2` – akrun Apr 05 '15 at 12:31
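
As a rough sketch of the `table` route mentioned in this comment: cross-tabulating the two columns gives genes in rows and Rnames in columns directly, using base R only. The data below is just the Gene and Rname columns retyped from the question, and `wide` and `wide_df` are illustrative names.

```r
# Only the two columns that matter, retyped from the question's data
dat <- data.frame(
  Gene  = c(rep("Gene1", 6), rep("Gene2", 5), "Gene1"),
  Rname = c("Cha1b", "Ab", "Tia", "MEA", "La", "M3",
            "Cha1b", "Cha1b", "Cha1b", "MEA", "MEA", "M3"),
  stringsAsFactors = FALSE
)

# Contingency table: one row per Gene, one column per Rname
wide <- table(dat$Gene, dat$Rname)
wide

# If a plain data frame is needed, this keeps the wide layout
# (as.data.frame() on a table would give the long format instead)
wide_df <- as.data.frame.matrix(wide)
wide_df
```

Combinations that never occur show up as 0 here, unlike the long format, where they are simply absent.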

Colonel Beauvel · Answer 2 · 2015-04-05T12:24:29.733


You can try data.table:

library(data.table)
setDT(dat)[, .(count = .N), by = .(Gene, Rname)]

#    Gene Rname count
#1: Gene1 Cha1b     1
#2: Gene1    Ab     1
#3: Gene1   Tia     1
#4: Gene1    M3     2
#5: Gene2 Cha1b     3
#6: Gene2   MEA     2
#7: Gene1   MEA     1
#8: Gene1    La     1
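
One thing worth knowing about the call above: setDT() converts dat to a data.table by reference, so dat itself changes class. If that isn't wanted, a sketch along these lines, using as.data.table() to work on a copy, should give the same counts. The data is retyped from the question (Gene and Rname only), and dt and counts are illustrative names.

```r
library(data.table)

# Only the two columns that matter, retyped from the question's data
dat <- data.frame(
  Gene  = c(rep("Gene1", 6), rep("Gene2", 5), "Gene1"),
  Rname = c("Cha1b", "Ab", "Tia", "MEA", "La", "M3",
            "Cha1b", "Cha1b", "Cha1b", "MEA", "MEA", "M3"),
  stringsAsFactors = FALSE
)

# as.data.table() returns a copy, leaving dat untouched
dt <- as.data.table(dat)

# .N is the number of rows in each Gene/Rname group
counts <- dt[, .(count = .N), by = .(Gene, Rname)]
counts

# dat is still a plain data.frame
class(dat)
```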

edited Apr 05 '15 at 12:24

answered Apr 05 '15 at 12:15


This one also gave me what I wanted, but I decided to go with the one above – Sebastian Zeki Apr 05 '15 at 12:23

No problem! If you prefer dplyr, feel free to use it of course ;) – Colonel Beauvel Apr 05 '15 at 12:25
