get frequency based on two columns

Question

Here is a snippet of my large dataframe:

MARKERS.IN.HAPLOTYPES BASE           rs. alleles chrom       pos        GID marker   trial
                 1A.12    C S1A_494392059     C/G    1A 494392059 GID7173723      2 ES26-38
                 1A.13    C S1A_497201550     C/T    1A 497201550 GID7173723      0 ES26-38
                 1A.14    T S1A_499864157     C/T    1A 499864157 GID7173723      2 ES26-38
                 1B.10    A S1B_566171302     G/A    1B 566171302 GID7173723      0 ES26-38
                 1B.20    G S1B_642616640     A/G    1B 642616640 GID7173723      2 ES26-38
                 2B.10    A  S2B_24883552     A/G    2B  24883552 GID7173723      2 ES26-38

Here is a dput of it:

structure(list(MARKERS.IN.HAPLOTYPES = c("1A.12", "1A.13", "1A.14", 
"1B.10", "1B.20", "2B.10"), BASE = c("C", "C", "T", "A", "G", 
"A"), rs. = c("S1A_494392059", "S1A_497201550", "S1A_499864157", 
"S1B_566171302", "S1B_642616640", "S2B_24883552"), alleles = c("C/G", 
"C/T", "C/T", "G/A", "A/G", "A/G"), chrom = c("1A", "1A", "1A", 
"1B", "1B", "2B"), pos = c(494392059L, 497201550L, 499864157L, 
566171302L, 642616640L, 24883552L), GID = c("GID7173723", "GID7173723", 
"GID7173723", "GID7173723", "GID7173723", "GID7173723"), marker = c("2", 
 "0", "2", "0", "2", "2"), trial = c("ES26-38", "ES26-38", "ES26-38", 
 "ES26-38", "ES26-38", "ES26-38")), row.names = c(NA, 6L), class = 
 "data.frame")

There are 22 unique values in the column rs. in the original dataframe and six unique values in the column trial. I would like to calculate the relative frequencies of the different values of column marker for each unique rs. and each unique trial. For example, the first item of column rs., S1A_494392059, would have the relative frequencies of column marker for trial ES26-38, and so on. Please note that the column marker is a character vector, not numeric.

Maybe you can try `df %>% group_by(trial, marker, rs.) %>% tally()`. – tmfmnk Feb 26 '19 at 20:17
Or just `df %>% add_count(rs., trial, marker)`. You can also set the name of the new column with the `name` argument if you're using `dplyr 0.8` or above. – arg0naut91 Feb 26 '19 at 20:18
Ah, sorry. I mean the relative frequency of the `marker` column. – moth Feb 26 '19 at 20:21

arg0naut91 · Accepted Answer · 2019-02-26T20:44:44.513

You can try this:

library(dplyr)

df %>%
  add_count(rs., trial, name = "Total") %>%
  add_count(rs., trial, marker, name = "MarkerTotal") %>%
  mutate(RelativeFreq = round(MarkerTotal / Total, 2))

The name argument of add_count is a new feature from dplyr 0.8 onwards that lets you choose the name of the count column (previously it would be n or nn by default). The above code won't work if your installed dplyr is older than that.
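If updating dplyr is not an option, the same columns can be built with group_by and mutate, which do not rely on the name argument. This is only a sketch: the toy data frame below is made up for illustration and is not taken from the question.

```r
library(dplyr)

# Hypothetical toy data (not from the question): one rs. value with two
# different markers, and one rs. value with a single marker
toy <- data.frame(
  rs.    = c("S1A_494392059", "S1A_494392059", "S1A_497201550"),
  trial  = c("ES26-38", "ES26-38", "ES26-38"),
  marker = c("2", "0", "0"),
  stringsAsFactors = FALSE
)

res_old <- toy %>%
  group_by(rs., trial) %>%
  mutate(Total = n()) %>%            # rows per rs./trial combination
  group_by(rs., trial, marker) %>%
  mutate(MarkerTotal = n()) %>%      # rows per rs./trial/marker combination
  ungroup() %>%
  mutate(RelativeFreq = round(MarkerTotal / Total, 2))

res_old$RelativeFreq
# 0.5 0.5 1.0
```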

The relative frequencies in your example will be 1 everywhere, though, because each combination of rs. and trial appears only once in the snippet.
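To see values other than 1, the same rs. and trial have to occur on several rows. A small sketch with invented data (the rows below are hypothetical and only mimic the shape of the question's dataframe; it assumes dplyr 0.8 or above for the name argument):

```r
library(dplyr)

# Hypothetical toy data: the first rs. value appears four times
# (e.g. four different GIDs), the second one twice
toy <- data.frame(
  rs.    = c(rep("S1A_494392059", 4), rep("S1A_497201550", 2)),
  trial  = "ES26-38",
  marker = c("2", "2", "0", "2", "0", "0"),
  stringsAsFactors = FALSE
)

res <- toy %>%
  add_count(rs., trial, name = "Total") %>%
  add_count(rs., trial, marker, name = "MarkerTotal") %>%
  mutate(RelativeFreq = round(MarkerTotal / Total, 2))

res$RelativeFreq
# 0.75 0.75 0.25 0.75 1.00 1.00
```

Marker "2" makes up three of the four rows for S1A_494392059, hence 0.75, while the single "0" row gets 0.25.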

This is what you could do if you'd like to get a summarised data frame (with one row per combination of rs., trial and marker, and the columns MarkerTotal and RelativeFreq next to them):

df %>%
  count(rs., trial, marker, name = "MarkerTotal") %>%
  group_by(rs., trial) %>%
  mutate(RelativeFreq = round(MarkerTotal / sum(MarkerTotal), 2)) %>%
  ungroup()
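As a cross-check that does not depend on the dplyr version, the same per-group proportions can be sketched in base R with table and prop.table. The data below are invented for illustration, and pasting rs. and trial into one key is just one convenient way to get a two-way table.

```r
# Hypothetical toy data, not from the question
toy <- data.frame(
  rs.    = c(rep("S1A_494392059", 4), rep("S1A_497201550", 2)),
  trial  = "ES26-38",
  marker = c("2", "2", "0", "2", "0", "0"),
  stringsAsFactors = FALSE
)

# Rows: one key per rs./trial combination; columns: marker values
counts <- table(paste(toy$rs., toy$trial, sep = " | "), toy$marker)

# margin = 1 divides each cell by its row total,
# giving the relative frequency of each marker within rs./trial
props <- round(prop.table(counts, margin = 1), 2)
props
```

Unlike the dplyr version, this also lists marker values that never occur for a given rs. and trial, with a proportion of 0.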

edited Feb 26 '19 at 20:44

answered Feb 26 '19 at 20:33

arg0naut91

Nice, thanks. I am trying to fetch the column `base` from the summarised dataframe with `semi_join` and `inner_join`, but the first doesn't work and the second expands the data into many other rows. – moth Feb 26 '19 at 21:15
You're welcome! For the joins, you'll need to post and elaborate as I cannot think of something simple; best open a new topic with examples (best with `dput`). – arg0naut91 Feb 26 '19 at 21:20
