
I know there are many threads answering questions similar to mine, but none of the answers are specific enough for what I want to obtain.

I've got the following dataset: [image: table description]

I want to count the number of patients (found in "Var_name") that harbour each mutation (found in "var_id") and display the count in a new column ("var_freq"). I've tried things like:

y <- ALL_merged %>%
  group_by(var_id, Var_name) %>%
  summarise(n_counts = n(), var_freq = sum(var_id == Var_name))

NOTE: In case it is relevant for the answers: I had to convert "var_id" and "Var_name" to character to make this work, because they were factors.

However, this does not give me the output I want. Instead, I get the count of how many times each "var_id" appears per patient, because for each "var_id" the same "Var_name" appears many times (the rows contain additional columns with different information). The final result is therefore a higher count than I would expect:

[image: output of the code above]

I also want to add this new column to the original dataset. I believe this could be done with "mutate", for example, but I'm not sure how to set everything up.

So, in sum, what I want to get is: for each "var_id", how many different "Var_name" values I have, taking into account that the data are duplicated.

Thanks in advance!

DoRemy95
  • Please provide a reproducible example along with expected output. Images are not helpful. Read about [how to give a reproducible example](http://stackoverflow.com/questions/5963269). – Ronak Shah Dec 03 '20 at 09:56

1 Answer


It is not entirely clear what you are looking for. It would help to provide data without images (such as using dput) and clearly show what your output should be for example data.
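As a rough sketch of what sharing data with dput could look like: the data frame below is a made-up stand-in (the names df and sample_rows and all values are assumptions, not the real dataset). The idea is to keep only the relevant columns and a few rows, then paste the printed output into the question.

```r
# Hypothetical stand-in for the real dataset; the values are invented
df <- data.frame(
  var_id   = c("m1", "m1", "m2", "m2"),
  Var_name = c("p1", "p2", "p1", "p1"),
  stringsAsFactors = FALSE
)

# Keep only the columns that matter and a handful of rows
sample_rows <- head(df[, c("var_id", "Var_name")], 3)

# Prints an R expression that recreates sample_rows when pasted into a session
dput(sample_rows)
```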

Based on what you describe, this might be helpful. First, use group_by on var_id alone. Then, in summarise, you can include n() to get the number of rows for each var_id, and n_distinct to get the number of unique Var_name values for each var_id:

library(dplyr)

df %>%
  group_by(var_id) %>%
  summarise(n_counts = n(), 
            var_freq = n_distinct(Var_name))
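
If the count should be attached to every row of the original data rather than returned as a summary table, one possible variation is to swap summarise for mutate. This is only a sketch on made-up data (the toy df and the name df_with_freq are assumptions), and it assumes that one patient corresponds to one distinct Var_name:

```r
library(dplyr)

# Hypothetical toy data: the same patient (Var_name) repeats across rows
# for a given mutation (var_id), as described in the question
df <- data.frame(
  var_id   = c("m1", "m1", "m1", "m2", "m2", "m2"),
  Var_name = c("p1", "p1", "p2", "p1", "p1", "p1"),
  stringsAsFactors = FALSE
)

df_with_freq <- df %>%
  group_by(var_id) %>%
  # n_distinct counts each patient once per var_id, ignoring repeated rows
  mutate(var_freq = n_distinct(Var_name)) %>%
  ungroup()

df_with_freq
```

Because mutate keeps all rows, every row for a given var_id carries the same var_freq value. In this toy data, m1 is found in two distinct patients and m2 in one.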
Ben