
I have a dataset with 15 metrics (columns) read from a CSV file. One of the metrics is called Cancer.

This is what the column looks like:

Cancer:  yes no yes no

I would like to create a table with the percentages of Cancer yes and no. However, I am making several different subsets (e.g. filtered dataset 1: agegroup 50-54 and numberrelatives = 1; filtered dataset 2: agefirstchild < 30 and breastdensity = extremely dense), and I would like a single table showing the cancer yes/no percentages for all of the subsets.

Example dataset:

`cancer = c("yes", "no")
agegroup = c("35-39", "40-44")
numberrelatives = c("zero", "one")
agefirstchild = c("Age < 30", "Age 30 or greater")
df = data.frame(cancer, agegroup, numberrelatives, agefirstchild)`
Kirsten
  • The vectors you supplied for the example dataset have differing lengths, and a data frame can't be made from that. Maybe use the `rep()` function to repeat each vector a few times so they all have the same length (e.g. 6, 4, 4, and 4 times)? – Adam B. Feb 09 '20 at 20:28
  • Read up on these functions: `table`, `xtabs`, `prop.table`, `addmargins` and `margin.table`. You can make your question(s) more specific by including a sample of your data using `dput` or by indicating one of the data sets included with R that has data similar to yours. If you actually run the code in your example, you will see that it does not run. – dcarlson Feb 09 '20 at 20:28

3 Answers

Yes, thanks, it partly works without the `group_by`: it gives me the summary for one data frame. But I would like to show several filtered data frames in one table, e.g. filtered dataset 1: no %, filtered dataset 2: no %.

Kirsten

With dplyr you can do the following:

library(dplyr)

# Proportion of "yes" within every combination of the grouping variables
df %>%
   group_by(agegroup, numberrelatives, agefirstchild) %>%
   summarize(prop_cancer = mean(cancer == 'yes'))

Note that the table will be in a long format (but there are ways to make it wide).
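For the follow-up case of several differently filtered data sets in one table, a minimal sketch (not necessarily what the asker ended up using) is to put the filtered data frames in a named list, stack them with `bind_rows(.id = ...)`, and summarise per subset. The data below is simulated, and the subset labels and the column names `n`, `pct_yes` and `pct_no` are made up for illustration.

```r
library(dplyr)

# Hypothetical data standing in for the real CSV
set.seed(1)
df <- data.frame(
  cancer          = sample(c("yes", "no"), 200, replace = TRUE),
  agegroup        = sample(c("35-39", "40-44", "50-54"), 200, replace = TRUE),
  numberrelatives = sample(c("zero", "one"), 200, replace = TRUE),
  agefirstchild   = sample(c("Age < 30", "Age 30 or greater"), 200, replace = TRUE)
)

# Each list element is one filtered data set; the names become row labels
subsets <- list(
  "50-54, one relative" = filter(df, agegroup == "50-54", numberrelatives == "one"),
  "first child < 30"    = filter(df, agefirstchild == "Age < 30")
)

# Stack the subsets, keep track of which subset each row came from,
# then compute the yes/no percentages per subset
result <- bind_rows(subsets, .id = "subset") %>%
  group_by(subset) %>%
  summarise(
    n       = n(),
    pct_yes = 100 * mean(cancer == "yes"),
    pct_no  = 100 * mean(cancer == "no")
  )
result
```

Because the subsets are filtered independently, the same row can appear in more than one subset; each line of `result` describes one filter on its own.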

Adam B.
  • Yes, thanks, it partly works without the `group_by`: it gives me the summary for one data frame. But I would like to show several filtered data frames in one table, e.g. filtered dataset 1: no %, filtered dataset 2: no %. – Kirsten Feb 09 '20 at 21:00
  • Can you also give partial points :) – Kirsten Feb 09 '20 at 21:46
  • You can use the same method for an arbitrary number of variables. E.g. you could do `group_by(agegroup)` and `group_by(numberrelatives)` separately for the "main effects" of `agegroup` and `numberrelatives`. By that logic, when you `group_by` more variables simultaneously, you're checking the interactions of the categorical variables you grouped by. – Adam B. Feb 09 '20 at 21:59
  • Yes this indeed does what it needs to do as well. Thanks – Kirsten Feb 10 '20 at 21:08
  • I marked your answer as the solution, but someone closed the question because it was already answered :( I think that is why it turned gray again. – Kirsten Feb 10 '20 at 21:47
  • No worries, the important thing is you got your code working! – Adam B. Feb 11 '20 at 06:12
  • Yes thanks for that. The outcome gives valuable insights :) – Kirsten Feb 11 '20 at 10:50

Here are some approaches with base R. But first we need some reproducible data:

set.seed(42)
cancer <- sample(c("yes", "no"), 200, replace=TRUE) 
agegroup <- sample(c("35-39", "40-44", "45-49"), 200, replace=TRUE)  
numberrelatives <- sample(c("zero", "one", "2 or more"), 200, replace=TRUE)  
agefirstchild <- sample(c("Age < 30", "Age 30 or greater", "nulliparous"), 200, replace=TRUE)
dat <- data.frame(cancer, agegroup, numberrelatives, agefirstchild)

Now you can create tables:

(tbl <- xtabs(~agegroup+cancer, dat))
#         cancer
# agegroup no yes
#    35-39 38  31
#    40-44 38  32
#    45-49 35  26
addmargins(tbl)
#         cancer
# agegroup  no yes Sum
#    35-39  38  31  69
#    40-44  38  32  70
#    45-49  35  26  61
#    Sum   111  89 200

Or percentages (margin 1 gives row percentages, margin 2 gives column percentages):

options(digits=3)
prop.table(tbl, 1) * 100
#         cancer
# agegroup   no  yes
#    35-39 55.1 44.9
#    40-44 54.3 45.7
#    45-49 57.4 42.6
prop.table(tbl, 2) * 100
#         cancer
# agegroup   no  yes
#    35-39 34.2 34.8
#    40-44 34.2 36.0
#    45-49 31.5 29.2
dcarlson
  • Works beautifully! It gives great insight into where the problems in the subgroups are. Thank you! Would it be possible to also add another column, e.g. agefirstchild, to the agegroup + cancer table, so you can look at the combination? For example, a row with agefirstchild = <30, agegroup = 35-39, no = 34, yes = 42. – Kirsten Feb 09 '20 at 21:46
  • Yes. For example try `xtabs(~agegroup+agefirstchild+cancer, dat)`. This creates a 3-dimensional table. To customize which variables are rows and which are columns, you use `ftable()`. – dcarlson Feb 10 '20 at 03:52
  • I marked your answer as the solution, but someone closed the question because it was already answered :( I think that is why it turned gray again. – Kirsten Feb 10 '20 at 21:52
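
The three-way table suggested in the comments above could look roughly like the sketch below. It uses base R only and simulated data; the object names `tbl3`, `pct` and `flat`, and the choice of `agefirstchild` and `agegroup` as row variables, are assumptions for illustration.

```r
# Simulated data, similar in shape to the example in the answer
set.seed(42)
dat <- data.frame(
  cancer        = sample(c("yes", "no"), 200, replace = TRUE),
  agegroup      = sample(c("35-39", "40-44", "45-49"), 200, replace = TRUE),
  agefirstchild = sample(c("Age < 30", "Age 30 or greater"), 200, replace = TRUE)
)

# Three-way table of counts; note the leading ~ in the formula
tbl3 <- xtabs(~ agefirstchild + agegroup + cancer, dat)

# Percentages of cancer yes/no within each agefirstchild x agegroup cell
pct <- prop.table(tbl3, c(1, 2)) * 100

# Flatten to two dimensions: the two grouping variables as rows,
# cancer as columns
flat <- ftable(round(pct, 1), row.vars = c("agefirstchild", "agegroup"))
flat
```

Passing `c(1, 2)` as the margin makes the yes and no percentages sum to 100 within every combination of the two row variables, which matches the row layout asked for in the comment.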