Writing a function to filter and summarize data into proportion table

Question

I want to create a large proportion table that involves filtering out certain values based on one column and outputting the proportion of values equal to 0 and those greater than 0 in table. Here's an example of the data frame (df):

     ID   a   b   c   d   e   f   g
1     1   1   2   3   0   4   5   A 
2     2   0   0   1   0   2   0   A
3     3   1   5   2   1   0   0   B
4     4   5   1   2   0   1   1   B
5     5   2   0   1   0   0   0   C
...

From this, I want to come up with the proportion that b=0 or b>0 IF column a>0. For your reference, I can get this information with the following code:

prop.table(table(df$b[df$a>0]!=0))*100

However, I want to do the same with columns c and d as well as e and f (same sort of pattern so that you're filtering out when c=0 and when e=0 to get those >0 and =0 proportions for d and f, respectively). Additionally, I would love to have this output all into a single table. Might look something like this:

      b.perc   d.perc   f.perc
TRUE   75.00    20.00    66.67
FALSE  25.00    80.00    33.33

Any help is appreciated. Also, I would like to calculate the TRUE percentages across groups listed in column G, giving me an output like this:

      b.perc   d.perc   f.perc
A     100.00    0.00     50.00
B     100.00   50.00    100.00
C     0.00      0.00      0.00

akrun · Accepted Answer · 2019-07-02T17:36:42.617

2

We subset the alternate columns, use each set as inputs to mapply, get the table and prop.table based on the condition mentioned in the OP's post

out <- round(mapply(function(x, y) prop.table(table(x[y > 0] != 0)) * 100,
          df[c(FALSE, TRUE)], df[c(TRUE, FALSE)]), 2)
colnames(out) <- paste0(colnames(out), ".perc")
out
#      b.perc d.perc f.perc
#FALSE     25     80  33.33
#TRUE      75     20  66.67

If we are just interested in the TRUE percentage, then we can do this with colMeans as well

colMeans((df[c(FALSE, TRUE)] * NA^!(df[c(TRUE, FALSE)] > 0)) != 0, na.rm = TRUE)
#       b         d         f 
#0.7500000 0.2000000 0.6666667

data

df <- structure(list(a = c(1L, 0L, 1L, 5L, 2L), b = c(2L, 0L, 5L, 1L, 
0L), c = c(3L, 1L, 2L, 2L, 1L), d = c(0L, 0L, 1L, 0L, 0L), e = c(4L, 
2L, 0L, 1L, 0L), f = c(5L, 0L, 0L, 1L, 0L)), class = "data.frame",
row.names = c("1", 
"2", "3", "4", "5"))

edited Jul 02 '19 at 17:36

answered Jul 02 '19 at 17:23

akrun

874,273
37
540
662

the dataframe I gave you is only part of a much larger dataset (around 50 variables)- how could I adapt this script to pull out specific columns from the larger dataset (besides selecting and creating a new, smaller dataframe)? – Kfin Jul 02 '19 at 17:55
@Kayla Assuming that you have 50 columns and the columns of interest are the first 20 columns, then use `df1 <- df[, 1:25]` and apply the codes on the subset dataset – akrun Jul 02 '19 at 19:00
okay- only trying this with a subset (basically what I presented before, so 6 total columns), and I'm running into this error: Error: Length of logical index vector for `[` must equal number of columns (or 1): * `.data` has 6 columns * Index vector has length 2 – Kfin Jul 02 '19 at 20:07
@Kayla What is the error? Do you have a `matrix` of `data.frame` ? Please check the `str(df)`. what is the output of `df[c(FALSE, TRUE)]` and `df[, c(FALSE, TRUE)]` – akrun Jul 02 '19 at 20:08
the output of df[c(FALSE, TRUE)] and df[, c(FALSE, TRUE)] are both the same error as before (Error: Length of logical index vector for `[` must equal number of columns (or 1): * `.data` has 6 columns * Index vector has length 2) – Kfin Jul 08 '19 at 14:18
@Kayla It is not clear about the structure of your data. If you can dput the example and show the error, easier to understandd where it is not working – akrun Jul 08 '19 at 14:21
1

sorry- I got it to work. The issue was with the select function (dplyr vs. plyr) when I was creating a new smaller dataframe. This worked, thank you! – Kfin Jul 08 '19 at 14:50
wondering if you could help me use this code to add a group_by function so that I can get a table with the TRUE percents across categorical variables(i.e. A, B, B, A, C) located in a 7th column. – Kfin Jul 08 '19 at 15:56
@Kayla Can you update your post with expected output. Is it based on the `f` column – akrun Jul 08 '19 at 15:57
updated. Added a column g (but still only want to have the proportions based on columns a-f) – Kfin Jul 08 '19 at 16:56
@Kayla Do you need `lapply(split(df1[-c(1, 8)], df1$g), function(df) colMeans((df[c(FALSE, TRUE)] * NA^!(df[c(TRUE, FALSE)] > 0)) != 0, na.rm = TRUE))` – akrun Jul 08 '19 at 17:00
I'm running into an error (Error in FUN(left, right) : non-numeric argument to binary operator) and I've tried adding in as.numeric() into the function to try to solve this problem without any luck – Kfin Jul 08 '19 at 20:37
@kayla Can you check the `str(df1)` Here, I removed the 1st and 8th column as it seemed to be either character or the ID column. If the other columns are not numeric it would have an isssue. – akrun Jul 08 '19 at 20:39
1

sorry found the issue- slightly different order. Thank you again for all of your help – Kfin Jul 08 '19 at 20:50

Writing a function to filter and summarize data into proportion table

1 Answers1

data