0

Data I also have the total number of cancer patients (case_totals) and non-cancer patients(control_totals) which in this case is 100 and 1000 respectively.

Variant  Cancer IBD AKI CKD CCF IHD
A1         0    5   4   0   0   4
A2         0    8   5   9   0   7
A3         20   9   6   7   0   3
B5         7    2   0   6   5   4
K7         9    1   8   4   2   5
L9         0    0   6   3   3   1

Desired outcome - two tables: Table1:

 Variant     case_total not_seen_in_cases_total control_total not_seen_in_control_total
    A1             0           100                    13                  987  
    A2             0           100                    25                  975 
    A3             20          80                     25                  975
    B5             7           93                     17                  983
    K7             9           91                     20                  980
    L9             0           100                    13                  987

Table2:

case_total_in_gene  not_seen_in_gene_cases      control_total_in_gene control_total_not_in_gene
36                         64                            113                 887

I will then run a fishers across both tables to get a per variant and per gene p.value which I can do.

My issue is that I have multiple such datasets and in each the order of the columns of the input is different. At present I have been using:

ncol(dt) #to get the total number of columns as in reality the table is very large
which(colnames(dt)=='Cancer') #get the index column 
dt$control_total <- (rowSums(dt[,2:7])) - rowSums(dt[,2]) #get a control totals per row column 

And then subsetting dt and just adding in the other columns using subtraction e.g. dt$not_seen_in_control_total <- 1000 - dt$control_total

This won't work with shifting column indices and I want to run this across hundreds of files ideally using a commandArgs.

Ultimately how do I reference a column which will always have the same name but will be in different places in a function like RowSums etc?

Many thanks

tacrolimus
  • 500
  • 2
  • 12

1 Answers1

1

You can select column names by position or pattern in names or by specifying range of columns. It depends on how your data is structured.

library(dplyr)

table1 <- df %>%
  mutate(control_total = rowSums(select(., setdiff(2:ncol(.), 
                                 match('Cancer', names(.)))))) %>%
  transmute(Variant, Cancer, 
            not_seen_in_cases_total = 100 - Cancer, 
            control_total, 
            not_seen_in_control_total = 1000 - control_total)
table1

#  Variant Cancer not_seen_in_cases_total control_total not_seen_in_control_total
#1      A1      0                     100            13                       987
#2      A2      0                     100            29                       971
#3      A3     20                      80            25                       975
#4      B5      7                      93            17                       983
#5      K7      9                      91            20                       980
#6      L9      0                     100            13                       987

table2 <- table1 %>%
  summarise(case_total_in_gene = sum(Cancer), 
            not_seen_in_gene_cases = 100 - case_total_in_gene, 
            control_total_in_gene = sum(control_total), 
            control_total_not_in_gene = 1000 - control_total_in_gene)

table2
# case_total_in_gene not_seen_in_gene_cases control_total_in_gene control_total_not_in_gene
#1                 36                     64                   117                       883

data

df <- structure(list(Variant = c("A1", "A2", "A3", "B5", "K7", "L9"
), Cancer = c(0L, 0L, 20L, 7L, 9L, 0L), IBD = c(5L, 8L, 9L, 2L, 
1L, 0L), AKI = c(4L, 5L, 6L, 0L, 8L, 6L), CKD = c(0L, 9L, 7L, 
6L, 4L, 3L), CCF = c(0L, 0L, 0L, 5L, 2L, 3L), IHD = c(4L, 7L, 
3L, 4L, 5L, 1L)), class = "data.frame", row.names = c(NA, -6L))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thanks Ronak. In reality there will always be a cancer column but any number of other columns too with different disease names. Is there a way of selecting all the columns from 2 onwards bar the "cancer" one? – tacrolimus Oct 20 '20 at 12:11
  • 1
    Select all columns from 2 onwards removing `Cancer` column. Change the `mutate` + `select` line (2nd line) to `mutate(control_total = rowSums(select(setdiff(2:ncol(.), match('Cancer', names(.))))))` – Ronak Shah Oct 20 '20 at 12:15
  • I get: Error: Problem with 'mutate()' input 'control_total'. x no applicable method for 'select_' applied to an object of class "c('integer', 'numeric')". Input 'control_total' is 'rowSums(...)'. – tacrolimus Oct 20 '20 at 13:00
  • @tacrolimus I forgot the `"."` in `select`. I have also updated the answer. Can you check the answer now? – Ronak Shah Oct 20 '20 at 13:07
  • I get: Error in UseMethod("mutate_"): no applicable method for 'mutate_2 applied to an object of class "function". In addition: Warning message: 'mutate_()' is depreceated as of dplyr 0.7.0. Please use 'mutate()' instead. – tacrolimus Oct 20 '20 at 13:13
  • @tacrolimus I updated the post with the data that I am using. Can you check if it works for you on that data? 1) If it does not work on the data shared in my answer I think you might have `plyr` loaded which is causing `mutate` to fail. Try with `dplyr::mutate` and `dplyr::select`. 2) If it works for my data and not on your data can you provide your data using `dput` similar to my answer? – Ronak Shah Oct 20 '20 at 14:08