Data: I also have the total number of cancer patients (case_totals) and non-cancer patients (control_totals), which in this case are 100 and 1000 respectively.
Variant  Cancer  IBD  AKI  CKD  CCF  IHD
A1       0       5    4    0    0    4
A2       0       8    5    9    0    7
A3       20      9    6    7    0    3
B5       7       2    0    6    5    4
K7       9       1    8    4    2    5
L9       0       0    6    3    3    1
Desired outcome - two tables:
Table1:
Variant  case_total  not_seen_in_cases_total  control_total  not_seen_in_control_total
A1       0           100                      13             987
A2       0           100                      29             971
A3       20          80                       25             975
B5       7           93                       17             983
K7       9           91                       20             980
L9       0           100                      13             987
Table2:
case_total_in_gene  not_seen_in_gene_cases  control_total_in_gene  control_total_not_in_gene
36                  64                      117                    883
I will then run Fisher's exact test on both tables to get a per-variant and a per-gene p-value, which I already know how to do.
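For context, the testing step might look roughly like the sketch below. It assumes the per-variant counts sit in a data frame called `tab1` (a hypothetical name, here holding only a few of the rows above) and uses base R's `fisher.test` on a 2x2 matrix built from each row. The helper `fisher_p` is also a name made up for illustration.

```r
# Hypothetical per-variant table holding a few of the Table1 rows
tab1 <- data.frame(
  Variant                   = c("A3", "B5", "K7"),
  case_total                = c(20, 7, 9),
  not_seen_in_cases_total   = c(80, 93, 91),
  control_total             = c(25, 17, 20),
  not_seen_in_control_total = c(975, 983, 980)
)

# Count columns are picked by name, so their order in tab1 does not matter
count_cols <- c("case_total", "not_seen_in_cases_total",
                "control_total", "not_seen_in_control_total")

# Four counts -> 2x2 matrix (column 1 = cases, column 2 = controls;
# row 1 = variant seen, row 2 = variant not seen) -> Fisher's exact test
fisher_p <- function(counts) {
  fisher.test(matrix(counts, nrow = 2))$p.value
}

# One test per variant (one per row of tab1)
tab1$p.value <- apply(tab1[, count_cols], 1, fisher_p)
print(tab1[, c("Variant", "p.value")])
```

The same `fisher_p` helper could be applied to the four counts of the single Table2 row to get the per-gene p-value.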
My issue is that I have multiple such datasets, and in each one the input columns are in a different order. At present I have been using:
ncol(dt) # total number of columns, as in reality the table is very large
which(colnames(dt) == 'Cancer') # index of the Cancer column
dt$control_total <- rowSums(dt[, 2:7]) - dt[[2]] # per-row control total: all count columns minus Cancer (column 2 here)
I then subset dt and add the other columns by subtraction, e.g. dt$not_seen_in_control_total <- 1000 - dt$control_total
This won't work when the column indices shift, and I want to run this across hundreds of files, ideally using commandArgs.
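As a rough sketch of the multi-file part, a script could take the file paths from `commandArgs` and look up the Cancer column in each file by name. The script name `make_tables.R`, the helper `find_column`, and the assumption that the files are whitespace-delimited with a header row are all hypothetical.

```r
# Intended to be run as: Rscript make_tables.R file1.txt file2.txt ...
# (script and file names are hypothetical)

# Read one whitespace-delimited file with a header row and return the
# position of a named column, or NA if the column is absent
find_column <- function(path, column = "Cancer") {
  dt <- read.table(path, header = TRUE, stringsAsFactors = FALSE)
  match(column, colnames(dt))
}

# trailingOnly = TRUE keeps only the arguments given after the script name
files <- commandArgs(trailingOnly = TRUE)

# Report where the Cancer column sits in each input file
for (f in files) {
  cat(f, ": Cancer is column", find_column(f), "\n")
}
```

The position is looked up afresh for every file, so nothing in the script depends on a fixed column order.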
Ultimately, how do I reference a column that always has the same name but sits in a different position in each file, inside a function like rowSums?
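One possible approach, sketched below, is to select columns by name instead of by index: take the Cancer column with `dt[["Cancer"]]` and treat every column other than Variant and Cancer as a control column via `setdiff`. The function name `make_tables` and its arguments are hypothetical, and the sketch assumes `dt` is a plain data.frame (a data.table would need its own column-selection syntax).

```r
# Build the per-variant and per-gene count tables using column names only
make_tables <- function(dt, case_totals, control_totals,
                        id_col = "Variant", case_col = "Cancer") {
  # Every column that is neither the identifier nor the case column
  control_cols <- setdiff(colnames(dt), c(id_col, case_col))
  case_count    <- dt[[case_col]]
  control_count <- unname(rowSums(dt[, control_cols, drop = FALSE]))

  per_variant <- data.frame(
    Variant                   = dt[[id_col]],
    case_total                = case_count,
    not_seen_in_cases_total   = case_totals - case_count,
    control_total             = control_count,
    not_seen_in_control_total = control_totals - control_count
  )
  per_gene <- data.frame(
    case_total_in_gene        = sum(case_count),
    not_seen_in_gene_cases    = case_totals - sum(case_count),
    control_total_in_gene     = sum(control_count),
    control_total_not_in_gene = control_totals - sum(control_count)
  )
  list(per_variant = per_variant, per_gene = per_gene)
}

# The example data from above
dt <- data.frame(
  Variant = c("A1", "A2", "A3", "B5", "K7", "L9"),
  Cancer  = c(0, 0, 20, 7, 9, 0),
  IBD     = c(5, 8, 9, 2, 1, 0),
  AKI     = c(4, 5, 6, 0, 8, 6),
  CKD     = c(0, 9, 7, 6, 4, 3),
  CCF     = c(0, 0, 0, 5, 2, 3),
  IHD     = c(4, 7, 3, 4, 5, 1)
)

res <- make_tables(dt, case_totals = 100, control_totals = 1000)

# Same data with the columns in a different order gives the same tables
shuffled     <- dt[, c("IHD", "Variant", "CKD", "Cancer", "IBD", "AKI", "CCF")]
res_shuffled <- make_tables(shuffled, case_totals = 100, control_totals = 1000)
print(res$per_variant)
print(res$per_gene)
```

Because `setdiff` is computed from the column names of each file, any extra or reordered condition columns are picked up automatically, as long as the Variant and Cancer names stay the same.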
Many thanks