
I have a data.table with a number of parameters (amplitude, rate, area, etc.; there are 23 in total) that belong to specific wells (singular experiments, if you will; there are 48 in total), grouped by treatments (usually ~10 in total), all measured at different time points (there can be many). I would like to first take each well and normalize (as in, divide) all the parameters by that well's median parameters at baseline (all time points before "zero" time), and then take that normalized data and normalize it again, but this time by the control treatment group, for each time point. I would also like to look at the baseline and control data beforehand and flag and remove outliers, if necessary, prior to normalization (although this is not critical at the moment; I can probably figure it out once I understand how to accomplish the normalizations).

As an example, I will create a similar data.table to what I am generating in my raw instrument data analysis code:

library(data.table)

dt = data.table(
  wellID = as.factor(c("A4", "B4", "C5", "D5", "A4", "B4", "C5", "D5",
                       "A4", "B4", "C5", "D5")),
  treatment = as.factor(c("Control", "Control", "Drug", "Drug", "Control",
                          "Control", "Drug", "Drug", "Control", "Control",
                          "Drug", "Drug")),
  time_h = c(-0.2, -0.2, -0.2, -0.2, -0.1, -0.1, -0.1, -0.1, 4, 4, 4, 4),
  area = runif(12, min = 0.5, max = 0.9),
  amp = runif(12, min = 0.1, max = 0.2),
  rate = runif(12, min = 33, max = 38)
)

I tried things like:

baseline = subset(dt, subset = time_h < 0)

to isolate the baseline timepoints, and then:

base_medians = by(baseline[, 4:ncol(baseline)], baseline$wellID,
                  function(x) {
                    apply(x, 2, median)
                  })

to get the baseline medians for each well, but I don't really know how to normalize the data in dt so that the wells and the parameters are matched correctly, let alone how to do the second normalization.

I don't think this is a good strategy anyhow; should I be deconstructing and reconstructing my dataset somehow?

Any help is appreciated!

JVP

1 Answer


This might require some tweaking of the subsetting if it isn't exactly what you're looking for. It divides the parameter columns by their median values where time_h < 0, and then by their median values where treatment == "Control".

set.seed(21)  #good practice for questions so results are reproducible

parm <- c("area", "amp", "rate")  #parameters to include
dt[, (parm) := lapply(.SD, function(x) x / median(x[time_h < 0])), .SDcols = parm]
dt[, (parm) := lapply(.SD, function(x) x / median(x[treatment == "Control"])), .SDcols = parm]

    wellID treatment time_h      area       amp      rate
 1:     A4   Control   -0.2 0.9541129 0.7538275 0.9403151
 2:     B4   Control   -0.2 0.7040382 1.1530667 1.0081769
 3:     C5      Drug   -0.2 0.9134096 0.8369863 0.9780808
 4:     D5      Drug   -0.2 0.6721809 0.7392173 1.0067250
 5:     A4   Control   -0.1 1.0354136 1.0865999 0.9978287
 6:     B4   Control   -0.1 1.0162338 0.9134001 0.9918002
 7:     C5      Drug   -0.1 0.6334486 1.0678871 1.0280474
 8:     D5      Drug   -0.1 0.6664317 1.1639014 0.9696164
 9:     A4   Control    4.0 1.0477798 0.7204991 1.0021713
10:     B4   Control    4.0 0.9837662 1.1454020 1.0149003
11:     C5      Drug    4.0 0.8985494 1.2648977 1.0190920
12:     D5      Drug    4.0 1.0239782 1.3705952 0.9268626
manotheshark
  • Thanks @manotheshark! Turns out I knew nothing about data.tables, and now know a little more! I had to add by = "wellID" to the first normalization, and by = "time_h" to the second normalization for proper subsetting. For some reason, it was buggy when subsetting by "params" and saving the results to "params", so I had to make new columns to save to "norm_params". This was preferred anyhow, to preserve the original data – JVP Jul 20 '17 at 14:34
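
Pulling the answer and the comment together, a minimal sketch of the corrected version might look like the following (the norm_ column prefix is an assumption based on the "norm_params" naming mentioned in the comment):

```r
library(data.table)
set.seed(21)

# Example data as in the question
dt <- data.table(
  wellID = factor(rep(c("A4", "B4", "C5", "D5"), 3)),
  treatment = factor(rep(c("Control", "Control", "Drug", "Drug"), 3)),
  time_h = rep(c(-0.2, -0.1, 4), each = 4),
  area = runif(12, min = 0.5, max = 0.9),
  amp  = runif(12, min = 0.1, max = 0.2),
  rate = runif(12, min = 33, max = 38)
)

parm <- c("area", "amp", "rate")
norm_parm <- paste0("norm_", parm)   # new columns, preserving the raw data

# Step 1: per-well baseline normalization
# (divide by each well's median over the pre-zero time points)
dt[, (norm_parm) := lapply(.SD, function(x) x / median(x[time_h < 0])),
   by = wellID, .SDcols = parm]

# Step 2: per-time-point normalization by the Control group
# (divide by the Control median at each time point)
dt[, (norm_parm) := lapply(.SD, function(x) x / median(x[treatment == "Control"])),
   by = time_h, .SDcols = norm_parm]
```

After step 2, the Control-group median of each normalized column is 1 at every time point, which is a quick sanity check that the grouping worked.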