How to use apply() to normalize datamatrix with respect to specific columns

Question

I am trying to normalize the values in a matrix with respect to my controls (R247, R235, R241).

My coldata is:

        Condition  Tank
R235      Control    T6
R236  LowExposure    T6
R239 HighExposure    T6
R241      Control    T8
R242  LowExposure    T8
R245 HighExposure    T8
R247      Control T14_3
R248  LowExposure T14_3
R250 HighExposure T14_3

and my matrix mydata:

                       R235      R236      R239      R241      R242      R245     R247      R248      R250
ENSDARG00000033160 11.91873 10.899929 10.831388 12.092478 11.564555 10.908011 11.67680 11.168115 10.414632
ENSDARG00000013522 12.39036 11.692673 11.439107 12.440952 11.841307 11.118888 12.13594 11.634806 11.336330
ENSDARG00000103295 10.54697 10.004169  8.753556 10.659075  9.980232  8.511240 11.11711 10.690518  9.240825
ENSDARG00000056765  9.18106  8.488917  7.431641  9.440119  8.830816  7.901337 10.39879  9.899546  8.142807
ENSDARG00000087303 11.07447 10.765197 11.682291 11.010172 10.380666 11.487207 11.05384 10.526109 11.962465
ENSDARG00000018478 11.51562 11.000702 10.382845 11.597848 11.218944 10.185381 11.61043 11.214280 10.614338

I extract the control samples via:

x <- which(coldata$Condition %in% "Control")
control <- row.names(coldata[x,])

Similar to a z-score transformation, I would like to use the mean and sd of the control samples only, transforming each row as (x - mean[control]) / sd[control], with something like:

function(x){
(x - rowMeans[,control])/apply(matrix[,control],1,sd)
}

and then use apply() to run this over mydata, like apply(mydata, 1, function(x)), but I don't know how to write this properly as a function that can be passed to apply(). Any help is highly appreciated. Thanks!

ThomasIsCoding · Answer 1 · 2020-01-20T12:29:03.757

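Maybe you can try the following code:

# pool all control values into a single vector (c() drops the matrix dimensions)
ctr <- c(mydata[, control])
# center and scale every entry by the pooled control mean and sd
mydata <- (mydata - mean(ctr)) / sd(ctr)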

such that

> mydata
                         R235       R236       R239       R241         R242       R245       R247
ENSDARG00000033160  0.7649126 -0.3416566 -0.4161023  0.9536287  0.380225962 -0.3328783  0.5021407
ENSDARG00000013522  1.2771728  0.5193811  0.2439708  1.3321233  0.680819738 -0.1038346  1.0008349
ENSDARG00000103295 -0.7250225 -1.3145850 -2.6729365 -0.6032598 -1.340584125 -2.9361276 -0.1057658
ENSDARG00000056765 -2.2086036 -2.9603737 -4.1087325 -1.9272271 -2.589020615 -3.5985729 -0.8859680
ENSDARG00000087303 -0.1520791 -0.4879955  0.5081047 -0.2219163 -0.905653327  0.2962145 -0.1744864
ENSDARG00000018478  0.3270753 -0.2322021 -0.9032866  0.4163871  0.004841085 -1.1177618  0.4300530
                            R248       R250
ENSDARG00000033160 -0.0503667586 -0.8687612
ENSDARG00000013522  0.4565289817  0.1323397
ENSDARG00000103295 -0.5691080347 -2.1436899
ENSDARG00000056765 -1.4282211043 -3.3363006
ENSDARG00000087303 -0.7476806273  0.8124153
ENSDARG00000018478 -0.0002247121 -0.6518508
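
To see why this uses one pooled mean and sd rather than one pair per row, here is a minimal sketch on a made-up toy matrix (the names `m`, `C1`, `C2` and `T1` are illustrative only and not from the question). `c()` drops the matrix dimensions, so `mean()` and `sd()` see all control values at once:

```r
# Toy data (made-up values): two control columns (C1, C2), one treated column (T1)
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2,
            dimnames = list(c("g1", "g2"), c("C1", "C2", "T1")))
control <- c("C1", "C2")

# c() flattens the control sub-matrix into a plain vector
ctr <- c(m[, control])
is.null(dim(ctr))   # TRUE: no rows or columns left
mean(ctr)           # 2.5, a single value shared by every row

# Every entry is shifted and scaled by the same two numbers
pooled <- (m - mean(ctr)) / sd(ctr)
pooled
```

With this approach the rows are not normalized independently, which is the point raised in the comments below.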

DATA

coldata <- structure(list(Condition = c("Control", "LowExposure", "HighExposure", 
"Control", "LowExposure", "HighExposure", "Control", "LowExposure", 
"HighExposure"), Tank = c("T6", "T6", "T6", "T8", "T8", "T8", 
"T14_3", "T14_3", "T14_3")), class = "data.frame", row.names = c("R235", 
"R236", "R239", "R241", "R242", "R245", "R247", "R248", "R250"
))

mydata <- structure(c(0.764912614946124, 1.27717284283513, -0.725022482925137, 
-2.20860361193701, -0.152079137057719, 0.327075283851119, -0.341656586405671, 
0.519381138288775, -1.31458498734254, -2.96037370907202, -0.487995549202655, 
-0.232202141300273, -0.416102292318654, 0.243970801913007, -2.67293645010139, 
-4.10873247484514, 0.508104744323286, -0.90328660925782, 0.95362874851536, 
1.3321232689093, -0.603259802757428, -1.92722706172457, -0.221916314787733, 
0.416387104598013, 0.380225961822725, 0.680819737852109, -1.34058412453691, 
-2.58902061521661, -0.905653326889375, 0.00484108465005354, -0.332878334043016, 
-0.103834591964374, -2.93612761559372, -3.59857285819956, 0.296214545867932, 
-1.11776184119785, 0.502140702783651, 1.00083493562074, -0.105765764038217, 
-0.885967971059041, -0.174486381086619, 0.430053025314037, -0.0503667586240605, 
0.456528981709989, -0.569108034749636, -1.42822110426138, -0.747680627262844, 
-0.000224712061085116, -0.868761206158128, 0.132339715167575, 
-2.14368994546163, -3.33630057435765, 0.812415320598279, -0.651850829229599
), .Dim = c(6L, 9L), .Dimnames = list(c("ENSDARG00000033160", 
"ENSDARG00000013522", "ENSDARG00000103295", "ENSDARG00000056765", 
"ENSDARG00000087303", "ENSDARG00000018478"), c("R235", "R236", 
"R239", "R241", "R242", "R245", "R247", "R248", "R250")))

Hi ThomasIsCoding, sorry, maybe I wasn't clear about it. I want to scale all values (control and non-control groups) with respect to the mean and sd of the control samples, and that needs to be done per row. To my knowledge, ```scale()``` only works column-wise, but it goes in the right direction. — han5000, Jan 20 '20 at 12:06
I see what you are doing there. I just tried the code, but it gives me this error: ```Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) : is.atomic(x) is not TRUE``` — han5000, Jan 20 '20 at 12:26
@han5000 I have no information about your data type. You can try the code with the data in my answer — ThomasIsCoding, Jan 20 '20 at 12:30
@ThomasIsCoding The issue in your code seems to be that you are creating a vector instead of keeping it as a matrix: `ctr <- mydata[, control]`. Then, based on the OP's question, you should use `rowMeans`/`rowSds`: `library(matrixStats); (mydata - rowMeans(ctr)) / rowSds(ctr)` — akrun, Jan 20 '20 at 17:54
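
For reference, the per-row version suggested in the last comment can be sketched in base R as follows. This is only a sketch on a made-up toy matrix (`m`, `C1`, `C2`, `T1` are illustrative names), and it uses `apply(ctr, 1, sd)` in place of `matrixStats::rowSds` so that no extra package is assumed. It also shows the `apply()` form the question asked about, and checks that both give the same result:

```r
# Toy data (made-up values): two control columns (C1, C2), one treated column (T1)
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2,
            dimnames = list(c("g1", "g2"), c("C1", "C2", "T1")))
control <- c("C1", "C2")

# Keep the controls as a matrix so row-wise statistics are possible
ctr <- m[, control, drop = FALSE]
ctr_mean <- rowMeans(ctr)      # one mean per row
ctr_sd <- apply(ctr, 1, sd)    # one sd per row

# A length-nrow vector is recycled down the columns,
# so row i is centered and scaled with its own values
z_vec <- (m - ctr_mean) / ctr_sd

# Same computation written as a function passed to apply();
# apply() over rows returns the result transposed, hence t()
z_apply <- t(apply(m, 1, function(x) (x - mean(x[control])) / sd(x[control])))

all.equal(z_vec, z_apply)      # TRUE
```

The vectorized form avoids the transpose and is usually the easier one to read; the `apply()` form makes the per-row logic explicit.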

han5000 · Accepted Answer · 2020-01-21T12:46:49.787

@ThomasIsCoding thank you so much for your help! I got it sorted out via:

ctrM <- apply(mydata[,control], 1, FUN = mean)
Sd <- apply(mydata, 1, FUN = sd)
new <- (mydata - ctrM)/Sd #centering around ctrM and scaling with Sd

Sorry that I didn't provide info about my data structure, but this works now :) I also realized that, for the purpose of scaling the rows, I need to use the Sd of the entire row and not only that of the control samples. Hence I center each row on the mean of its controls and scale by the row's overall Sd. This does the job now.
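
As a quick sanity check of this variant, here is a minimal sketch that wraps the three lines above into a function and runs it on a made-up toy matrix. The function name `center_on_controls` and the toy names `m`, `C1`, `C2`, `T1` are hypothetical and only for illustration:

```r
# Hypothetical helper: center each row on its control mean,
# then scale by the sd of the whole row
center_on_controls <- function(mat, control_cols) {
  ctr_mean <- rowMeans(mat[, control_cols, drop = FALSE])
  row_sd <- apply(mat, 1, sd)
  (mat - ctr_mean) / row_sd
}

# Toy data (made-up values): two control columns (C1, C2), one treated column (T1)
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2,
            dimnames = list(c("g1", "g2"), c("C1", "C2", "T1")))
control <- c("C1", "C2")

res <- center_on_controls(m, control)
rowMeans(res[, control])   # controls average to 0 in every row
apply(res, 1, sd)          # every row has sd 1
```

After the transformation, the controls in each row average to zero and each full row has unit sd, so the non-control values read as deviations from the controls in units of the row's overall spread.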
