
Prior to calculating a PCA, I need to normalize my data. I have a matrix where the row names represent the disease group (0 is control, 1 is ulcerative colitis and 2 is Crohn's disease). The rest of the data are gene expression values.

I have tried a log transformation, which did not normalize the data (as confirmed by plotting histograms for some of the columns and by the Anderson-Darling test).
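
A per-column check of this kind could be sketched as follows. This is only an illustration on made-up data: `expr` is a hypothetical stand-in for the expression matrix, and base R's `shapiro.test` is used here in place of the Anderson-Darling test (`nortest::ad.test`) so that no extra package is needed.

```r
# Toy stand-in for the expression matrix (assumed data): right-skewed columns
set.seed(42)
expr <- cbind(GeneA = rlnorm(27, meanlog = -7, sdlog = 2),
              GeneB = rexp(27, rate = 50))

# Log-transform, then test every column for normality.
# A small p-value suggests the column is still not normal.
log_expr <- log(expr)
p_values <- apply(log_expr, 2, function(col) shapiro.test(col)$p.value)
print(round(p_values, 3))
```

With hundreds of genes, some small p-values are expected by chance alone, so the proportion of rejected columns is more informative than any single test.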

Update: I am trying the Box-Cox transformation. I am not sure how to convert my matrix of values into a linear model object before using the code below (where lm would be replaced by my data). I understand the lm formula has to be in the form response ~ terms, where terms specifies a linear predictor for the response.

      bc=boxcox(Gene1 ~ 1, lambda=seq(-2, 2))  # as suggested in comments

Not sure whether I would need to change the terms variable to disease (once disease column has been added to data).

         bc=boxcox(Gene1 ~ disease , lambda=seq(-2,2))

         best.lam=bc$x[which(bc$y==max(bc$y))]

There are 27 rows and 13 columns in the sample below. How would I easily apply the transformation to each column in the data set?

Firstly, I am unsure how I would fit a linear model to each column quickly. The ?lm help page states that if the response variable is a matrix, then you can use model.matrix to fit a linear model to individual columns prior to calculating boxcox. However, there are no examples of this online or in R help.

Secondly, I am unsure how I would then transform the values of each column with its corresponding lambda quickly (potentially a for loop or one of the apply functions).

Please find below my new data. The real thing contains over 600 genes and 190 rows. Any further help would be appreciated.

     structure(c(5.54e-05, 5.58e-06, 9.74e-05, 1.33e-06, 1.29e-05, 
     7.22e-06, 0.000215899, 3.6e-06, 0.000146724, 1.53e-05, 0.000913187, 
     1.9e-06, 0.007421464, 0.000648006, 5.1e-06, 6.15e-06, 4.73e-06, 
     0.000119899, 0.000884487, 0.000850632, 0.000236607, 7.36e-06, 
     8.48e-06, 2.63e-05, 0.001368493, 1.12e-05, 0.000177568, 0.006338532, 
     0.006162866, 0.040695132, 0.013255055, 0.033086619, 0.074158811, 
     0.004967497, 0.01247423, 0.043201417, 0.011470285, 0.038447751, 
     0.018825124, 0.027701807, 0.063373762, 0.005374513, 0.048876252, 
     0.009959848, 0.004434078, 0.004176856, 0.015288913, 0.060226053, 
     0.05128922, 0.006557554, 0.017460326, 0.007684784, 0.002107577, 
     0.005773192, 0.076186393, 0.037631043, 0.052159393, 0.012179365, 
     0.047199766, 0.022458838, 0.030261613, 0.00626629, 0.028664896, 
     0.02285845, 0.02801855, 0.017681676, 0.040563592, 0.029791175, 
     0.034778056, 0.019318473, 0.011847912, 0.009614177, 0.064027542, 
     0.035334149, 0.041638955, 0.056015014, 0.03304865, 0.017660205, 
     0.030187166, 0.057919531, 0.029990489, 0.000112884, 0.000920886, 
     0.001081748, 0.000195159, 0.001678445, 0.000171612, 0.000191702, 
     0.000560035, 0.000384056, 0.000454783, 0.000723385, 0.000203897, 
     0.000973337, 0.000822171, 0.000620526, 0.000260769, 0.000214607, 
     0.002077443, 0.00065843, 0.000403672, 0.000378651, 0.000409306, 
     0.001722587, 0.000213785, 0.000176643, 0.002022878, 0.001886929, 
     0.053029236, 0.022594965, 0.011967636, 0.026851113, 0.03773798, 
     0.031356268, 0.10410326, 0.063265216, 0.018028454, 0.116038001, 
     0.00572817, 0.053635968, 0.059126941, 0.011835241, 0.004639624, 
     0.014302911, 0.082948853, 0.015202238, 0.021295431, 0.043342, 
     0.008153675, 0.015613747, 0.043289609, 0.048834321, 0.019144763, 
     0.059809871, 0.006990685, 0.04082966, 0.02986135, 0.061405171, 
     0.006142619, 0.009767602, 0.035427993, 0.03729329, 0.01309739, 
     0.00221718, 0.040211393, 0.006303841, 0.030146612, 0.032033879, 
     0.024590398, 0.077991721, 0.017215666, 0.014731147, 0.04802582, 
     0.03168714, 0.03244771, 0.032278613, 0.017301885, 0.013450667, 
     0.040207755, 0.042669615, 0.03456749, 0.034631319, 1.93e-05, 
     4.72e-06, 5.41e-05, 0, 1.91e-05, 9.33e-07, 5.98e-06, 0, 1.05e-06, 
     4.1e-07, 7.72e-05, 4.07e-07, 0.000585154, 0.000246992, 7.86e-06, 
     3.13e-06, 2.14e-06, 7.56e-06, 9.29e-05, 0.000116024, 5.51e-05, 
     7.79e-06, 6.65e-06, 2.06e-06, 0.000104342, 4.16e-06, 1.27e-05, 
     0.000197502, 0.00015135, 0.000107306, 6.54e-05, 0.000225564, 
     0.000142631, 0.000168873, 3.5e-05, 0.000365242, 0.000174254, 
     0.000339327, 8.7e-05, 0.000136679, 0.000156634, 0.000224181, 
     0.000205305, 8.87e-05, 0.000305774, 0.000133615, 0.00015118, 
     0.000107229, 0.000162579, 0.000152249, 6.88e-05, 0.000113864, 
     0.000249258, 0.00024256, 0.00079296, 0.007640951, 0.004937327, 
     0.000422361, 0.000953513, 0.000951187, 0.000671306, 0.001106406, 
     0.002606568, 0.003006867, 0.001911646, 0.00135411, 0.012461738, 
     0.000434917, 0.00237646, 0.007857561, 0.000436889, 0.00048816, 
     0.000348146, 0.000931449, 0.000323974, 0.004945321, 0.000693845, 
     0.000479572, 0.000843415, 0.001419675, 0.001547478, 8.16e-05, 
     6.63e-05, 0.000101583, 3.08e-05, 0.000147039, 5.13e-05, 0.000109479, 
     2.39e-05, 0.000225475, 4.28e-05, 0.000230785, 2.1e-05, 0.0001356, 
     0.000124173, 0.000245128, 0.000275446, 3.18e-05, 0.00017516, 
     0.000180192, 0.000246669, 0.000378708, 4.35e-05, 0.000267824, 
     7.2e-05, 7.65e-05, 8.79e-05, 0.000130026, 0.000111462, 3.17e-05, 
     0.000200096, 3.12e-06, 8.75e-05, 3.11e-06, 6.89e-06, 0.000165936, 
     5.98e-05, 0.000201355, 5.92e-06, 2.57e-05, 2.53e-05, 3.27e-05, 
     0.000137446, 0.000134402, 5.86e-07, 3.9e-05, 0.018886909, 0.050343466, 
     4.15e-05, 1.67e-05, 0.000172614, 4.95e-05, 1.27e-05, 9.85e-05, 
     4.28e-05, 0.002708402, 0.003215586, 0.00457116, 0.001713549, 
     0.024353184, 0.006660748, 0.003198887, 0.003094386, 0.004789163, 
     0.002816955, 0.021587313, 0.002084725, 0.00378062, 0.021751495, 
     0.009097143, 0.012216225, 0.001125765, 0.013043534, 0.005514773, 
     0.008323962, 0.026898764, 0.002149135, 0.008021623, 0.006673567, 
     0.005391139, 0.018578559, 0.013786297, 0.00080595, 0.001289505, 
     0.002451416, 0.000234107, 0.001694733, 0.000288175, 0.002357478, 
     0.000856129, 0.00159752, 0.000117538, 0.000166581, 0.000367288, 
     0.001039841, 0.001779528, 0.000438092, 0.001012515, 0.000529936, 
     0.003193086, 0.002562702, 0.00277401, 0.003013136, 0.001349197, 
     0.001646296, 0.001114222, 0.001207882, 0.002804949, 0.000366419
     ), .Dim = c(27L, 13L), .Dimnames = list(c("2", "0", "0", "0", 
    "1", "0", "0", "1", "1", "1", "2", "0", "0", "1", "2", "2", "1", 
    "2", "2", "2", "2", "1", "1", "2", "2", "0", "0"), c("Gene1", 
    "Gene2", "Gene3", "Gene4", "Gene5", "Gene6", "Gene7", "Gene8", 
    "Gene9", "Gene10", "Gene11", "Gene12", "Gene13")))
jaria20
  • Does this answer your question? [how to use the Box-Cox power transformation in R](https://stackoverflow.com/questions/33999512/how-to-use-the-box-cox-power-transformation-in-r) – emilliman5 Feb 25 '20 at 22:20
  • @emilliman5 I have seen this, thanks for providing the link. Unfortunately, I need to convert my matrix of data into a linear model object (I might need to use model.matrix) in order for the Box-Cox function in R to work. I am not sure how to do this currently. – jaria20 Feb 25 '20 at 23:01
  • What the answers in that post are trying to say is that the box-cox transform is meant for transforming the response variable of your model, after you realize that your data violates the assumptions of linear models beyond your comfort level. However you can transform dependent and independent variables by fitting `boxcox(lm(variable ~ 1, data=df))` for each column of your matrix, capturing the optimal lambda and then applying the box-cox transform (this is in a comment of the most upvoted answer) – emilliman5 Feb 25 '20 at 23:11
  • @emilliman5 The only trouble is that whilst the above just shows 13 genes, I have 790 columns in my full data set, and about 192 rows (corresponding to patients). It would be time-consuming to find the optimal lambda for each of these and then, I'm assuming, take the average. – jaria20 Feb 25 '20 at 23:53
  • Hey @jaria20, in your data frame, every column has values on a different scale, which can cause problems depending on what you want to achieve with the PCA. If the PCA has something to do with the disease, you should fit it to the disease. And you can get a common lambda by simply doing boxcox(as.matrix(x[,-ncol(x)]) ~ x[,ncol(x)]), assuming x is your data frame and the last column is your phenotype – StupidWolf Mar 08 '20 at 23:40
  • Box-Cox has its limitations; for example, it requires strictly positive values, and I see some zeros in your data frame. So my suggestion is to really think about what you are transforming and what you want to see, instead of thumping these models in... – StupidWolf Mar 08 '20 at 23:41
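
The per-column approach described in the comments (estimate a lambda for each column from an intercept-only model, then apply the transform) could be sketched roughly as below. This is an illustration under stated assumptions, not a tested recipe for the real data: `boxcox_column` and the toy matrix `expr` are hypothetical names, the lambda grid is an arbitrary choice, and columns containing zeros or negative values are simply left untransformed because Box-Cox is undefined for them.

```r
library(MASS)  # provides boxcox(); MASS ships with standard R installations

# Hypothetical helper: estimate lambda for one column and apply Box-Cox.
boxcox_column <- function(y, lambdas = seq(-2, 2, by = 0.05)) {
  if (any(y <= 0)) {
    # Box-Cox needs strictly positive values; return the column unchanged
    return(list(lambda = NA_real_, values = y))
  }
  # Profile log-likelihood over the lambda grid for an intercept-only model
  prof <- boxcox(y ~ 1, lambda = lambdas, plotit = FALSE)
  lam <- prof$x[which.max(prof$y)]
  # Box-Cox transform: log(y) when lambda is 0, (y^lambda - 1) / lambda otherwise
  values <- if (abs(lam) < 1e-8) log(y) else (y^lam - 1) / lam
  list(lambda = lam, values = values)
}

# Toy stand-in for the expression matrix (assumed data, not the real set)
set.seed(1)
expr <- cbind(GeneA = rlnorm(30, meanlog = -6),
              GeneB = rlnorm(30, meanlog = -3),
              GeneC = c(0, rlnorm(29, meanlog = -8)))  # contains a zero

fits    <- lapply(as.data.frame(expr), boxcox_column)
lambdas <- sapply(fits, `[[`, "lambda")   # one lambda per gene (NA if skipped)
expr_bc <- sapply(fits, `[[`, "values")   # transformed matrix, same shape
print(lambdas)
```

Because each column is handled independently, the same `lapply` call scales to several hundred genes without any manual work per gene.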

1 Answer

The caret package might make this a lot easier.

Creating your data structure

data <- structure(c(5.54e-05, 5.58e-06, 9.74e-05, 1.33e-06, 1.29e-05, 
            7.22e-06, 0.000215899, 3.6e-06, 0.000146724, 1.53e-05, 0.000913187, 
            1.9e-06, 0.007421464, 0.000648006, 5.1e-06, 6.15e-06, 4.73e-06, 
            0.000119899, 0.000884487, 0.000850632, 0.000236607, 7.36e-06, 
            8.48e-06, 2.63e-05, 0.001368493, 1.12e-05, 0.000177568, 0.006338532, 
            0.006162866, 0.040695132, 0.013255055, 0.033086619, 0.074158811, 
            0.004967497, 0.01247423, 0.043201417, 0.011470285, 0.038447751, 
            0.018825124, 0.027701807, 0.063373762, 0.005374513, 0.048876252, 
            0.009959848, 0.004434078, 0.004176856, 0.015288913, 0.060226053, 
            0.05128922, 0.006557554, 0.017460326, 0.007684784, 0.002107577, 
            0.005773192, 0.076186393, 0.037631043, 0.052159393, 0.012179365, 
            0.047199766, 0.022458838, 0.030261613, 0.00626629, 0.028664896, 
            0.02285845, 0.02801855, 0.017681676, 0.040563592, 0.029791175, 
            0.034778056, 0.019318473, 0.011847912, 0.009614177, 0.064027542, 
            0.035334149, 0.041638955, 0.056015014, 0.03304865, 0.017660205, 
            0.030187166, 0.057919531, 0.029990489, 0.000112884, 0.000920886, 
            0.001081748, 0.000195159, 0.001678445, 0.000171612, 0.000191702, 
            0.000560035, 0.000384056, 0.000454783, 0.000723385, 0.000203897, 
            0.000973337, 0.000822171, 0.000620526, 0.000260769, 0.000214607, 
            0.002077443, 0.00065843, 0.000403672, 0.000378651, 0.000409306, 
            0.001722587, 0.000213785, 0.000176643, 0.002022878, 0.001886929, 
            0.053029236, 0.022594965, 0.011967636, 0.026851113, 0.03773798, 
            0.031356268, 0.10410326, 0.063265216, 0.018028454, 0.116038001, 
            0.00572817, 0.053635968, 0.059126941, 0.011835241, 0.004639624, 
            0.014302911, 0.082948853, 0.015202238, 0.021295431, 0.043342, 
            0.008153675, 0.015613747, 0.043289609, 0.048834321, 0.019144763, 
            0.059809871, 0.006990685, 0.04082966, 0.02986135, 0.061405171, 
            0.006142619, 0.009767602, 0.035427993, 0.03729329, 0.01309739, 
            0.00221718, 0.040211393, 0.006303841, 0.030146612, 0.032033879, 
            0.024590398, 0.077991721, 0.017215666, 0.014731147, 0.04802582, 
            0.03168714, 0.03244771, 0.032278613, 0.017301885, 0.013450667, 
            0.040207755, 0.042669615, 0.03456749, 0.034631319, 1.93e-05, 
            4.72e-06, 5.41e-05, 0, 1.91e-05, 9.33e-07, 5.98e-06, 0, 1.05e-06, 
            4.1e-07, 7.72e-05, 4.07e-07, 0.000585154, 0.000246992, 7.86e-06, 
            3.13e-06, 2.14e-06, 7.56e-06, 9.29e-05, 0.000116024, 5.51e-05, 
            7.79e-06, 6.65e-06, 2.06e-06, 0.000104342, 4.16e-06, 1.27e-05, 
            0.000197502, 0.00015135, 0.000107306, 6.54e-05, 0.000225564, 
            0.000142631, 0.000168873, 3.5e-05, 0.000365242, 0.000174254, 
            0.000339327, 8.7e-05, 0.000136679, 0.000156634, 0.000224181, 
            0.000205305, 8.87e-05, 0.000305774, 0.000133615, 0.00015118, 
            0.000107229, 0.000162579, 0.000152249, 6.88e-05, 0.000113864, 
            0.000249258, 0.00024256, 0.00079296, 0.007640951, 0.004937327, 
            0.000422361, 0.000953513, 0.000951187, 0.000671306, 0.001106406, 
            0.002606568, 0.003006867, 0.001911646, 0.00135411, 0.012461738, 
            0.000434917, 0.00237646, 0.007857561, 0.000436889, 0.00048816, 
            0.000348146, 0.000931449, 0.000323974, 0.004945321, 0.000693845, 
            0.000479572, 0.000843415, 0.001419675, 0.001547478, 8.16e-05, 
            6.63e-05, 0.000101583, 3.08e-05, 0.000147039, 5.13e-05, 0.000109479, 
            2.39e-05, 0.000225475, 4.28e-05, 0.000230785, 2.1e-05, 0.0001356, 
            0.000124173, 0.000245128, 0.000275446, 3.18e-05, 0.00017516, 
            0.000180192, 0.000246669, 0.000378708, 4.35e-05, 0.000267824, 
            7.2e-05, 7.65e-05, 8.79e-05, 0.000130026, 0.000111462, 3.17e-05, 
            0.000200096, 3.12e-06, 8.75e-05, 3.11e-06, 6.89e-06, 0.000165936, 
            5.98e-05, 0.000201355, 5.92e-06, 2.57e-05, 2.53e-05, 3.27e-05, 
            0.000137446, 0.000134402, 5.86e-07, 3.9e-05, 0.018886909, 0.050343466, 
            4.15e-05, 1.67e-05, 0.000172614, 4.95e-05, 1.27e-05, 9.85e-05, 
            4.28e-05, 0.002708402, 0.003215586, 0.00457116, 0.001713549, 
            0.024353184, 0.006660748, 0.003198887, 0.003094386, 0.004789163, 
            0.002816955, 0.021587313, 0.002084725, 0.00378062, 0.021751495, 
            0.009097143, 0.012216225, 0.001125765, 0.013043534, 0.005514773, 
            0.008323962, 0.026898764, 0.002149135, 0.008021623, 0.006673567, 
            0.005391139, 0.018578559, 0.013786297, 0.00080595, 0.001289505, 
            0.002451416, 0.000234107, 0.001694733, 0.000288175, 0.002357478, 
            0.000856129, 0.00159752, 0.000117538, 0.000166581, 0.000367288, 
            0.001039841, 0.001779528, 0.000438092, 0.001012515, 0.000529936, 
            0.003193086, 0.002562702, 0.00277401, 0.003013136, 0.001349197, 
            0.001646296, 0.001114222, 0.001207882, 0.002804949, 0.000366419
), .Dim = c(27L, 13L), .Dimnames = list(c("2", "0", "0", "0", 
    "1", "0", "0", "1", "1", "1", "2", "0", "0", "1", "2", "2", "1", 
    "2", "2", "2", "2", "1", "1", "2", "2", "0", "0"), c("Gene1", 
    "Gene2", "Gene3", "Gene4", "Gene5", "Gene6", "Gene7", "Gene8", 
    "Gene9", "Gene10", "Gene11", "Gene12", "Gene13")))

And transform your data.

library(caret)

# Estimate a Box-Cox transformation (one lambda per column)
preProcessValues <- preProcess(data, method = "BoxCox")

# Apply the estimated transformation to the data
dataBC <- predict(preProcessValues, data)
stefanH
  • The above doesn't normalise the data (as confirmed through histograms and the Anderson-Darling test). – jaria20 Mar 07 '20 at 16:46
  • Since this is just expression data, you should go with the industry standard of a log2 transformation and then continue with the assumption of normality. It won't be perfect, but that is what happens with real-life data. – stefanH Mar 09 '20 at 23:02
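
The log2 route suggested in the last comment, followed by the PCA the question is ultimately after, might look something like the sketch below. The data are made up, and the pseudocount (half the smallest positive value, added so that zeros do not become `-Inf`) is an assumption of this example rather than something taken from the thread.

```r
# Toy stand-in for the expression matrix (assumed data): 27 samples, 5 genes
set.seed(7)
expr <- matrix(rlnorm(27 * 5, meanlog = -6, sdlog = 1.5), nrow = 27,
               dimnames = list(NULL, paste0("Gene", 1:5)))
expr[1, 3] <- 0  # the real data contain zeros as well

# Pseudocount (assumption): half the smallest positive value, so log2(0) is avoided
pseudo   <- min(expr[expr > 0]) / 2
log_expr <- log2(expr + pseudo)

# Centre and scale each gene so that no single gene dominates the components
pca <- prcomp(log_expr, center = TRUE, scale. = TRUE)
print(summary(pca))
head(pca$x[, 1:2])  # sample scores on the first two principal components
```

PCA itself does not require normally distributed inputs; the log transform mainly reduces skew, and the scaling step addresses the different magnitudes of the genes.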