1

I would like to compute a Principal Variance Component analysis which combines a PCA with a mixed effect linear model to identify batch effects. (https://www.niehs.nih.gov/research/resources/software/biostatistics/pvca/index.cfm) I have checked the R script of that website (PCVA is also available as a R package with almost the same code) and can follow the steps until the linear mixed effect model is fitted. After calculating the PCA, the n components are selected and for each a model is fitted separately. I attach the data for the first comp.

The data for the first component look like this (column: pc_data_matrix = eigenvectors):

import pandas as pd 
data = pd.DataFrame(
    [x.split("\t") for x in """columnNames  treat  time  batch  sample  compIdx        pc_data_matrix
0     _23_1CON      0     0      1       1        0  0.296379
1     _23_2CON      0     7      1       2        0 -0.310216
2     _23_3NO0      1     0      1       3        0  0.386527
3     _23_4NO7      1     7      1       4        0 -0.147179
4     _26_1CON      0     0      2       5        0  0.253927
5     _26_2CON      0     7      2       6        0 -0.299453
6     _26_3NO0      1     0      2       7        0  0.380901
7     _26_4NO7      1     7      2       8        0  0.065552
8     CONTROL7      0     7      3       9        0 -0.445159
9     CONTROLZ      0     0      3      10        0  0.188493
10      NO_7_5      1     7      3      11        0 -0.290607
11     NO_ZERO      1     0      3      12        0  0.152242
""".split("\n")])
print(data)

Then in R the model fit is done as such:

    randomEffects <- lmer(pc_data_matrix ~ (1|Time) + (1|Treatment) + (1|Batch) + (1|Time:Treatment) + (1|Time:Batch) + (1|Treatment:Batch), Data, REML = TRUE, verbose = FALSE, na.action = na.omit)))
  print(summary(randomEffects))

printing out

    Linear mixed model fit by REML ['lmerMod']
Formula: pc_data_matrix ~ (1 | Time) + (1 | Treatment) + (1 | Batch) +      (1 | Time:Treatment) + (1 | Time:Batch) + (1 | Treatment:Batch)
   Data: Data

REML criterion at convergence: -12.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-0.6753 -0.3790 -0.1755  0.2801  1.1945 

Random effects:
 Groups          Name        Variance Std.Dev.
 Treatment:Batch (Intercept) 0.003572 0.05977 
 Time:Batch      (Intercept) 0.001616 0.04020 
 Time:Treatment  (Intercept) 0.006390 0.07994 
 Batch           (Intercept) 0.007885 0.08880 
 Treatment       (Intercept) 0.005668 0.07528 
 Time            (Intercept) 0.128193 0.35804 
 Residual                    0.001809 0.04253 
Number of obs: 12, groups:  Treatment:Batch, 6; Time:Batch, 6; Time:Treatment, 4; Batch, 3; Treatment, 2; Time, 2

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.01928    0.26865   0.072

Then in python, I am currently doing this in statsmodels but I know that it is not correct, since I am missing the interactions (treatment:time) (time:batch) (treatment:batch).

import statsmodels.api as sm
import statsmodels.formula.api as smf
#define variance of each random effect as variance component
vcf = {"treat": "treat", "time": "time", "batch": "batch"}
md = smf.mixedlm("pc ~ 1", 
          data = data, 
          vc_formula=vcf,
          groups=data["compIdx"]) #complete data is a single group
            
mdf = md.fit()
print(mdf.summary())

Thanks

xPae
  • 11
  • 1

0 Answers0