I would like to compute a Principal Variance Component analysis which combines a PCA with a mixed effect linear model to identify batch effects. (https://www.niehs.nih.gov/research/resources/software/biostatistics/pvca/index.cfm) I have checked the R script of that website (PCVA is also available as a R package with almost the same code) and can follow the steps until the linear mixed effect model is fitted. After calculating the PCA, the n components are selected and for each a model is fitted separately. I attach the data for the first comp.
The data for the first component look like this (column: pc_data_matrix = eigenvectors):
import pandas as pd
data = pd.DataFrame(
[x.split("\t") for x in """columnNames treat time batch sample compIdx pc_data_matrix
0 _23_1CON 0 0 1 1 0 0.296379
1 _23_2CON 0 7 1 2 0 -0.310216
2 _23_3NO0 1 0 1 3 0 0.386527
3 _23_4NO7 1 7 1 4 0 -0.147179
4 _26_1CON 0 0 2 5 0 0.253927
5 _26_2CON 0 7 2 6 0 -0.299453
6 _26_3NO0 1 0 2 7 0 0.380901
7 _26_4NO7 1 7 2 8 0 0.065552
8 CONTROL7 0 7 3 9 0 -0.445159
9 CONTROLZ 0 0 3 10 0 0.188493
10 NO_7_5 1 7 3 11 0 -0.290607
11 NO_ZERO 1 0 3 12 0 0.152242
""".split("\n")])
print(data)
Then in R the model fit is done as such:
randomEffects <- lmer(pc_data_matrix ~ (1|Time) + (1|Treatment) + (1|Batch) + (1|Time:Treatment) + (1|Time:Batch) + (1|Treatment:Batch), Data, REML = TRUE, verbose = FALSE, na.action = na.omit)))
print(summary(randomEffects))
printing out
Linear mixed model fit by REML ['lmerMod']
Formula: pc_data_matrix ~ (1 | Time) + (1 | Treatment) + (1 | Batch) + (1 | Time:Treatment) + (1 | Time:Batch) + (1 | Treatment:Batch)
Data: Data
REML criterion at convergence: -12.1
Scaled residuals:
Min 1Q Median 3Q Max
-0.6753 -0.3790 -0.1755 0.2801 1.1945
Random effects:
Groups Name Variance Std.Dev.
Treatment:Batch (Intercept) 0.003572 0.05977
Time:Batch (Intercept) 0.001616 0.04020
Time:Treatment (Intercept) 0.006390 0.07994
Batch (Intercept) 0.007885 0.08880
Treatment (Intercept) 0.005668 0.07528
Time (Intercept) 0.128193 0.35804
Residual 0.001809 0.04253
Number of obs: 12, groups: Treatment:Batch, 6; Time:Batch, 6; Time:Treatment, 4; Batch, 3; Treatment, 2; Time, 2
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.01928 0.26865 0.072
Then in python, I am currently doing this in statsmodels but I know that it is not correct, since I am missing the interactions (treatment:time) (time:batch) (treatment:batch).
import statsmodels.api as sm
import statsmodels.formula.api as smf
#define variance of each random effect as variance component
vcf = {"treat": "treat", "time": "time", "batch": "batch"}
md = smf.mixedlm("pc ~ 1",
data = data,
vc_formula=vcf,
groups=data["compIdx"]) #complete data is a single group
mdf = md.fit()
print(mdf.summary())
Thanks