
I applied PCA to my biomedical data (31 genes as rows and 1904 patients as columns) and kept 9 components. The result is two sub-matrices, one of which is a 9 by 1904 matrix (I call it matrix A).

In matrix A, the rows are the 9 components, the columns are the 1904 patients, and the entries are continuous values. I want to find which of the 9 components have exactly one patient, out of the 1904, who accounts for >10% of that component's variance (I would treat that patient as an outlier in the component). Finally, I plan to remove the identified components.

For example, I compute the variance contributed by each patient within each component. If Component 3 has one patient out of the 1904 who accounts for >10% of the variance, I conclude that this component includes an outlier and remove Component 3 from my components.

I am stuck doing this in R. Any ideas are appreciated! Thanks in advance.

UPDATE: Here is my attempt.

The dummy data df has 10 patients as rows and 3 components as columns:

df=structure(c(-0.17134779227884, -0.0962044733094678, 0.0683562125182872, 
-0.243465849606547, 0.333327443120999, -0.124616446710062, 0.213423949350221, 
-0.086118378436248, 0.209279578622201, 0.425834454279314, 0.16728832317405, 
0.952243725136014, -0.101114176191555, 0.187773366984759, 0.207570066964501, 
-0.117920965767025, 0.939250613987857, -0.00465861655152568, 
-0.288348010784738, 0.0469224124443503, -0.165934907003698, -0.18339647933408, 
-0.098550778268536, -0.094031840482207, 0.0759839405752319, -0.141524045263773, 
-0.0665849661695848, -0.442355221875939, -0.156962689636778, 
-0.142727471861712), .Dim = c(10L, 3L), .Dimnames = list(c("MB-0362", 
"MB-0346", "MB-0386", "MB-0574", "MB-0503", "MB-0641", "MB-0201", 
"MB-0218", "MB-0316", "MB-0189"), c("comp 1", "comp 2", "comp 3"
)))

I try to compute the variance that each patient contributes within each of the three components:

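library(dplyr)
library(tidyr)

# Reshape to long format: one row per (patient, component) pair
df1 <- as.data.frame(df)
df1$Patients <- rownames(df)
df1 <- df1 %>%
  pivot_longer(-Patients, names_to = "Component", values_to = "Weight") %>%
  group_by(Component) %>%
  mutate(var = var(Weight))  # variance of each component across patients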

Now I need to compute the percentage of variance that each patient contributes to each component. This is where I am stuck :(

Huy Nguyen
  • Do you need `apply(df, 2, var)`? – jay.sf Feb 07 '21 at 13:02
  • Actually, I looked at this solution before posting my question. If I run `df1 = apply(df, 2, var)`, I only get the variance of each component across all patients. That does not tell me which component has a patient accounting for >10% of its variance, so it does not identify which component contains an outlier. – Huy Nguyen Feb 07 '21 at 13:11

1 Answer


Wow, perhaps I solved my problem on my own. Here is my solution:

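library(dplyr)
library(tidyr)

df1 <- as.data.frame(df)
df1$Patients <- rownames(df)
df1 <- df1 %>%
  pivot_longer(-Patients, names_to = "Component", values_to = "Weight") %>%
  group_by(Component) %>%
  mutate(var = var(Weight)) %>%            # variance of each component across patients
  group_by(Patients) %>%
  # component variance as a % of the summed component variances (same value for every patient)
  mutate(percent = var / sum(var) * 100)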
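
One possible reading of "the percentage of variance a patient accounts for" is that patient's squared deviation from the component mean, divided by the sum of squared deviations in that component. The base R sketch below works under that assumption. It uses a toy matrix (not the question's data), and the names `scores`, `pct`, `threshold`, `flagged` and `kept` are all hypothetical. It flags components where exactly one patient exceeds the threshold and drops them.

```r
# Toy score matrix: 20 patients (rows) x 2 components (columns).
# Illustrative values only; "comp 2" has one extreme patient (P1).
n <- 20
scores <- cbind(
  "comp 1" = rep(c(-1, 1), times = n / 2),
  "comp 2" = c(19, rep(-1, n - 1))
)
rownames(scores) <- paste0("P", seq_len(n))

# Assumed definition: a patient's share of a component's variance is the
# squared deviation from the component mean over the column total.
centered <- sweep(scores, 2, colMeans(scores), "-")
sq_dev   <- centered^2
pct      <- sweep(sq_dev, 2, colSums(sq_dev), "/") * 100  # each column sums to 100

# Flag components in which exactly one patient exceeds the threshold (in %).
threshold <- 10
n_over    <- colSums(pct > threshold)
flagged   <- names(n_over)[n_over == 1]

# Drop the flagged components, keeping the result as a matrix.
kept <- scores[, !(colnames(scores) %in% flagged), drop = FALSE]

print(flagged)
print(colnames(kept))
```

With patients in rows and components in columns, as in the dummy `df` above, the same steps should apply by replacing `scores` with `df`. For the 9 by 1904 matrix A, transpose it first with `t(A)`.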
Huy Nguyen