Best option for missing value imputation for prcomp()

Question

I have a data set of genotypes for approximately 200 individual genomes (columns) for nearly 1,000,000 loci (rows). Due to poor sequencing data, most rows contain 1-2 missing genotypes.

If I use

df_new = na.omit(df)

my new data frame contains only a few thousand rows, so I lose far more data than I would by imputing one or two missing values per row. I have been looking online for how to combine an imputation step with the na.action argument of prcomp(), but cannot find an example. I would like to start with the simplest approach, e.g. replacing each NA with a median value or something similar.

Could someone please direct me to an example of how to do this in the context of prcomp?

You need to either use a PCA method [which accounts for missingness](https://pubmed.ncbi.nlm.nih.gov/33459779/), or do actual [genotype imputation](https://en.wikipedia.org/wiki/Imputation_(genetics)). Doing R's inbuilt imputation will probably cause a mess. — user438383, Dec 16 '21 at 19:18
Could you please provide a little more detail of why using a naive imputation approach would create problems? My thinking was that since the NA values are rather sparse (on average 1 or 2 per row out of 200, and not in a systematically biased way) that this shouldn't be a major problem. Additionally, could you please direct me to a reference on R's imputation and how to interface this with prcomp? — Max, Dec 16 '21 at 19:20
Tbh, not sure I can say *why* exactly, just that it's unlikely to be a good idea, as there is a reason why specific genotype imputation software is used over naive imputation. What kind of format is your data in - I would strongly advise to use something like plink2, which corrects for missingness. — user438383, Dec 16 '21 at 21:49

score 0 · Accepted Answer · edited Dec 17 '21 at 00:15

Now that I understand your question, see the sample below:

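library(plyr)

# Replace each NA in missing_value_column with the median of the
# non-missing values in the same group (groups are defined by my_groups).
# Start from the original df: after na.omit(), df_new has no NAs left to fill.
df_imputed <- ddply(df, ~ my_groups, transform,
                    missing_value_column = ifelse(is.na(missing_value_column),
                                                  median(missing_value_column, na.rm = TRUE),
                                                  missing_value_column))

# missing_value_column is a placeholder for the column that contains the missing values
# my_groups is a placeholder for the grouping column, e.g. the first column of df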

I hope this works.
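
For the row-wise variant described in the question (fill each NA with the median of the non-missing genotypes at that locus, then run prcomp()), a minimal base R sketch could look like the following. The matrix `geno`, the helper `impute_row_median`, and the simulated 0/1/2 genotype coding are assumptions made for illustration, not part of the original data. Note that a median over an even number of values can be fractional (e.g. 0.5), and a locus with no observed genotypes at all would stay NA.

```r
# Toy genotype matrix: rows are loci, columns are individuals (0/1/2 allele counts).
# The data are simulated purely for illustration.
set.seed(1)
geno <- matrix(sample(0:2, 20 * 8, replace = TRUE), nrow = 20, ncol = 8)
# Put one NA into each of 10 randomly chosen rows.
geno[cbind(sample(20, 10), sample(8, 10, replace = TRUE))] <- NA

# Hypothetical helper: replace each NA with the median of the
# non-missing values in the same row (locus).
impute_row_median <- function(m) {
  row_med <- apply(m, 1, median, na.rm = TRUE)
  idx <- which(is.na(m), arr.ind = TRUE)   # (row, col) positions of the NAs
  m[idx] <- row_med[idx[, "row"]]
  m
}

geno_imputed <- impute_row_median(geno)

# prcomp() treats rows as observations, so transpose to put individuals in rows.
# Loci with zero variance are dropped first because they cannot be scaled.
keep <- apply(geno_imputed, 1, var) > 0
pca <- prcomp(t(geno_imputed[keep, ]), center = TRUE, scale. = TRUE)
summary(pca)
```

With around 1,000,000 loci, apply() may be slow; matrixStats::rowMedians(m, na.rm = TRUE) is one faster way to get the row medians if that package is installed. As the comments on the question point out, this kind of naive imputation may still distort the PCA compared with a method designed for genotype data.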

answered Dec 16 '21 at 21:14 by Yomi.blaze93 · edited Dec 17 '21 at 00:15 by Jeremy Caney

How could this be modified to replace each NA not with 0 but with the median of the non-NA elements in its row? – Max Dec 16 '21 at 21:29
I have only just understood your question. Please try the newly updated answer, thanks. – Yomi.blaze93 Dec 16 '21 at 21:43
