simulation of genetic data in R

Question

I am looking for the best way or best package available for simulating a genetic association between a specific SNP and a quantitative phenotype, with the simulated data being the most similar to my real data, except that I know the causal variant. All of the packages I saw in R seem to be specialised in pedigree data or in population data where coalescence and other evolutionary factors are specified, but I don't have any experience in population genetics and I only want to simulate the simple case of European population with a similar characteristics to my real data (i.e. normal distribution for the trait and an additive effect for the genotype, similar allele frequancies…) So for example if my genetic data is X and my quantitative variable is Y:

X <-rbinom(1000,2,0.4)
Y <- rnorm(1000,1,0.4)

I am looking for something in R similar to the function in Plink where one needs to specify a range of allele frequencies, a range for the phenotype, and specify a specific variant which should result associated with the genotype (this is important because I need to repeat these associations in different datasets with the causal variant being the same)

Can someone please help me?

Also have a look at this question: http://stats.stackexchange.com/questions/62208/simulate-data-for-power-analysis-of-logistic-regresion-model-include-covarianc — Rasmus Larsen, Jun 25 '13 at 07:26

score 1 · Answer 1 · answered Sep 03 '12 at 07:20

1

If the genotype changes only the mean of the phenotype, this is very simple.

phenotype.means <- c(5, 15, 20)  # phenotype means for genotypes 0, 1, and 2
phenotype.sd <- 5
X <- rbinom(1000,2,0.4)
Y <- rnorm(1000, phenotype.means[X], phenotype.sd)

This will lead to Y containing 1000 normally distributed variables, where those with homozygous recessive genotypes (aa, or 0) will have a mean of 5, those with heterozyous genotypes (Aa, or 1) will have a mean of 15, and those with homozygous dominant genotypes (AA, or 2) will have a mean of 20.

If you want a more traditional 2 setting phenotype (AA/Aa versus aa), just set phenotype.means to something like c(5, 20, 20).

answered Sep 03 '12 at 07:20

David Robinson

77,383
16
167
187

Thank you David for your suggestion. I might have oversimplified my real data, I am looking for a way to model my real data closely, so I would like to be able to reproduce the same allele frequencies as are in my data, and I really need to be able to control the specific variant that results associated because I will then need to associate this same variant using another dataset… Is this confusing? thank you again for your help – user1642995 Sep 03 '12 at 07:30
You're welcome. If (and only if) this fully answers your question, you can [accept it](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work). – David Robinson Sep 03 '12 at 07:31
Hi David, I posted a similar question in https://stats.stackexchange.com/questions/463952/simulate-normal-distributed-real-data-phenotypes-from-genotypic-data-in-r. Maybe you could have an idea of this. Thank you in advance. – marb_021 May 01 '20 at 18:44

simulation of genetic data in R

1 Answers1