I am trying to represent to genetic variation data in a database for my institution. We have discovered genetic variants, which have associated with them reference alleles, mutant alleles, chromosome, position, name, possible effect, gene, position in gene etc.
Though it's not essential for the question context is sometimes useful, I'll be building this with django, and the db backend will be either PostgreSQL or MySQL (suggestions about choice here also welcome, though not main focus of question)
To represent this information properly I have set about designing a relational database. I'm running into problems defining the most efficient structure however. I could represent it as follows:
Variants belong to genes in a many to one relationship. i.e. one gene can have many variants but one variant cannot usually span more than one gene. (However sometimes this can happen with large CNVs or where two genes overlap, so perhaps a many to many relationship???)
Variants are also discovered in individuals. Individuals have genotypes, which is just two copies of the various combinations of alleles of the variant. I'm not sure about the best for this at all, perhaps a joint primary key of variant and individual and the record the genotype as the number of mutant alleles (0,1,2 for example)???
So my question is (sorry for all the preamble, and bio talk) what way do we thing is the best, or a better design for these three things: Variants - the main thing I want to store information about, and Genes, and Individuals - both essential for any downstream analyses.
Any advice is much appreciated. Again sorry for the somewhat ephemeral nature of the question.