Relational Database for genetic variation

Question

I am trying to represent genetic variation data in a database for my institution. We have discovered genetic variants, each of which has an associated reference allele, mutant allele, chromosome, position, name, possible effect, gene, position in gene, etc.

Though it's not essential for the question, context is sometimes useful: I'll be building this with Django, and the DB backend will be either PostgreSQL or MySQL (suggestions about the choice here are also welcome, though not the main focus of the question).

To represent this information properly I have set about designing a relational database. However, I'm running into problems defining the most efficient structure. I could represent it as follows:

Variants belong to genes in a many-to-one relationship, i.e. one gene can have many variants but one variant cannot usually span more than one gene. (However, this can sometimes happen with large CNVs or where two genes overlap, so perhaps a many-to-many relationship?)
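To make the many-to-many option concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL or MySQL. All table names, column names, and sample rows (GENE_A, snv_1, cnv_1, and so on) are hypothetical and chosen only for illustration. A link table lets an ordinary variant point at one gene while a large CNV points at several.

```python
import sqlite3

# In-memory SQLite as a stand-in backend; names below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE gene (
    id     INTEGER PRIMARY KEY,
    symbol TEXT NOT NULL UNIQUE
);
CREATE TABLE variant (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE,
    chromosome TEXT NOT NULL,
    position   INTEGER NOT NULL,
    ref_allele TEXT NOT NULL,
    alt_allele TEXT NOT NULL
);
-- Link table: one row per (variant, gene) pair, so a variant may touch several genes.
CREATE TABLE variant_gene (
    variant_id       INTEGER NOT NULL REFERENCES variant(id),
    gene_id          INTEGER NOT NULL REFERENCES gene(id),
    position_in_gene INTEGER,
    PRIMARY KEY (variant_id, gene_id)
);
""")

conn.executemany("INSERT INTO gene (id, symbol) VALUES (?, ?)",
                 [(1, "GENE_A"), (2, "GENE_B")])
conn.executemany("INSERT INTO variant VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, "snv_1", "1", 1000, "A", "G"),
                  (2, "cnv_1", "1", 5000, "N", "<DUP>")])
# snv_1 lies in one gene; cnv_1 overlaps both genes.
conn.executemany("INSERT INTO variant_gene (variant_id, gene_id) VALUES (?, ?)",
                 [(1, 1), (2, 1), (2, 2)])

rows = conn.execute("""
    SELECT v.name, COUNT(*)
    FROM variant v JOIN variant_gene vg ON vg.variant_id = v.id
    GROUP BY v.name
""").fetchall()
genes_per_variant = dict(rows)
```

If the single-gene case is overwhelmingly common, a plain foreign key from variant to gene would be simpler; the link table only pays off when the overlap cases need to be stored faithfully.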

Variants are also discovered in individuals. Individuals have genotypes, each of which is just two copies of some combination of the variant's alleles. I'm not sure about the best design for this at all; perhaps a joint primary key of variant and individual, recording the genotype as the number of mutant alleles (0, 1, 2 for example)?
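The joint-key idea could look roughly like the sketch below, again using Python's sqlite3 module as a stand-in backend. The table and column names and the helper `mutant_allele_count` are hypothetical. The composite primary key allows at most one genotype per (variant, individual) pair, and a CHECK constraint restricts the stored count to 0, 1, or 2.

```python
import sqlite3

def mutant_allele_count(allele_1, allele_2, mutant_allele):
    """Hypothetical helper: count how many of the two alleles are the mutant one."""
    return int(allele_1 == mutant_allele) + int(allele_2 == mutant_allele)

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE variant    (id INTEGER PRIMARY KEY, name  TEXT NOT NULL UNIQUE);
CREATE TABLE individual (id INTEGER PRIMARY KEY, label TEXT NOT NULL UNIQUE);
CREATE TABLE genotype (
    variant_id          INTEGER NOT NULL REFERENCES variant(id),
    individual_id       INTEGER NOT NULL REFERENCES individual(id),
    mutant_allele_count INTEGER NOT NULL CHECK (mutant_allele_count IN (0, 1, 2)),
    PRIMARY KEY (variant_id, individual_id)
);
""")

conn.execute("INSERT INTO variant VALUES (1, 'snv_1')")
conn.execute("INSERT INTO individual VALUES (1, 'ind_1')")
# A heterozygous individual (one reference A, one mutant G) is stored as 1.
conn.execute("INSERT INTO genotype VALUES (1, 1, ?)",
             (mutant_allele_count("A", "G", "G"),))

# A second genotype for the same variant and individual violates the joint key.
try:
    conn.execute("INSERT INTO genotype VALUES (1, 1, 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

stored = conn.execute(
    "SELECT mutant_allele_count FROM genotype WHERE variant_id = 1 AND individual_id = 1"
).fetchone()[0]
```

An individual with no genotype for a variant simply has no row in this table. In an ORM, a surrogate key plus a uniqueness constraint over the two foreign keys is a common substitute for a composite primary key.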

So my question is (sorry for all the preamble and bio talk): what do we think is the best, or a better, design for these three things: Variants, the main thing I want to store information about, and Genes and Individuals, both essential for any downstream analyses?

Any advice is much appreciated. Again sorry for the somewhat ephemeral nature of the question.

You realise that this is not a trivial question? A few questions: 1) do you have "samples" that are not yet annotated? 2) do you have "persons" that are not yet classified to have any of the genotypes? 3) do you actually have to store the DNA sequences, or only the "abstract" genotypes (like: "has a type-B locusXYZ") — wildplasser, Jun 24 '12 at 22:50
I'm aware it's a difficult question. To answer your questions in order: 1) Variants will be annotated before being inserted into the database, though after the initial insertion, batches of data will be released that may contain new annotation as well as new variants, genes, and individuals. 2) Not every person will have a genotype, even after all the sequence analysis is finished. 3) The DNA sequences don't need to be stored. Thanks — Davy Kavanagh, Jun 24 '12 at 23:51
cross-posted on biostar: http://www.biostars.org/post/show/47587 — Pierre, Jun 25 '12 at 06:43
@DavyKavanagh: I would like to point you to a type of information modelling usually referred to as `fact-type modelling` (FCO_IM, ORM). This methodology allows modellers to work closely with subject-matter experts, communicating in natural language. The problem area (universe of discourse, UoD) is presented as a series of simple facts (propositions), each of which can be flagged by an expert as true or false. Once done, the model generates sentences back to the subject expert for evaluation. Anyway, see here: http://en.wikipedia.org/wiki/Object-Role_Modeling — Damir Sudarevic, Jun 25 '12 at 16:59
(cont.) When finally happy with the model, the tool will generate tables (ER) or classes (UML) from it. — Damir Sudarevic, Jun 25 '12 at 17:02

score 2 · Accepted Answer · answered Jun 30 '12 at 00:14

Well, I know nothing about genes, nor do I speak bio-lingo. However, I have gathered some propositions from your question and Wikipedia and came up with this, mostly as an exercise in modelling using the FCO approach. Here are some statements; you should be able to flag each one as true or false.

Gene is a name given to some stretches of DNA.
Gene occupies a given position on a chromosome.
Chromosome is a single piece of coiled DNA containing many genes.
Allele is one of multiple alternative forms of a single gene.
Allele is a gene.
Variant is a DNA sequence.
Variant spans gene.
Variant spans allele.
Genotype is two copies of alleles.
Phenotype is person's observable trait.
Genotype affects phenotype.
One phenotype can be affected by many genotypes.
One genotype may affect many phenotypes.
Person has many genotypes.
Person has many observed phenotypes.
Variant can be discovered in a person.
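As a rough illustration of how such propositions turn into tables, the two statements about genotypes and phenotypes affecting each other describe a many-to-many relationship, which is usually stored in a link table. The sketch below uses Python's sqlite3 module; every table name, column name, and sample value is a hypothetical placeholder rather than part of the original model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE genotype  (id INTEGER PRIMARY KEY, label TEXT NOT NULL UNIQUE);
CREATE TABLE phenotype (id INTEGER PRIMARY KEY, trait TEXT NOT NULL UNIQUE);
-- "Genotype affects phenotype", many-to-many in both directions.
CREATE TABLE genotype_phenotype (
    genotype_id  INTEGER NOT NULL REFERENCES genotype(id),
    phenotype_id INTEGER NOT NULL REFERENCES phenotype(id),
    PRIMARY KEY (genotype_id, phenotype_id)
);
""")

conn.executemany("INSERT INTO genotype VALUES (?, ?)",
                 [(1, "genotype_1"), (2, "genotype_2")])
conn.executemany("INSERT INTO phenotype VALUES (?, ?)",
                 [(1, "trait_1"), (2, "trait_2")])
# genotype_1 affects both traits; trait_1 is affected by both genotypes.
conn.executemany("INSERT INTO genotype_phenotype VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1)])

phenotypes_of_genotype_1 = conn.execute(
    "SELECT COUNT(*) FROM genotype_phenotype WHERE genotype_id = 1"
).fetchone()[0]
genotypes_of_phenotype_1 = conn.execute(
    "SELECT COUNT(*) FROM genotype_phenotype WHERE phenotype_id = 1"
).fetchone()[0]
```

Each "X has many Y" statement in the list would map to a foreign key or link table in the same way, once the statement has been confirmed as true.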

[Image: not reproduced]

Wow. Very cool. Thank you very much. You got almost every base assumption right. I wouldn't call an allele a gene, nor does a variant span alleles. But seriously dude... speechless at your effort here! — Davy Kavanagh, Jul 01 '12 at 10:35
