44

sperm factoid [Source]

In an episode of the BBC show QI - Quite Interesting (Series J, Episode 1) Stephen Fry said:

How much information do you think is in the DNA of one little sperm...?

It's 37.5 MB...

...a normal male ejaculation, if there is such a thing, is equivalent of 15,875 GB. That's about 7500 laptops worth of information...


The show's Twitter page summarized it:

A sperm has 37.5 MB of DNA info.
One ejaculation transfers 15,875 GB of data, equivalent to that held on 7,500 laptops.


(with "200 million sperm per ejaculation" one would actually get 7150 TB;
but I'm more interested in where the 37.5 MB number comes from)
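The multiplication in the parenthetical can be checked with a short sketch. The only inputs are the figures quoted above; the choice between binary (1024-based) and decimal (1000-based) units is an assumption, and it is what separates roughly 7150 TB from 7500 TB.

```python
# Figures quoted in the question; the unit convention is an assumption.
MB_PER_SPERM = 37.5
SPERM_PER_EJACULATION = 200_000_000
SHOW_TOTAL_GB = 15_875

# Naive total: every sperm counted as a separate 37.5 MB payload.
total_mb = MB_PER_SPERM * SPERM_PER_EJACULATION

total_tb_binary = total_mb / 1024**2   # 1 TB = 1024 * 1024 MB
total_tb_decimal = total_mb / 1000**2  # 1 TB = 1000 * 1000 MB

# Working backwards: how many sperm would give the show's 15,875 GB total?
implied_sperm_count = SHOW_TOTAL_GB * 1024 / MB_PER_SPERM

print(f"{total_tb_binary:.0f} TB (binary), {total_tb_decimal:.0f} TB (decimal)")
print(f"sperm count implied by the show's total: {implied_sperm_count:,.0f}")
```

Under these assumptions the show's total corresponds to a few hundred thousand sperm rather than 200 million, so the two quoted numbers do not obviously come from the same sperm count.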


My Question:

  • Does the DNA of one sperm contain 37.5 MB of information?
Oliver_C
  • 15
    I was very annoyed when this was broadcast, as the claim of 7500TB is clearly false. Each sperm is approximately a random shuffle of 1/2 the DNA of the parent and so 200 million selections of 1/2 the parental DNA is not going to multiply the information contained by 200 million! The 37.5MB sounds reasonable order of magnitude, exact values will depend on how you encode the information etc. I would crunch the numbers myself but would that be acceptable as an answer? – Nick Sep 18 '12 at 10:39
  • 2
    Wikipedia seems to claim a figure between 2MB (haploid difference from standard reference) and 700-800MB (complete haploid genome). Not sure what set of approximations QI used to get 37.5MB. – Nick Sep 18 '12 at 10:55
  • If it's _2 bits per base pair_ and there are 3.2 billion base pairs, then it would be 763 MB. Makes me wonder where "37.5" comes from. – Oliver_C Sep 18 '12 at 11:22
  • @Oliver_C the entropy in DNA is less than 2 bits per base pair, ~1.75, but that only lowers it to 625MB-667MB. If you take just the differences from the human reference sequence, you can get down to 2MB. 37.5MB seems a rather odd size. – Nick Sep 18 '12 at 11:28
  • 6
    "equivalent of 15,875 GB" is total BS, exact copy of same information is not extra information. It's like saying that copying "lorem ipsum" few thousand times is "equivalent to contents of the Library of Congress" – vartec Sep 18 '12 at 11:47
  • 1
    It depends on the encoding used and the contents of the DNA. I can create an encoding whereby if the first bit is 1, then the DNA is my DNA, if it's 0 then the DNA follows, and in that case it would be 1 bit for my DNA and more for the others. So the answer is: it depends – Andreas Bonini Sep 18 '12 at 11:52
  • @coleopterist Dr. Frank Scali has divided the 763MB figure down to include only protein-coding DNA (which is ~1% of the genome). I don't think that is a good approach to information transfer, since the rest of the genome contains functional regulatory elements (see the ENCODE project http://genome.ucsc.edu/ENCODE/), which estimates 80% is functional. However, 100% of the bases are readable as information. He has also naively multiplied up by MB/sperm to get total data, which is incorrect, as the sperm all share the same parental cell DNA – Nick Sep 18 '12 at 13:13
  • This really depends upon the encoding scheme: if you are using ASCII to encode, then you are talking 8 bits per base pair, which is going to up the numbers a lot. A lot of genomics data is passed around as ASCII, so it's not unreasonable to use it for the calculation either. – rjzii Sep 18 '12 at 13:16
  • 1
    I'd just like to point out that even though a lot of data is repeated in the calculation, physically speaking, it is still passed down (you cannot do such a thing as compress sperm DNA information). – Zonata Sep 18 '12 at 13:22
  • 6
    @RobZ: Depends if they really talk about *information* or *data*. In coding theory the information content won't be changed by a lossless encoding scheme. It just determines how much data you need to represent it. Of course, 200 million times the same messages gives you 200 million times the data size, but no more additional information. – Martin Scharrer Sep 18 '12 at 18:12
  • 1
    Of course it is absolutely naïve to calculate an amount of information from the length of the DNA of a cell. Information stored in DNA does not merely depend on its sequence; it also depends on WHERE the sequence is, how the DNA is spatially folded, whether it is modified (e.g. methylated), what transcription factors and what proteins are present in that specific cell and so on. These types of DNA/computer comparisons, albeit very common, are just (pointless IMO) exercises in style. – nico Sep 18 '12 at 18:26
  • I've heard the sperm count of humans is poor compared to other animals. – Andrew Grimm Sep 19 '12 at 02:45
  • 2
    @Vartec - Actually it's like saying that you could copy Lorem Ipsum enough times to fill a datastore with the same volume of information as contained in the LOC. There is no claim that the data being transferred is not redundant at all. – Chad Sep 19 '12 at 15:39
  • @Chad: the quoted question is "How much **information**", more copies isn't the same as more information. – vartec Sep 19 '12 at 15:43
  • 1
    @Vartec - the question is only about one sperm, not an entire ejaculate... however I would still say that even though it is mostly the same information over and over, you cannot know what it is until you read it, and you can read each one, so each one is in fact a unique copy of information. If you download the same 1 MB file 1024 times it is still 1 GB of data that was downloaded. If it were a pointer to the information then I would agree. – Chad Sep 19 '12 at 15:50
  • I would not qualify the transport of a data storage unit as data transfer... – inf3rno May 30 '13 at 07:39
  • I can't even understand? Who pulled this 37 MB out of the air and was like yeah it's the equivalent...blaahhhh, how do you even convert it into computer data lol?? –  Jul 30 '13 at 21:32
  • I think the real question here is: why can't we use DNA to encode information? In other words, store 37MB of our own data by creating an artificial sperm or modifying an existing one. If we modify existing ones, the, uh, large supply should make, uh, "disk drives" much cheaper and make the "dick drive" pun finally real :) –  May 14 '18 at 19:31

2 Answers

46

I am not sure where these numbers come from and the answer depends on how you encode the genome data and if you define all the redundancy (unnecessary, repetitive data) as "information".

First of all, the human genome contains somewhere around 3.1 (men) to 3.2 (women) billion base pairs. Since the X chromosome is roughly three times as long as the Y chromosome, women have a higher total genome length than men.

Source: "Human Genome Assembly Information" from the "Genome Reference Consortium"

A base pair is made of two of the four nucleobases adenine, cytosine, guanine and thymine, but only the four combinations AT, TA, CG and GC are possible as the A and T nucleobases won't bond with the C and G nucleobases and vice versa. These four combinations can be encoded with two bits, so that 6.2-6.4 gigabits or about 750 megabytes are required to store an exact copy of the genome.
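The two-bits-per-base-pair arithmetic above can be sketched as follows. The particular bit assignment (A=00, C=01, G=10, T=11), the function names and the zero padding of the last byte are illustrative assumptions, not a standard file format.

```python
# Hypothetical 2-bit code: one base per strand position is enough, because
# the partner base on the other strand is determined by the pairing rules.
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(sequence: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | ENCODING[base]
        # Left-align a short final chunk by padding with zero bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def raw_size_bytes(base_pairs: int, bits_per_base_pair: float = 2) -> float:
    """Storage needed for an exact copy at the given bits per base pair."""
    return base_pairs * bits_per_base_pair / 8


for base_pairs in (3_100_000_000, 3_200_000_000):
    size_mb = raw_size_bytes(base_pairs) / 1024**2
    print(f"{base_pairs:,} bp -> {size_mb:.0f} MB")
```

With binary megabytes this lands between roughly 740 and 765 MB, which is where the "about 750 megabytes" figure comes from.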

Now, even if you need 750 megabytes to store the "raw data" from a human genome, a computer scientist, at least, would have a hard time calling all of it "information". For example, if you record 74 minutes of complete silence on a CD, the disc contains roughly 750 megabytes of "data" as well, but no actual "information". Large parts of the human genome are repetitive, only a very small part actually differs between individuals, and among those differences, several base pair sequences occur in only a few well-defined variants.

There is actually some research on how to store a human genome as compactly as possible, since genome databases are most likely going to expand rapidly and scientists need efficient ways to share data. Some tools are available for this purpose, e.g. DNAzip, which, using a ~5 gigabyte dictionary (permanent data), can compress a human genome down to roughly 4 megabytes.

Source: "Human genomes as email attachments"
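The general idea of storing only what differs from shared reference data can be illustrated with a toy sketch. This is not DNAzip's actual algorithm: the function names are hypothetical, and it handles only single-base substitutions between sequences of equal length, ignoring insertions and deletions.

```python
from typing import List, Tuple

Variant = Tuple[int, str]  # (0-based position, base found in the sample)


def diff_against_reference(reference: str, sample: str) -> List[Variant]:
    """List the positions where the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("toy sketch only supports equal-length sequences")
    return [
        (position, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


def apply_diff(reference: str, variants: List[Variant]) -> str:
    """Rebuild the sample from the shared reference plus its variant list."""
    bases = list(reference)
    for position, base in variants:
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
variants = diff_against_reference(reference, sample)
print(variants)
print(apply_diff(reference, variants) == sample)
```

The variant list is small only because both sides already hold the reference, which is the same trade-off as the multi-gigabyte dictionary mentioned above.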

Tor-Einar Jarnbjo
  • C, A, G and T are *nucleotides*, not proteins. Proteins are long strings of amino acids; nucleotides are small cyclic molecules. – matt_black Sep 19 '12 at 15:09
  • @matt_black: Aren't they actually nucleobases, to be very precise? – Tor-Einar Jarnbjo Sep 19 '12 at 16:32
  • 1
    @Tor-EinarJarnbjo: A, C, G and T can be used to identify both the nucleobase (for instance adenine) and the nucleoside (for instance adenosine). – nico Sep 19 '12 at 17:37
  • 8
    The second number is interesting but not really an answer to the question: the *information content* is certainly more than 4 MB since you can’t just ignore the dictionary size. – Konrad Rudolph Sep 19 '12 at 22:51
  • The correct information content is comparable to the size of the genome, about 1Gbyte. There is only a small factor of redundant or useless information. – Ron Maimon Sep 20 '12 at 02:46
  • __Speculation:__ _37.5MB_ is _5%_ of _750MB_. Why 5%? Until [recently](http://www.personal.psu.edu/dhk3/blogs/DoubleHelixLaw/2012/09/trashing-junk-dna-the-notorious-80.html) it was believed that most of our DNA is "junk", and I often heard that [95% was junk](http://www.abc.net.au/catalyst/stories/s898887.htm). So whoever came up with "37.5MB" might have dismissed 95% of the 750MB as _non-information_. – Oliver_C Sep 20 '12 at 10:10
  • @RonMaimon No, it’s substantially less. Maybe not 37MB (I don’t remember where this number comes from but it’s frequently quoted in bioinformatics – maybe Oliver is right but I doubt it: most scientists have known for quite a long time that “junk DNA” isn’t holding up to scrutiny). Nevertheless, DNA contains quite a few low-complexity regions and can be compressed down to at least 700 MB. – Konrad Rudolph Sep 20 '12 at 12:04
  • 2
    I have to say that I’m unhappy that this is the accepted answer. The 37 MB number is in the ballpark of often-quoted numbers in bioinformatics. Whether or not it’s correct it requires some explanation, and this is entirely lacking here. Unfortunately, I can’t for the life of me remember how the number was derived. – Konrad Rudolph Sep 20 '12 at 12:08
9

For a simpler answer, you can just look at the size of an ASCII-encoded text file containing the human genome's information. This, of course, is not the information content of the genome which, as you can see from the answer above and the comments in this thread, is not that easy to define.

In any case, when biologists work on the genome sequence, it tends to be in the form of FASTA sequences. The human genome as a multi-FASTA file is ~3 GB. See, for example, the file UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa obtained when extracting this archive.
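As a rough check of that file size, here is a minimal sketch that assumes one ASCII byte per base plus one newline per line of sequence. The 60-character line width and the zero default for header bytes are assumptions for illustration, not properties read from the file named above.

```python
def fasta_size_bytes(bases: int, line_width: int = 60, header_bytes: int = 0) -> int:
    """Estimate the size of a FASTA text file holding the given number of bases."""
    full_lines, remainder = divmod(bases, line_width)
    newlines = full_lines + (1 if remainder else 0)  # one newline per sequence line
    return bases + newlines + header_bytes


estimate = fasta_size_bytes(3_100_000_000)
print(f"{estimate / 1000**3:.2f} GB as plain text")
```

At one byte per base, the text form is about four times the size of a 2-bit encoding of the same sequence.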

Again, I stress that this is not the information content of the genome. For those of us who are not information theorists though, it gives an easy way of picturing the genome's size in a format we are familiar with: text.

terdon