4

I work in biology, specifically with DNA, and a recurring problem is the size of the data that comes from sequencing a genome.

For those of you who don't have a background in biology, I'll give a quick overview of DNA sequencing. DNA consists of four letters: A, T, G, and C, the specific order of which determines what happens in the cell.

A major problem with DNA sequencing technology, however, is the size of the resulting data (for a whole genome, often many gigabytes).

I know that the size of an int in C varies from computer to computer, but an int still holds far more information than is needed to represent four choices. Is there a way to define a type for a 'base' that only takes up 2 or 3 bits? I've looked into defining a structure, but I'm afraid that isn't what I'm looking for. Thanks.

Also, would this work better in other languages (maybe a higher-level one like Java)?

Will Pike
  • A base can only be A, T, G, or C, meaning it requires just 2 bits per base to encode; thus a byte (8 bits) can hold 4 bases. If you're willing to do some bitwise manipulation, you can achieve this in most languages. When the genome is stored out of memory, it's also a good candidate for compression: I imagine a lot of patterns repeat in a full genome, making it highly compressible (but still very large). (A minimal packing sketch follows these comments.) – Chris Hayes Jun 25 '14 at 03:35
  • @ChrisHayes: Of course this means your search function has to support searching at bit offsets within bytes. You can write a custom strstr-like function for doing that but you don't have existing tools like strstr at your disposal. – R.. GitHub STOP HELPING ICE Jun 25 '14 at 03:52
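
For illustration, here is a minimal sketch of that 2-bit packing in C. The helper names are made up, and A=00, C=01, G=10, T=11 is just one arbitrary choice of codes:

#include <stddef.h>

/* Map a base to its 2-bit code: A=00, C=01, G=10, T=11. */
static unsigned base_code(char b)
{
    switch (b) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    default:  return 3;  /* 'T' */
    }
}

/* Pack n bases, 4 per byte, first base in the high bits of each byte. */
static void pack_bases(const char *bases, size_t n, unsigned char *out)
{
    for (size_t k = 0; k < n; k++) {
        if (k % 4 == 0)
            out[k / 4] = 0;  /* start a fresh byte */
        out[k / 4] |= base_code(bases[k]) << (6 - 2 * (k % 4));
    }
}

/* Read base k back out of the packed buffer. */
static char unpack_base(const unsigned char *packed, size_t k)
{
    return "ACGT"[(packed[k / 4] >> (6 - 2 * (k % 4))) & 3u];
}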

6 Answers

2

Can't you just stuff two ATGC sets into one byte then? Like:

0 1 0 1 1 0 0 1
A T G C A T G C

So this one byte would represent TC,AC?
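
For concreteness, a minimal C sketch of this bitmask idea (the names here are made up):

/* One-hot flags inside a nibble, matching the A T G C bit order above. */
enum { A_BIT = 8, T_BIT = 4, G_BIT = 2, C_BIT = 1 };

/* Pack two base sets into one byte, the first set in the high nibble. */
unsigned char pack_sets(unsigned first, unsigned second)
{
    return (unsigned char)((first << 4) | (second & 0x0F));
}

/* pack_sets(T_BIT | C_BIT, A_BIT | C_BIT) == 0x59, i.e. 01011001 -- "TC,AC". */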

Brandon Prudent
  • That's not an efficient way; how can you represent AAAA in one byte? – user253751 Apr 06 '16 at 21:27
  • Well, you can't in this setup; it would cost two bytes. Still better than the four it would normally cost. If you know something about the frequency of the sets, that might help in determining a better solution, though. – Brandon Prudent Apr 09 '16 at 13:59
  • More to the point, it's a variable-length coding. You can store ATGCATGC (8 bases!) in one byte, but not AAA (only 3 bases). Whereas the simpler 2-bit-per-base coding gives you a fixed rate of 4 bases per byte. – user253751 Apr 11 '16 at 00:32
1

If you want to use Java, you're going to have to give up some control over how big things are. The smallest you can go, AFAIK, is the byte primitive, which is 8 bits (-128 to 127).

Although I guess this is debatable, it seems like Java is more suitable for broad systems work than for the fast, efficient nitty-gritty detail work you would generally do in C.

If there is no requirement that you hold the entire dataset in memory at once, you might even try using a managed database like MySQL to store the base information and then read that in piece by piece.

Jephron
  • Java works just fine for the nitty-gritty too, as it has all the tools on board that C or C++ have. And on top of that, if you really do need to go native for some things, you have JNA to do it. Where it comes to biology and scientific applications, Java is a really good fit because of all the existing third-party stuff to visualize data and such. – Gimby Jun 25 '14 at 08:08
1

If I were writing similar code, I would store the nucleotide identifier in a byte, using the values 1, 2, 3, and 4 for A, T, G, and C. If you later decide to handle RNA as well, you can just add a fifth element with value 5 for U. If you really dig into the project, I would recommend making a class for codons. In this class you can specify whether the codon is part of an intron/exon, whether it is a start or stop codon, and so on. And on top of this, you can make a gene class, where you can specify the promoter regions, etc.

If you will have big DNA and RNA sequences that need a lot of computation, then I strongly recommend C++, and for scientific computations Fortran. (The total human genome is 1.4 Gb.)

Also, because there are many repetitive sequences, structuring the genome into codons is useful; this way you save a lot of memory (you just keep a reference to a codon object and do not have to build the object N times).

Also, by structuring into codons you can predefine your classes, and there are only 64 of them, so your whole genome becomes just an ordered list of references. So in my opinion, making the codon the base unit is much more efficient.
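
A rough C sketch of that codon-as-base-unit idea, with a struct standing in for the class; the names, the 6-bit index layout, and the A=0/C=1/G=2/T=3 coding are just one possible choice, not taken from the answer:

#include <stdio.h>

/* A codon is 3 bases, so there are 4*4*4 = 64 possible codons.
   Each gets a 6-bit index; the genome is just an array of indices. */
typedef struct {
    char letters[4];  /* e.g. "ATG", NUL-terminated */
    int  is_start;    /* ATG */
    int  is_stop;     /* TAA, TAG, TGA */
} Codon;

static Codon codon_table[64];
static const char BASES[4] = { 'A', 'C', 'G', 'T' }; /* A=0, C=1, G=2, T=3 */

/* Build the 64 predefined codons once; every gene then just references them. */
static void init_codon_table(void)
{
    for (int i = 0; i < 64; i++) {
        Codon *c = &codon_table[i];
        c->letters[0] = BASES[(i >> 4) & 3];
        c->letters[1] = BASES[(i >> 2) & 3];
        c->letters[2] = BASES[i & 3];
        c->letters[3] = '\0';
        c->is_start = (i == 14);                       /* ATG = 0,3,2 */
        c->is_stop  = (i == 48 || i == 50 || i == 56); /* TAA, TAG, TGA */
    }
}

int main(void)
{
    init_codon_table();
    unsigned char gene[] = { 14, 27, 48 };  /* an ordered list of references */
    for (size_t k = 0; k < sizeof gene; k++)
        printf("%s%s", codon_table[gene[k]].letters,
               codon_table[gene[k]].is_stop ? " (stop)\n" : " ");
    return 0;
}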

Geseft
1

The link below is to one of my research papers. Check it out, and let me know if you need more details about the implementation if you find it useful.

GenCodeX - Kaliuday Balleda

Kaliuday
0

Try a char datatype.

A char is generally the smallest addressable memory unit in C/C++. Most systems I've used have it at 1 byte.

The reason you can't use anything like one or two bits is that the CPU is already pulling in that extra data anyway; memory is addressed by the byte, not by the bit.

Take a look at this for more details
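
A quick way to check this on your own system, using CHAR_BIT from the standard <limits.h> header:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* sizeof(char) is 1 by definition; CHAR_BIT says how many bits that is. */
    printf("one char = %d bits\n", CHAR_BIT);  /* 8 on virtually all modern systems */
    return 0;
}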

SailorCire
0

The issue is not just which data type will hold the smallest value, but also what is the most efficient way to access bit-level memory.

With my limited knowledge, I might try setting up a bit array of ints (which are, from my understanding, the most efficient unit for accessing memory as a bit array; I may be mistaken, but the same principles apply if there is a better one), then use bitwise operators to write and read.

Here is some partial code that should give you an idea of how to proceed with 2-bit definitions and a large array of ints. Assume a pointer (a) set to a large array of ints:

#include <stdio.h>

unsigned int *a, dna[LARGE_NUMBER]; /* LARGE_NUMBER: however many ints you need */
unsigned int i, da, dt, dg, dc;     /* bits left in the current int; the four 2-bit codes */
int b;                              /* getchar() returns an int */
a = dna;
*a = 0;

Setting up bit definitions:

For A:

da = 0;
da = ~da;       /* all 1s */
da = da << 2;   /* ...11111100 */
da = ~da;       /* da is now binary 11 */

For G:

dg = 0;
dg = ~dg;       /* all 1s */
dg = dg << 1;   /* ...11111110 */
dg = ~dg;       /* 00000001 */
dg = dg << 1;   /* dg is now binary 10 */

and so on for T (dt) and C (dc).

For the loop:

i = sizeof(int) * 8;                  /* bytes into bits: space free in the current int */

while ((b = getchar()) != EOF) {
    unsigned int code;

    if (b == 'a' || b == 'A')
        code = da;
    else if (b == 't' || b == 'T')
        code = dt;
    else if (b == 'g' || b == 'G')
        code = dg;
    else if (b == 'c' || b == 'C')
        code = dc;
    else
        continue;                     /* skip (or report) anything that isn't a base */

    if (i < 2) {                      /* keeping track of how much room is left in the int */
        *++a = 0;                     /* advance to the next 32-bit set */
        i = sizeof(int) * 8;
    }
    *a = (*a << 2) | code;            /* shift the word left and drop the new base in */
    i -= 2;
}

And so on. This stores 32 bits in each int (i.e. 16 letters per int, assuming 32-bit ints). For array size maximums, see The maximum size of an array in C.
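
For what it's worth, here is a sketch of reading the letters back out of one fully packed int. It assumes 32-bit ints, and it assumes dt and dc were defined as binary 01 and 00 (the text above only pins down da = 11 and dg = 10):

/* Print the 16 bases in one full int, first-encoded base first. */
void print_packed(unsigned int word)
{
    for (int shift = 30; shift >= 0; shift -= 2) {
        switch ((word >> shift) & 3u) {
        case 3: putchar('A'); break;  /* da == 11 */
        case 2: putchar('G'); break;  /* dg == 10 */
        case 1: putchar('T'); break;  /* dt == 01, assumed */
        case 0: putchar('C'); break;  /* dc == 00, assumed */
        }
    }
    putchar('\n');
}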

I am speaking only from a novice C perspective. I would think that machine language would do a better job of what you are asking for specifically, though I'm certain there are high-level solutions out there. I know that FORTRAN is well regarded when it comes to the sciences, but I understand that is due to its computational speed, not necessarily its storage efficiency (though I'm sure it's not lacking there); an interesting read here: http://arstechnica.com/science/2014/05/scientific-computings-future-can-any-coding-language-top-a-1950s-behemoth/. I would also look into compression, though I sadly have not learned much about it myself.

A source I turned to when I was looking into bit-arrays: http://www.mathcs.emory.edu/~cheung/Courses/255/Syllabus/1-C-intro/bit-array.html

0KL