How to compress an alphabet consisting of DNA sequence

Question

I want to compress a DNA sequence using a technique other than Huffman or adaptive Huffman coding. I'm using C# as the programming language. Can anyone point me to an algorithm? Note: I want lossless compression.

DNA contains lots of base sequence repeats. Any compression method with a dictionary will work well. Like Deflate. — Hans Passant, Dec 19 '11 at 18:22
You could adapt this [C++ LZW example](http://marknelson.us/2011/11/08/lzw-revisited/), I did recently and it worked very well. — user7116, Dec 19 '11 at 18:40
@HansPassant: Yes, but I want to minimize the average code length in order to raise the compression ratio. – Sara S., Dec 20 '11 at 20:10

yas4891 · Accepted Answer · 2011-12-19T18:35:03.140


With DNA sequences you have 4 possible states, namely

Guanine (G, 00)
Cytosine (C, 01)
Adenine (A, 10)
Thymine (T, 11)

You can use two bits to store each of these four states, using the values in parentheses. With this simple method you can pack four bases into one byte, a fourfold reduction compared with storing one character per byte.
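The packing described above could look roughly like the following. This is only a minimal sketch: the class name `DnaPacker` and its methods are made up for illustration, and it assumes the input contains nothing but the uppercase letters G, C, A and T (anything else throws). The base count has to be kept alongside the bytes, because the last byte may be only partly filled.

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class DnaPacker
{
    // Index in this string is the 2-bit code: G=00, C=01, A=10, T=11.
    private const string Bases = "GCAT";

    // Packs four bases per byte; the first base goes into the two highest bits.
    public static byte[] Pack(string sequence)
    {
        var packed = new byte[(sequence.Length + 3) / 4];
        for (int i = 0; i < sequence.Length; i++)
        {
            int code = Bases.IndexOf(sequence[i]);
            if (code < 0)
                throw new ArgumentException(
                    "Unexpected base '" + sequence[i] + "' at index " + i);

            int shift = 6 - 2 * (i % 4);          // 6, 4, 2, 0
            packed[i / 4] |= (byte)(code << shift);
        }
        return packed;
    }

    // Reverses Pack. baseCount is needed because padding bits in the
    // last byte are indistinguishable from the code for G (00).
    public static string Unpack(byte[] packed, int baseCount)
    {
        var chars = new char[baseCount];
        for (int i = 0; i < baseCount; i++)
        {
            int shift = 6 - 2 * (i % 4);
            chars[i] = Bases[(packed[i / 4] >> shift) & 3];
        }
        return new string(chars);
    }
}
```

For example, `DnaPacker.Pack("GATTACA")` stores the 7 bases in 2 bytes, and `DnaPacker.Unpack(packed, 7)` gives back the original string.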

Update
As @kol mentioned, you could then apply practically any lossless compression algorithm to shrink the packed data further. .NET currently ships with two compression methods (Deflate and GZip, in System.IO.Compression), and more can be found in the open-source SharpZipLib library.
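As a rough illustration of that second step, a byte array can be run through `DeflateStream` from System.IO.Compression like this. The wrapper class `DeflateHelper` is a hypothetical name used only for this sketch. How much this gains on already-packed data depends on how repetitive the sequence is, so it is worth measuring on real input.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical wrapper around the built-in DeflateStream.
public static class DeflateHelper
{
    public static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            // The DeflateStream has to be disposed before the result is read,
            // otherwise the final compressed block may not be written out.
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
            {
                deflate.Write(data, 0, data.Length);
            }
            // ToArray still works after the MemoryStream has been closed.
            return output.ToArray();
        }
    }

    public static byte[] Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

Swapping `DeflateStream` for `GZipStream` gives the GZip variant with the same calls; it adds a header and checksum around the same Deflate data.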

edited Dec 19 '11 at 18:35

answered Dec 19 '11 at 18:16


+1 After this encoding, the resulting byte array can be compressed by a lossless compression algorithm. Check out System.IO.Compression: http://msdn.microsoft.com/en-us/library/3z72378a.aspx – kol Dec 19 '11 at 18:20
@kol Good point. I will incorporate this into the answer as Hans Passant pointed out that DNA contains a lot of repetition. – yas4891 Dec 19 '11 at 18:28
