I want to compress a DNA sequence using a technique other than the Huffman and Adaptive Huffman algorithms. I'm using C# as the programming language. Can anyone point me to a suitable algorithm? Note: I need lossless compression.
- DNA contains lots of repeated base sequences, so any dictionary-based compression method, such as Deflate, will work well. – Hans Passant Dec 19 '11 at 18:22
- You could adapt this [C++ LZW example](http://marknelson.us/2011/11/08/lzw-revisited/); I did so recently and it worked very well. – user7116 Dec 19 '11 at 18:40
- @HansPassant: Yes, but I want to minimize the average code length in order to increase the compression ratio. – Sara S. Dec 20 '11 at 20:10
1 Answer
With DNA sequences you have 4 possible states, namely
- Guanine (G, 00)
- Cytosine (C, 01)
- Adenine (A, 10)
- Thymine (T, 11)
Two bits are enough to store these four states, using the codes shown in brackets. With this simple method you can pack four bases into one byte, a fourfold reduction compared with storing each base as an 8-bit character.
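The packing described above could be sketched in C# roughly as follows. This is only an illustration under a few assumptions: the class name `DnaPacker` and its methods are made up for this example, the first base of each group of four is placed in the two highest bits of the byte, and any character other than G, C, A or T is rejected rather than encoded.

```csharp
using System;

// Hypothetical helper, not part of .NET: packs DNA bases at two bits each.
public static class DnaPacker
{
    // The index in this string is the two-bit code from the list above:
    // G = 00, C = 01, A = 10, T = 11.
    private const string Alphabet = "GCAT";

    public static byte[] Pack(string sequence)
    {
        // Four bases per byte, rounded up; unused trailing bits stay zero.
        var packed = new byte[(sequence.Length + 3) / 4];
        for (int i = 0; i < sequence.Length; i++)
        {
            int code = Alphabet.IndexOf(char.ToUpperInvariant(sequence[i]));
            if (code < 0)
                throw new ArgumentException("Unsupported base '" + sequence[i] + "' at index " + i);

            // Assumption: the first base of each group goes into the highest two bits.
            packed[i / 4] |= (byte)(code << (6 - 2 * (i % 4)));
        }
        return packed;
    }

    public static string Unpack(byte[] packed, int baseCount)
    {
        var bases = new char[baseCount];
        for (int i = 0; i < baseCount; i++)
        {
            int code = (packed[i / 4] >> (6 - 2 * (i % 4))) & 3;
            bases[i] = Alphabet[code];
        }
        return new string(bases);
    }
}
```

Because the last byte may be only partly filled, the number of bases has to be stored alongside the packed bytes; otherwise the zero padding bits would be read back as extra G bases.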
Update
As @kol mentioned, you could then apply practically any lossless compression algorithm to the packed bytes to shrink the data further.
Currently .NET ships with two compression methods (Deflate and GZip), and more can be found in the open-source SharpZipLib library.
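As a rough sketch of that second step, the packed byte array could be run through `DeflateStream` from `System.IO.Compression`. The class name `PackedDnaCompressor` and its method names are assumptions made for this example; only `DeflateStream`, `CompressionMode` and `MemoryStream` come from the framework.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical wrapper around the framework's DeflateStream.
public static class PackedDnaCompressor
{
    public static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            // leaveOpen: true keeps the MemoryStream usable after the
            // DeflateStream is disposed, which flushes the final block.
            using (var deflate = new DeflateStream(output, CompressionMode.Compress, true))
            {
                deflate.Write(data, 0, data.Length);
            }
            return output.ToArray();
        }
    }

    public static byte[] Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

Whether this second pass pays off depends on the data: for very short inputs the Deflate output can be larger than the input, so it is worth comparing sizes on real sequences.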

yas4891
- +1 After this encoding, the resulting byte array can be compressed by a lossless compression algorithm. Check out System.IO.Compression: http://msdn.microsoft.com/en-us/library/3z72378a.aspx – kol Dec 19 '11 at 18:20
- @kol Good point. I will incorporate this into the answer, as Hans Passant pointed out that DNA contains a lot of repetition. – yas4891 Dec 19 '11 at 18:28