I'm representing nucleotides A,C,G,T as 0,1,2,3, and afterwards I need to translate the sequence representing as quaternary to decimal. Is there a way to achieve this in perl? I'm not sure if pack/unpack can do this or not.
-
1Please describe what you want more clearly and show us what you have tried! – Jean Sep 20 '12 at 20:37
-
So you want `205` from `3031`? – ikegami Sep 20 '12 at 20:45
-
Are you planning on having inputs longer than 8 digits? – ikegami Sep 20 '12 at 20:46
-
So, for DNA sequences, like "ACGTTTCGA", I'm going to convert like $dna =~ tr/ACGT/0123/, after converting, the sequence is a string of numbers, but each digit can only be 0-3, not like decimal which can be 0-9. I want to convert this string of numbers into a decimal integer. It's like the relationship between binary and decimal. As simple as that – lolibility Sep 20 '12 at 21:21
-
There is a problem here in that there is no way of keeping the length of the original sequence. `AAAAAAAAAA` is just zero in decimal, no matter how many bases there are in the sequence – Borodin Sep 20 '12 at 21:23
-
1Gotta ask why do you want to do this? – ikegami Sep 20 '12 at 22:39
3 Answers
I've never used it, but it looks like the Convert::BaseN module would be a good choice. Convert::BaseN - encoding and decoding of base{2,4,8,16,32,64} strings

- 1,070
- 1
- 6
- 7
-
You do not quite get me. For the translate part is easy, use code like $mysequence =~ tr/ACGT/0123/; but after translate, the sequence is represented by a string of numbers. but I want to covert them into a integer, since that string of numbers is quaternary, each digit can only be 0-3, unlike decimal, each digit can be 0-9. It's like a relationship of binary and decimal – lolibility Sep 20 '12 at 21:18
-
That packs the string, but then you have to unpack them to decimal, so that's only half of an answer. – ikegami Sep 20 '12 at 21:32
It is very simple to calculate a base-4 string to decimal by processing each digit in a loop
Note that, on 32-bit machines, you won't be able to represent a sequence longer than sixteen bases
This code shows the idea
use strict;
use warnings;
print seq2dec('ACGTACGTACGTACGT');
sub seq2dec{
my ($sequence) = @_;
my $n = 0;
for (map {index 'ACGT', $_} split //, $sequence) {
$n = $n * 4 + $_;
}
return $n;
}
output
454761243

- 126,100
- 9
- 70
- 144
-
-
on a 64-bit platform you can represent 32 bases, but you would need 64-bit Perl installed. If you keep this as a string then you can store a sequence of indefinite length, but then may as well keep it encoded as ACGT characters – Borodin Sep 20 '12 at 21:16
-
There is a problem here in that there is no way of keeping the length of the original sequence. `AAAAAAAAAA` is just zero in decimal, no matter how many bases there are in the sequence – Borodin Sep 20 '12 at 21:24
-
That's Ok for me, all my sequence of DNA of one program run is the same length, so, do not worry about the same integer representing of A,AA,AAA,or A...A – lolibility Sep 20 '12 at 21:27
-
What I care about is the speed of this, not sure if it's efficient if I use loop to do add up and power – lolibility Sep 20 '12 at 21:30
-
You shouldn't optimise speculatively. Write your program as clearly as possible and then tune it if you need better performance – Borodin Sep 20 '12 at 22:26
-
I would like to understand what you are going to do with these encoded sequences. Is there a good reason not to use a standard. `ACTG` string? – Borodin Sep 20 '12 at 22:31
-
I think is easier for cpu to compare difference of integers rather than characters, especially for billions of reads – lolibility Sep 21 '12 at 15:03
-
And what's more, a sequence of DNA with more characters need more space to store, but if you interpret into an integer only case a byte to store then – lolibility Sep 21 '12 at 15:16
-
String comparison is fast because processors usually a hardware string comparison operator. I have done some benchmarks, and it looks like comparing sixteen-character `ACGT` strings is only about 8% slower than comparing the corresponding 32-bit integers. As for the amount of space used, it is generally only useful to compress data if it allows you to fit it *all* into memory. I promise you, the best way is to write a clear, working program and then optimise it as necessary. It is a waste of time to solve imaginary problems, if only because you can't tell whether you've made a difference – Borodin Sep 21 '12 at 16:53
-
If you're looking for duplicate (identical matches), you'd probably use a hash anyway! – ikegami Sep 21 '12 at 17:13
-
@ikegami: I'm not sure whether *billions of reads* means billions of sequences, or 0.5n(n+1) == billions, but a hash of that size may be unmanageable. It looks more like a database problem to me, but either way I don't think packing into binary is a good solution, especially without a control case for comparison – Borodin Sep 21 '12 at 17:23
-
@Borodin, or a database or whatever, but it would be silly to do O(N^2) numerical or string comparisons!!! – ikegami Sep 21 '12 at 17:28
Base 4 requires exactly 2 bits, so it's easy to handle efficiently.
my $uvsize = length(pack('J>', 0)) * 8;
my %base4to2 = map { $_ => sprintf('%2b', $_) } 0..3;
sub base4to10 {
my ($s) = @_;
$s =~ s/(.)/$base4to2{$1}/sg;
$s = substr(("0" x $uvsize) . $s, -$uvsize);
return unpack('J>', pack('B*', $s));
}
This allows inputs of 16 digits on builds supporting 32-bit integers, and 32 digits on builds supporting 64-bit integers.
It's possible to support slightly larger numbers using floating points: 26 on builds with IEEE doubles, 56 on builds with IEEE quads. This would require a different implementation.
Larger than that would require a module such as Math::BigInt for Perl to store them.
Faster and simpler:
my %base4to16 = (
'0' => '0', '00' => '0', '20' => '8',
'1' => '1', '01' => '1', '21' => '9',
'2' => '2', '02' => '2', '22' => 'A',
'3' => '3', '03' => '3', '23' => 'B',
'10' => '4', '30' => 'C',
'11' => '5', '31' => 'D',
'12' => '6', '32' => 'E',
'13' => '7', '33' => 'F',
);
sub base4to10 {
(my $s = $_[0]) =~ s/(..?)/$base4to16{$1}/sg;
return hex($s);
}

- 367,544
- 15
- 269
- 518
-
Note: This solution loses information on how many leading zero were present (but the OP said that was ok). – ikegami Sep 20 '12 at 22:12
-
2
-
It seems function hex($s) cannot deal with length greater than 16, right? – lolibility Sep 21 '12 at 15:43
-
@lolibility, `hex` can't deal with numbers larger that than your Perl can support, and my post already details what numbers your Perl can support. An hour before I posted my answer, I asked how big of a number you need to deal with, and you still haven't answered... Please answer. – ikegami Sep 21 '12 at 16:10
-
Sorry, didn't notice that, I probably going to deal with sequences consists of 32 nucleotides. And my perl and OS information "This is perl, v5.8.8 built for x86_64-linux-thread-multi" – lolibility Sep 21 '12 at 16:35
-
@lolibility, That means Perl is built for a 64-bit machine, but it doesn't say anything of the size of ints. `perl -V:uvsize` will tell you the size of your ints. If it's 8 (64 bits), hex should be able to deal with strings up to 16 digit, which means it should work with 32 base4 digits. – ikegami Sep 21 '12 at 17:12