-2

I'm representing nucleotides A,C,G,T as 0,1,2,3, and afterwards I need to translate the sequence representing as quaternary to decimal. Is there a way to achieve this in perl? I'm not sure if pack/unpack can do this or not.

lolibility
  • 2,187
  • 6
  • 25
  • 45
  • 1
    Please describe what you want more clearly and show us what you have tried! – Jean Sep 20 '12 at 20:37
  • So you want `205` from `3031`? – ikegami Sep 20 '12 at 20:45
  • Are you planning on having inputs longer than 8 digits? – ikegami Sep 20 '12 at 20:46
  • So, for DNA sequences, like "ACGTTTCGA", I'm going to convert like $dna =~ tr/ACGT/0123/, after converting, the sequence is a string of numbers, but each digit can only be 0-3, not like decimal which can be 0-9. I want to convert this string of numbers into a decimal integer. It's like the relationship between binary and decimal. As simple as that – lolibility Sep 20 '12 at 21:21
  • There is a problem here in that there is no way of keeping the length of the original sequence. `AAAAAAAAAA` is just zero in decimal, no matter how many bases there are in the sequence – Borodin Sep 20 '12 at 21:23
  • 1
    Gotta ask why do you want to do this? – ikegami Sep 20 '12 at 22:39

3 Answers3

1

I've never used it, but it looks like the Convert::BaseN module would be a good choice. Convert::BaseN - encoding and decoding of base{2,4,8,16,32,64} strings

Ron Bergin
  • 1,070
  • 1
  • 6
  • 7
  • You do not quite get me. For the translate part is easy, use code like $mysequence =~ tr/ACGT/0123/; but after translate, the sequence is represented by a string of numbers. but I want to covert them into a integer, since that string of numbers is quaternary, each digit can only be 0-3, unlike decimal, each digit can be 0-9. It's like a relationship of binary and decimal – lolibility Sep 20 '12 at 21:18
  • That packs the string, but then you have to unpack them to decimal, so that's only half of an answer. – ikegami Sep 20 '12 at 21:32
1

It is very simple to calculate a base-4 string to decimal by processing each digit in a loop

Note that, on 32-bit machines, you won't be able to represent a sequence longer than sixteen bases

This code shows the idea

use strict;
use warnings;

print seq2dec('ACGTACGTACGTACGT');

sub seq2dec{
  my ($sequence) = @_;
  my $n = 0;
  for (map {index 'ACGT', $_} split //, $sequence) {
    $n = $n * 4 + $_;
  }
  return $n;
}

output

454761243
Borodin
  • 126,100
  • 9
  • 70
  • 144
  • if my machine is 64 bit, then how long can I represent a sequence? – lolibility Sep 20 '12 at 21:12
  • on a 64-bit platform you can represent 32 bases, but you would need 64-bit Perl installed. If you keep this as a string then you can store a sequence of indefinite length, but then may as well keep it encoded as ACGT characters – Borodin Sep 20 '12 at 21:16
  • There is a problem here in that there is no way of keeping the length of the original sequence. `AAAAAAAAAA` is just zero in decimal, no matter how many bases there are in the sequence – Borodin Sep 20 '12 at 21:24
  • That's Ok for me, all my sequence of DNA of one program run is the same length, so, do not worry about the same integer representing of A,AA,AAA,or A...A – lolibility Sep 20 '12 at 21:27
  • What I care about is the speed of this, not sure if it's efficient if I use loop to do add up and power – lolibility Sep 20 '12 at 21:30
  • You shouldn't optimise speculatively. Write your program as clearly as possible and then tune it if you need better performance – Borodin Sep 20 '12 at 22:26
  • I would like to understand what you are going to do with these encoded sequences. Is there a good reason not to use a standard. `ACTG` string? – Borodin Sep 20 '12 at 22:31
  • I think is easier for cpu to compare difference of integers rather than characters, especially for billions of reads – lolibility Sep 21 '12 at 15:03
  • And what's more, a sequence of DNA with more characters need more space to store, but if you interpret into an integer only case a byte to store then – lolibility Sep 21 '12 at 15:16
  • String comparison is fast because processors usually a hardware string comparison operator. I have done some benchmarks, and it looks like comparing sixteen-character `ACGT` strings is only about 8% slower than comparing the corresponding 32-bit integers. As for the amount of space used, it is generally only useful to compress data if it allows you to fit it *all* into memory. I promise you, the best way is to write a clear, working program and then optimise it as necessary. It is a waste of time to solve imaginary problems, if only because you can't tell whether you've made a difference – Borodin Sep 21 '12 at 16:53
  • If you're looking for duplicate (identical matches), you'd probably use a hash anyway! – ikegami Sep 21 '12 at 17:13
  • @ikegami: I'm not sure whether *billions of reads* means billions of sequences, or 0.5n(n+1) == billions, but a hash of that size may be unmanageable. It looks more like a database problem to me, but either way I don't think packing into binary is a good solution, especially without a control case for comparison – Borodin Sep 21 '12 at 17:23
  • @Borodin, or a database or whatever, but it would be silly to do O(N^2) numerical or string comparisons!!! – ikegami Sep 21 '12 at 17:28
1

Base 4 requires exactly 2 bits, so it's easy to handle efficiently.

my $uvsize = length(pack('J>', 0)) * 8;
my %base4to2 = map { $_ => sprintf('%2b', $_) } 0..3;

sub base4to10 {
   my ($s) = @_;
   $s =~ s/(.)/$base4to2{$1}/sg;
   $s = substr(("0" x $uvsize) . $s, -$uvsize);
   return unpack('J>', pack('B*', $s));
}

This allows inputs of 16 digits on builds supporting 32-bit integers, and 32 digits on builds supporting 64-bit integers.

It's possible to support slightly larger numbers using floating points: 26 on builds with IEEE doubles, 56 on builds with IEEE quads. This would require a different implementation.

Larger than that would require a module such as Math::BigInt for Perl to store them.


Faster and simpler:

my %base4to16 = (
   '0' => '0',   '00' => '0',   '20' => '8',
   '1' => '1',   '01' => '1',   '21' => '9',
   '2' => '2',   '02' => '2',   '22' => 'A',
   '3' => '3',   '03' => '3',   '23' => 'B',
                 '10' => '4',   '30' => 'C',
                 '11' => '5',   '31' => 'D',
                 '12' => '6',   '32' => 'E',
                 '13' => '7',   '33' => 'F',
);

sub base4to10 {
   (my $s = $_[0]) =~ s/(..?)/$base4to16{$1}/sg;
   return hex($s);
}
ikegami
  • 367,544
  • 15
  • 269
  • 518
  • Note: This solution loses information on how many leading zero were present (but the OP said that was ok). – ikegami Sep 20 '12 at 22:12
  • 2
    Added a faster and simpler solution. – ikegami Sep 20 '12 at 22:36
  • It seems function hex($s) cannot deal with length greater than 16, right? – lolibility Sep 21 '12 at 15:43
  • @lolibility, `hex` can't deal with numbers larger that than your Perl can support, and my post already details what numbers your Perl can support. An hour before I posted my answer, I asked how big of a number you need to deal with, and you still haven't answered... Please answer. – ikegami Sep 21 '12 at 16:10
  • Sorry, didn't notice that, I probably going to deal with sequences consists of 32 nucleotides. And my perl and OS information "This is perl, v5.8.8 built for x86_64-linux-thread-multi" – lolibility Sep 21 '12 at 16:35
  • @lolibility, That means Perl is built for a 64-bit machine, but it doesn't say anything of the size of ints. `perl -V:uvsize` will tell you the size of your ints. If it's 8 (64 bits), hex should be able to deal with strings up to 16 digit, which means it should work with 32 base4 digits. – ikegami Sep 21 '12 at 17:12